Modernizing Institutional Grant Data Analytics using Cloud Services


Published: May 27, 2024



Presenting data to demonstrate the effectiveness of student investments is critical. At the University of British Columbia, the Teaching and Learning Enhancement Fund (TLEF) is financed entirely from a portion of UBC Vancouver student tuition and is intended to enrich student learning by supporting innovative and effective educational enhancements. The Projects & Faculty Partnerships team, which provides oversight and management of the fund, wanted to transform TLEF data into a compelling narrative tool and improve the discoverability of TLEF projects, allowing both students and faculty to make connections between different initiatives.

Approach

To address this challenge, the UBC CIC developed a prototype that changes the way the data is ingested and transformed, helping the Projects and Faculty Partnerships team better measure the fund's reach and, in turn, share information in a more meaningful way. The solution leverages AWS technology to move data storage to the cloud and presents the data with advanced filtering and more dynamic views.

Screenshots of UI

Homepage, Project Summary, and Snapshot views.

Technical Details

This section provides an overview of the components and services we used to develop the Proof of Concept.

The solution consists of two main parts:

  1. The data processing / storage pipeline
  2. The web application and its staging view for administrators

The data processing / storage pipeline retrieves project data supplied by administrators, then preprocesses and cleans the contents for use by the web application. The web application consists of a public-facing UI where users can dynamically filter and search the data, and a staging UI that allows administrators to verify the contents of updated datasets and confirm changes.

Data Processing Pipeline

Cleaning Data

Raw datasets from external sources are uploaded and stored in Amazon S3. AWS Glue retrieves these contents and executes data preprocessing scripts to create structured Excel files for faculty engagement and project details. The project summaries in these files are then processed to generate embeddings and to identify similar projects based on those embeddings.
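
For illustration, the cleaning step might look like the following minimal sketch, assuming a Glue Python shell job that uses pandas; the bucket name, keys, and the specific cleaning rules are hypothetical.

```python
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "tlef-analytics"  # hypothetical bucket name


def clean_raw_dataset(raw_key: str) -> pd.DataFrame:
    """Load a raw Excel dataset from S3 and apply basic cleaning."""
    obj = s3.get_object(Bucket=BUCKET, Key=raw_key)
    df = pd.read_excel(io.BytesIO(obj["Body"].read()))  # requires openpyxl

    # Normalize column names and drop fully empty rows.
    df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]
    df = df.dropna(how="all")

    # Trim stray whitespace in text columns.
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()
    return df


def write_excel(df: pd.DataFrame, key: str) -> None:
    """Write a cleaned DataFrame back to S3 as a structured Excel file."""
    buf = io.BytesIO()
    df.to_excel(buf, index=False)
    s3.put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue())
```

The real job produces two outputs, project details and faculty engagement; the sketch shows only the shared load/clean/write shape.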

Generating New Grant IDs

This Glue job generates unique grant IDs that can be used to refer to a project when no project ID is available. It also ensures that all entries in the summary column are strings. After this processing is complete, the job automatically calls the next Glue job, which generates the similar projects dataset.
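
A minimal sketch of this ID-generation logic, assuming pandas; the column names (project_id, grant_id, summary) and the ID format are hypothetical.

```python
import uuid

import pandas as pd


def assign_grant_ids(df: pd.DataFrame) -> pd.DataFrame:
    """Fill in a unique grant ID wherever a project ID is missing."""
    # The job also guarantees every summary entry is a string.
    df["summary"] = df["summary"].fillna("").astype(str)

    # Mint a unique ID only for rows without a project ID.
    missing = df["project_id"].isna()
    df.loc[missing, "grant_id"] = [
        f"GRANT-{uuid.uuid4().hex[:8]}"  # hypothetical ID format
        for _ in range(int(missing.sum()))
    ]
    return df
```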

Generating Embeddings and Similar Projects

This Glue job processes and analyzes project summaries to generate embeddings, identifies similar projects based on these embeddings, and saves the output. The steps are outlined as follows (a minimal sketch follows the list):

  1. Iterate through each project, generate embeddings for the project summaries, and store these embeddings in a specified S3 bucket.
  2. Retrieve all existing embeddings from the specified S3 bucket and compile them into a local dictionary for further processing.
  3. Compute similarity scores between all project embeddings and identify the most similar projects for each project.
  4. Save the similar projects information back to S3 for future retrieval and analysis.
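
The write-up does not name the embedding model, so the sketch below assumes Amazon Bedrock's Titan text embeddings called through boto3; the bucket, keys, and sample summaries are hypothetical. It mirrors the four steps: embed each summary, collect the embeddings, score pairwise cosine similarity, and save the mapping.

```python
import json

import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime")
s3 = boto3.client("s3")
BUCKET = "tlef-analytics"  # hypothetical


def embed(text: str) -> np.ndarray:
    """Step 1: generate an embedding for one project summary."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",  # assumed model choice
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(resp["body"].read())["embedding"])


def top_similar(embeddings: dict, k: int = 5) -> dict:
    """Steps 2-3: cosine similarity between all projects, keep top-k each."""
    ids = list(embeddings)
    mat = np.stack([embeddings[i] for i in ids])
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)
    sims = mat @ mat.T
    np.fill_diagonal(sims, -1.0)  # a project should not match itself
    return {
        ids[r]: [ids[c] for c in np.argsort(sims[r])[::-1][:k]]
        for r in range(len(ids))
    }


# Step 4: persist the mapping for later retrieval (hypothetical data).
summaries = {"P001": "Open educational resources for first-year physics.",
             "P002": "Peer-assessment tooling for large cohorts."}
similar = top_similar({pid: embed(t) for pid, t in summaries.items()})
s3.put_object(Bucket=BUCKET, Key="staging/similar_projects.json",
              Body=json.dumps(similar).encode())
```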

Web Application

Users are able to search proposals by their title or principal investigator. Projects are also filterable by funding year, project type, faculty/college/unit, and focus area.

Individual summary pages can be accessed by clicking on the title of the proposal of interest within the data table. Users can view an overview of the project, similar projects, and project outcomes.

The Snapshot page generates a program summary and gives statistics and visualizations of the data. These include the success rate, number of grants and projects, funding awarded, student reach, and faculty and student engagement. Filters applied to the data are preserved when navigating between the Snapshot and home pages.

Admin Staging View

The staging website provides a controlled environment for reviewing changes to the data before they are made public. It displays the currently staged data and offers two additional features: confirming the new data changes (which publishes the data to the main page) and logging out of the staging view.
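
The confirm action amounts to promoting staged objects; here is a minimal sketch, assuming boto3 and a single hypothetical bucket with staging/ and production/ prefixes.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "tlef-analytics"  # hypothetical


def publish_staged_data() -> None:
    """Copy every staged object to /production, making it public-facing."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix="staging/"):
        for obj in page.get("Contents", []):
            dest = obj["Key"].replace("staging/", "production/", 1)
            s3.copy_object(
                Bucket=BUCKET,
                CopySource={"Bucket": BUCKET, "Key": obj["Key"]},
                Key=dest,
            )
```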

Architecture Diagram


Step-by-step

The following is a simplified walkthrough of the solution, describing how all of the components interact.

Data Preparation

  1. Administrators upload files (datasets, posters, reports) to appropriate locations in Amazon S3.
  2. Datasets will be automatically converted from Excel format to Parquet format, ready for ingestion (see the sketch after this list).
  3. Converted datasets will be transferred to the /staging folder, where all staged datasets are stored.
  4. External data will be retrieved by the first Glue job and go through the cleaning process, which generates two new files, “project_details” and “faculty_engagement”.
  5. The generated “project_details” and “faculty_engagement” files will then be reviewed/edited by administrators. Once the review is finished, the “faculty_engagement” file will be converted to Parquet and transferred to the /staging folder. “project_details”, on the other hand, requires additional operations: the file will be passed to the second Glue job to generate new IDs for any records that lack them. After this process, a file named “project_details_with_new_ids” will be generated and stored in the /staging folder in Parquet format.
  6. The product of the previous step will be reused in the third Glue job, where similarity scores between all projects will be computed and stored. At the end of the job, it generates a “similar_projects” file, which maps all projects and their similar projects.
  7. The “similar_projects” file will then be stored in the /staging folder.
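
The format conversion in step 2 could look like this sketch, assuming pandas with pyarrow inside the conversion job; the bucket and key names are hypothetical.

```python
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "tlef-analytics"  # hypothetical


def excel_to_parquet(excel_key: str) -> str:
    """Convert an uploaded Excel dataset to Parquet in the /staging folder."""
    obj = s3.get_object(Bucket=BUCKET, Key=excel_key)
    df = pd.read_excel(io.BytesIO(obj["Body"].read()))

    buf = io.BytesIO()
    df.to_parquet(buf, index=False)  # requires pyarrow

    staged_key = ("staging/"
                  + excel_key.rsplit("/", 1)[-1].replace(".xlsx", ".parquet"))
    s3.put_object(Bucket=BUCKET, Key=staged_key, Body=buf.getvalue())
    return staged_key
```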

Web Application

  1. Uploaded images will be accessed via Amazon CloudFront, which retrieves private resources (posters, reports) stored in Amazon S3 and distributes them to the user.
  2. Generated URLs to the images are then passed to the frontend, which displays image thumbnails and provides clickable links. Users can view and download these images.
  3. To preview staged data, administrators will need to access the /staging endpoint.
  4. The application prompts the administrator to log in. This feature is implemented with an Amazon Cognito user pool; administrators log in to the staging website using the username and password they created in the Amazon Cognito console.
  5. The React website hosted on AWS Amplify makes API calls to retrieve data from the backend. AWS WAF controls HTTP traffic between the website and the GraphQL API endpoint to protect backend resources.
  6. The GraphQL API, defined in AWS AppSync, processes allowed requests.
  7. AppSync communicates with resolvers running on AWS Lambda to retrieve data.
  8. AWS Lambda functions serve as GraphQL API resolvers. Using the boto3 Python library, they interact with Amazon Athena to query the data files stored in S3 (a minimal sketch follows this list).
  9. Amazon Athena, triggered by the Lambda functions, runs SQL queries to fetch the data. The query results are sent back to AWS Lambda and stored in the /result folder in S3.
  10. One of the defined GraphQL queries triggers a file transfer from /staging to /production to publish staged data into the production environment. Only authenticated users (administrators) can invoke it. The Lambda resolver that handles this query copies the contents of the /staging folder to /production.
  11. Public users access the web application to view data. This does not require any login step, and they can view/filter data and generate graphs to visualize the data.
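
Steps 8 and 9 can be illustrated with a minimal resolver sketch, assuming boto3's Athena client; the database name, table, and query are hypothetical, and the /result output location matches step 9.

```python
import time

import boto3

athena = boto3.client("athena")
DATABASE = "tlef"                       # hypothetical Athena database
OUTPUT = "s3://tlef-analytics/result/"  # query results land in /result

def handler(event, context):
    """GraphQL resolver: run an Athena query and return its rows."""
    query = "SELECT title, funding_year FROM project_details LIMIT 50"
    qid = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT},
    )["QueryExecutionId"]

    # Poll until the query finishes (a sketch; production code should bound this).
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(0.5)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    # The first row holds column headers; the rest are data values.
    return [[col.get("VarCharValue") for col in row["Data"]] for row in rows[1:]]
```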

Link to the solution prototype on GitHub: https://github.com/UBC-CIC/grant-program-analytics

Visuals

Four-step infographic describing the project:
1. Dynamic Search
2. Interact with Data
3. Explore Similar Projects
4. Summarize Projects

Demo Video

Acknowledgements

Photo by Prostock-Studio

About the University of British Columbia Cloud Innovation Centre (UBC CIC)

The UBC CIC is a public-private collaboration between UBC and Amazon Web Services (AWS). A CIC identifies digital transformation challenges, the problems or opportunities that matter to the community, and provides subject matter expertise and CIC leadership.

Using Amazon’s innovation methodology, dedicated UBC and AWS CIC staff work with students, staff and faculty, as well as community, government or not-for-profit organizations to define challenges, to engage with subject matter experts, to identify a solution, and to build a Proof of Concept (PoC). Through co-op and work-integrated learning, students also have an opportunity to learn new skills which they will later be able to apply in the workforce.