Projects | Data Science Club @ UVU

OVERVIEW

Data Science spans several fields, including Machine Learning, Data Engineering, Software Engineering, and Data Analytics. Employers therefore expect Data Scientists to understand the core concepts of each.

For our first collaborative project, we will build on Spotify's Million Playlist Dataset Challenge by creating a web app that lets users choose, create, or import a custom playlist and receive a list of suggested songs to add to it.

TEAMS

Because this is a collaborative project, we will split into the following teams:

Data Science & Software Engineering

Team Leader

Jacob Banuelos

Team Roles

Front-End Developer
Back-End Developer
Machine Learning Engineer
Unit Testing Specialist

Team Responsibilities

Build front-end UI
Build CI/CD pipeline
Deploy web app
Conduct unit testing
Build KNN or K-Means model to find similar songs (nearest neighbors or clusters)
Train and test KNN or K-Means model
Validate KNN or K-Means model

Data Engineering & Data Analytics

Team Leader

Chase Pattee

Team Roles

Database Administrator (DBA)
ELT Developer
Workflow Optimization Specialist
Query Optimization Specialist
Tableau Developer

Team Responsibilities

Build batch ELT pipeline
Deploy workflow orchestration tool
Build and deploy data storage tools
Create local DB for temporary data storage
Create back-end data warehouse
Create training and testing datasets
Create reporting dashboard for project metrics
Deploy Change Data Capture (CDC) techniques for data warehouse

TASKS

Now that we have specified what roles are necessary to complete this project, we will outline the individual tasks that need to be accomplished. To keep things organized, we will refer to the Data Science & Software Engineering Team as Team A and the Data Engineering & Data Analytics Team as Team B.

Phase 1: Data Warehouse Configuration

Create connection to Spotify API (Team A)
Create ingestion pipeline (Team B)
Create data model using dbt (Team B)
Design and implement data warehouse for data storage (Team B)
Deploy CDC features in data warehouse (Team B)
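
To make the CDC task more concrete, here is a minimal sketch of one possible approach: comparing two snapshots of a table and classifying each row as inserted, updated, or deleted. This is only an illustration of the idea, not the mechanism the project will necessarily use; the data warehouse we choose may provide its own change-tracking features. The table layout, the track_id key, and the function names are all assumptions made for this example.

```python
import hashlib
import json


def row_fingerprint(row: dict) -> str:
    """Return a stable hash of a row's contents (key order does not matter)."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def detect_changes(previous: list, current: list, key: str = "track_id") -> dict:
    """Compare two snapshots and classify rows as inserted, updated, or deleted."""
    prev_by_key = {row[key]: row_fingerprint(row) for row in previous}
    curr_by_key = {row[key]: row_fingerprint(row) for row in current}

    inserted = sorted(k for k in curr_by_key if k not in prev_by_key)
    deleted = sorted(k for k in prev_by_key if k not in curr_by_key)
    updated = sorted(
        k for k in curr_by_key
        if k in prev_by_key and curr_by_key[k] != prev_by_key[k]
    )
    return {"inserted": inserted, "updated": updated, "deleted": deleted}


# Hypothetical snapshots of a tracks table from two batch runs.
yesterday = [
    {"track_id": "t1", "name": "Song A", "popularity": 50},
    {"track_id": "t2", "name": "Song B", "popularity": 61},
    {"track_id": "t3", "name": "Song C", "popularity": 12},
]
today = [
    {"track_id": "t1", "name": "Song A", "popularity": 50},
    {"track_id": "t2", "name": "Song B", "popularity": 70},  # changed value
    {"track_id": "t4", "name": "Song D", "popularity": 33},  # new row
]

changes = detect_changes(yesterday, today)
print(changes)
```

Hashing whole rows keeps the comparison independent of how many columns the table has, at the cost of not reporting which column changed.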

Phase 2: ELT Procedures

Design and implement batch ELT pipeline (Team B)
Create and deploy Prefect workflows (Team B)
Create indexes and views (Team B)
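
The ELT pattern in this phase can be sketched in a few lines: load the raw batch first, then do the transformation inside the database using an index and a view. The example below uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; it is not the project's actual warehouse, orchestration, or modeling setup. The table, column, index, and view names are hypothetical.

```python
import sqlite3

# Hypothetical raw records, standing in for one batch pulled from the API.
raw_batch = [
    ("t1", "Song A", "Artist X", 0.80),
    ("t2", "Song B", "Artist Y", 0.35),
    ("t3", "Song C", "Artist X", 0.62),
]

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Load: land the batch unchanged in a raw table.
conn.execute(
    "CREATE TABLE raw_tracks ("
    "track_id TEXT PRIMARY KEY, name TEXT, artist TEXT, energy REAL)"
)
conn.executemany("INSERT INTO raw_tracks VALUES (?, ?, ?, ?)", raw_batch)

# Index the column we expect to filter and group on.
conn.execute("CREATE INDEX idx_raw_tracks_artist ON raw_tracks (artist)")

# Transform: a view computed inside the database from the raw table.
conn.execute(
    """
    CREATE VIEW artist_summary AS
    SELECT artist,
           COUNT(*) AS track_count,
           ROUND(AVG(energy), 2) AS avg_energy
    FROM raw_tracks
    GROUP BY artist
    """
)
conn.commit()

summary = conn.execute(
    "SELECT artist, track_count, avg_energy FROM artist_summary ORDER BY artist"
).fetchall()
print(summary)
```

Because the transformation lives in a view, reloading raw_tracks with a new batch updates the summary without rerunning any Python logic.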

Phase 3: Machine Learning Model Development

Build KNN or K-Means model (Team A)
Create training dataset (Team B)
Train KNN or K-Means model (Team A)
Identify KPIs used in reporting dashboard (Team B)
Create testing dataset (Team B)
Test and validate KNN or K-Means model (Team A)
Build front-end UI (Team A)
Build CI/CD pipeline (Team A)
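
As a rough illustration of the KNN option, the sketch below performs a plain nearest-neighbor lookup: it averages a playlist's feature vectors into one point and returns the closest tracks not already in the playlist. It is written in pure Python for readability and is not the model Team A will build. The catalog, the three feature columns (danceability, energy, valence), and the function names are assumptions for this example only.

```python
import math

# Hypothetical catalog: track name -> (danceability, energy, valence), each in [0, 1].
catalog = {
    "Song A": (0.90, 0.80, 0.70),
    "Song B": (0.20, 0.30, 0.10),
    "Song C": (0.85, 0.75, 0.65),
    "Song D": (0.10, 0.20, 0.30),
    "Song E": (0.80, 0.90, 0.60),
}


def centroid(vectors):
    """Average the playlist's feature vectors into a single point."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def recommend(playlist, k=2):
    """Return the k catalog tracks nearest to the playlist centroid.

    Tracks already in the playlist are excluded from the results.
    """
    target = centroid([catalog[name] for name in playlist])
    candidates = [name for name in catalog if name not in playlist]
    # Euclidean distance; smaller means more similar.
    candidates.sort(key=lambda name: math.dist(catalog[name], target))
    return candidates[:k]


suggestions = recommend(["Song A"], k=2)
print(suggestions)
```

A K-Means version would instead assign every track to a cluster ahead of time and suggest tracks from the clusters that the playlist's songs fall into.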

Phase 4: Deploy and Test Web App

Conduct unit testing (Team A)
Deploy web app (Team A)
Create reporting dashboard (Team B)
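
For the unit testing task, a test can be as small as a function whose name starts with test_ and whose body contains plain assert statements, which is the convention pytest discovers by default. The sketch below tests a hypothetical helper, normalize_playlist, that cleans up an imported playlist; neither the helper nor its behavior comes from the project itself.

```python
def normalize_playlist(track_ids):
    """Strip whitespace, drop empty entries, and remove duplicates, keeping order."""
    seen = set()
    cleaned = []
    for raw in track_ids:
        track_id = raw.strip()
        if track_id and track_id not in seen:
            seen.add(track_id)
            cleaned.append(track_id)
    return cleaned


def test_removes_duplicates_and_keeps_order():
    assert normalize_playlist(["t1", "t2", "t1", "t3"]) == ["t1", "t2", "t3"]


def test_strips_whitespace_and_drops_blanks():
    assert normalize_playlist([" t1 ", "", "   ", "t2"]) == ["t1", "t2"]


def test_empty_playlist_stays_empty():
    assert normalize_playlist([]) == []


if __name__ == "__main__":
    # A test runner would normally call these; running the file directly also works.
    test_removes_duplicates_and_keeps_order()
    test_strips_whitespace_and_drops_blanks()
    test_empty_playlist_stays_empty()
    print("all tests passed")
```

Keeping each test focused on one behavior makes a failure in the CI/CD pipeline easier to trace back to its cause.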

TOOLS

To accomplish the tasks outlined above, we will be using the following tools (in order of appearance):

Spotipy (Python library)
Prefect
Snowflake*
dbt
Pandas (Python library)
PyTorch (Python library)*
Tableau
Streamlit (Python UI framework)
GitHub Actions
pytest (Python library)*

* subject to change