DATA 301 Fall 2019

Project

Students will be able to select one of a small number of datasets. They will be asked to formulate their own questions and do initial work toward answering them. More information will be provided as the project gets closer.

Project Options

Text: Questions

Data and Overview

Structured: Sports

Data and Overview

Social Good: PBS Kids

Data and Overview

Teams

Teams can consist of 2-3 people. Each team will have a GitHub repo for their project. Your individual grade is based on your contributions visible within GitHub.

Deliverables

Progress (40%)

Exploratory Data Analysis (10%)

Exploratory data analysis (EDA) is very important for understanding and communicating a data science project. Without a decent understanding of your source material, it is difficult to troubleshoot problems downstream. You and your group must work together to push your understanding of the dataset. I am looking for an EDA from each of you, but they can share methods and approaches. Please make sure that each person's EDA is in a separate notebook (though you can share portions). You can also use code you find online, but you must cite it appropriately, AND you must extend it in a meaningful way. You’ll basically get 0 points for using someone else’s work without modification. You must provide explanations of your work. I should be able to read your notebook without getting a headache.

Preliminary results on main objective (10%)

Each dataset is backed by a Kaggle competition. The main objective is defined as the main Kaggle objective and the related evaluation metric. You do NOT have to submit to the competition, though you can if you want; however, you must evaluate yourself as if you were going to submit. For example, if they are going to measure your ability to predict something using the mean squared error, then I need you to evaluate yourself using the mean squared error. You must provide explanations of your work. I should be able to read your notebook without getting a headache. Like before, everyone must have their own notebook, but you may share and help the team.
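As a rough illustration of the mean squared error example, here is a minimal sketch in plain Python. The function name, the variable names, and the numbers are all made up for illustration and do not come from any of the project datasets. Your competition may use a different metric, in which case you would swap in that metric instead.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute MSE of empty inputs")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical held-out targets and model predictions (made-up numbers).
y_valid = [3.0, 5.0, 2.0, 7.0]
y_hat = [2.0, 5.0, 4.0, 8.0]

score = mean_squared_error(y_valid, y_hat)
print(f"validation MSE: {score:.3f}")
```

The point is to score predictions on data the model did not train on, using the same metric the competition uses. If you already use scikit-learn, `sklearn.metrics.mean_squared_error` computes the same quantity.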

Identification of a second objective (10%)

The team is expected to come up with a second objective that is significantly different from the main Kaggle-defined objective. Try to think about what others would care about in this dataset. Can you propose a new problem from the data? One notebook per group is fine here.

Preliminary results on second objective (10%)

Once you have selected a second objective, you must each attempt to achieve preliminary results towards that objective. It is fine to submit a single notebook for the second objective.

Final Project (55%)

Your GitHub repo is your final project report. It should include notebooks, scripts, etc. I will specifically look at and grade the four deliverables discussed in detail above (EDA, preliminary results on main objective, identification of second objective, preliminary results on second objective). Data science is very iterative, and I expect your final project submission to contain all the same major components as your in-progress project. I just expect the results, explanations, visualizations, etc. to be better!

Final Project Video (5%)

In addition to the project itself, please create a 3-5 minute video describing the project and your results. Only one submission per group is necessary. I will not make the videos public unless I obtain your permission, though I recommend you make them polished enough to make public. You have free rein to use the 3-5 minutes however you wish, but you must discuss the data, methods, and results in some manner. Please upload the video to YouTube and include a video_link.txt in the root directory of your repo. I will share these videos with your classmates, as it will be a great way to see how others are doing.

Due Dates

The due dates for the in-progress components are specified in the schedule. All submissions just need to be committed and pushed. The final project is due on the day of the final exam (December 10).