Introduction to Data Science: DATA 101

Syllabus

Official Course Description

Introduction to the use of computer based tools for the analysis of large data sets for the purpose of knowledge discovery. Students will learn to understand the Data Science process and the difference between deductive hypothesis-driven and inductive data-driven modeling. Students will have hands-on experience with various on-line analytical processing and data mining software and complete a project using real data.

Required Text

There is no required textbook to purchase for this course, but there will be a lot of required reading, including online tutorials and guides. As a reference, I recommend An Introduction to R: https://cran.r-project.org/doc/manuals/R-intro.pdf. We will also be using the book R for Data Science.

Our Plan

Every time I teach DATA 101, I search online for a suitable real-world competition for students to compete in. We pair this with an internal competition judged by local judges from industry. It is a lot of fun. Ideally, the competition would line up with the end of the semester, but alas, this almost never happens. This time, the competition I want us to participate in starts September 1st (https://goo.gl/Nptzgb). Students from this course have actually won this competition before and were flown to SF to present their work, not to mention the cash prize (note: winning has nothing to do with your grade).

Now, I know that putting the competition at the beginning of the semester means you won’t have a lot of experience to pull from, but there is no better way to focus our energies on a real data science task than a competition. Also, I have no way of knowing ahead of time what the competition will entail, so I can’t prepare you with specific methods in any case. So what does this mean for the execution of the course? Here is a breakdown of our general plan:

  1. The competition will run from Sept 1st to Oct 15th. Everyone will be sorted into teams based on major, and every team is required to submit a solution.
  2. Since I don’t know what the competition content is ahead of time, I can’t prepare traditional lectures. Instead, I will work outside of the class to provide supplemental code and tutorials that will aid you in the competition. I might spend a few minutes in class discussing these from time to time, but the majority of class is dedicated to working in your groups. I will float between the teams helping out.
  3. You won’t be graded on how well you do in the competition, but you will be graded on the progress you make every week. A big part of being a data scientist is communicating your methods, progress, and results to various stakeholders. To this end, each group will need to submit a report each week. We will be using RMarkdown for this purpose (http://rmarkdown.rstudio.com/).
  4. After the projects are submitted to the official competition, we will hold our own judging at the end of the semester with judges from local companies.

After the competition is over, we will revert to a more standard lecture+lab course setup, with targeted data science lessons on the concepts discussed in class. The topics will vary from week to week. They are designed to build up your skills to accomplish the next two tasks. Each week will have a consistent schedule. The first class will be traditional lecture style with a heavy emphasis on interactive discussion, where I will go over the theory behind the algorithms and concepts. The second class will be mostly lab style. My goal is to be your guide as you gain experience being a data scientist. It is during this second class and outside the classroom that you will gain additional practical experience as a data scientist.

If you have a laptop, please bring it to each class.

Course Details

Contact Information

  • Professor: Dr. Paul Anderson
  • Office: 313 HWEA
  • Office Hours: My preferred method of e-contact is the Facebook group, where I can respond to questions quickly and everyone can benefit from the answers. If you would rather use e-mail, I will endeavor to respond within 48 hours.
  • E-mail: andersonpe2@cofc.edu
  • Office Phone: 843-953-8151 (I never pick this up, but it does exist :)
  • Section 01 - TR: 3:35 pm - 4:50 pm in HWWE 211
  • Section 02 - TR: 2:10 pm - 3:25 pm in HWWE 211

Course (learning) outcomes

  • To gain an overview of the field of knowledge discovery
  • To be able to distinguish and translate between data, information, and knowledge
  • To learn how to store, query, and aggregate data in databases
  • To be able to distinguish problems based on computability
  • To learn how to implement distributed computing and storage
  • To apply algorithms for inductive and deductive reasoning
  • To learn introductory and state-of-the-art data mining algorithms
  • To apply data mining, statistical inference, and machine learning algorithms to a variety of datasets, including text, image, biological, and health
  • To apply information filtering on real world datasets
  • To apply information validation on real world datasets
  • To apply artificial intelligence concepts to real world datasets
  • To understand the social, ethical, and legal issues of informatics and data science

Grading Policy

  • Project/Competition Reports - 40%
  • Exam - 20%
  • Homework - 10%
  • Programming Assignments - 30%

Grading Scale: A: 90-100; B: 80-89; C: 70-79; F: <70. Plusses will be used at the discretion of the instructor.
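As a rough illustration of how the weights and scale above combine, here is a minimal Python sketch. It is not the official grade calculation: the names WEIGHTS, course_score, and letter_grade are hypothetical, the example scores are made up, and it assumes each component is scored from 0 to 100 with cutoffs at exactly 90, 80, and 70 and no rounding. Plusses are left out because they are at the instructor's discretion.

```python
# Weights taken from the Grading Policy list above (they sum to 100%).
WEIGHTS = {
    "reports": 0.40,      # Project/Competition Reports
    "exam": 0.20,         # Exam
    "homework": 0.10,     # Homework
    "programming": 0.30,  # Programming Assignments
}


def course_score(scores):
    """Return the weighted average of component scores (each 0-100)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(score):
    """Map a 0-100 score to the A/B/C/F scale (assumes no rounding)."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    return "F"


# Made-up component scores, for illustration only.
example = {"reports": 92, "exam": 85, "homework": 100, "programming": 88}
total = course_score(example)
print(round(total, 1), letter_grade(total))
```

With these example scores, the reports contribute 36.8 points, the exam 17, homework 10, and programming assignments 26.4, for a total of about 90.2.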

Grading Guidelines: Submitted work requires Analysis, Evaluation, and Creation of ideas, concepts, and materials into various deliverables (e.g., see revised Bloom’s Taxonomy and reference below).

  • The grade of A is for work that involves high-quality achievement in all three Bloom areas.
  • The grade of B is for work that involves high-quality achievement in at least two Bloom areas, and medium-level achievement in the other.
  • The grade of C is for work that involves high-quality achievement in at least one Bloom area, and medium-level achievement in the others.
  • The grade of F is for work that does not meet above criteria.

Reference: Errol Thompson, Andrew Luxton-Reilly, Jacqueline L. Whalley, Minjie Hu, and Phil Robbins. 2008. Bloom’s taxonomy for CS assessment. In Proceedings of the tenth conference on Australasian computing education - Volume 78 (ACE ’08), Simon Hamilton and Margaret Hamilton (Eds.), Vol. 78. Australian Computer Society, Inc., Darlinghurst, Australia, 155-161.

Feedback will be given as quickly as possible with a goal of within a week of the assignment due date.

Homework Policy

Written homework must be placed under my office door by 5 PM on the due date. No late homework will be accepted. Cheating/sharing will result in a zero on the assignment and a report to the judicial board.

Programming Assignments

Most programming assignments will be submitted through the Learn2Mine environment. There will be a combination of in-class lab assignments and out-of-class programming assignments.

Honor Code

Lying, cheating, attempted cheating, and plagiarism are violations of our Honor Code that, when identified, are investigated. Each instance is examined to determine the degree of deception involved.

Incidents where the professor believes the student’s actions are clearly related more to ignorance, miscommunication, or uncertainty can be addressed by consultation with the student. We will craft a written resolution designed to help prevent the student from repeating the error in the future. The resolution, submitted by form and signed by both the professor and the student, is forwarded to the Dean of Students and remains on file.

Cases of suspected academic dishonesty will be reported directly to the Dean of Students. A student found responsible for academic dishonesty will receive an XF in the course, indicating failure of the course due to academic dishonesty. This grade will appear on the student’s transcript for two years, after which the student may petition for the X to be expunged. The student may also be placed on disciplinary probation, suspended (temporary removal), or expelled (permanent removal) from the College by the Honor Board.

It is important for students to remember that unauthorized collaboration (working together without permission) is a form of cheating. Unless a professor specifies that students can work together on an assignment and/or test, no collaboration is permitted. Other forms of cheating include possessing or using an unauthorized study aid (such as a PDA), copying from another’s exam, fabricating data, and giving unauthorized assistance.

Remember, research conducted and/or papers written for other classes cannot be used in whole or in part for any assignment in this class without obtaining prior permission from the professor.

Students can find a complete version of the Honor Code and all related processes in the Student Handbook at http://www.cofc.edu/studentaffairs/general_info/studenthandbook.html.

Disability Accommodations

Any student who feels he or she may need an accommodation based on the impact of a disability should contact me individually to discuss his or her specific needs. Also, please contact the College of Charleston Center for Disability Services (http://www.cofc.edu/~cds/) for additional help.

Late Policy

No late days will be allowed.