# Master Fundamentals of Data Science

**Purdue’s Data Science modules are a series of five-week, 100% online, 1.5 continuing education unit (CEU) courses** that cover foundational data science topics and prepare students to master data science-oriented research and professional skills.

## Connect Your Career Goals to Data

Purdue’s non-credit Data Science courses will teach you how to use quantitative and analytical methods to derive insights from big data in a professional context. These modules are an ideal jumping-off point for data-driven professionals looking to expand their data science skills, and for professionals who want to start a career in data. You can complete all of the courses or choose individual courses that align with your career goals.

By taking our Data Science modules, you will learn foundational data science skills including:

- Python
- SQL
- Big Data
- Machine Learning
- Java

### Job Titles

The McKinsey Global Institute estimates that the United States could have as many as 250,000 open data science jobs by 2024. Data science skills are essential to jobs such as:

- Data Scientist
- Machine Learning Specialist
- Data Engineer
- Software Engineer
- Data Analyst
- Data Architect
- Business Intelligence Analyst

### Courses

This course provides an overview of data science methods used for data-driven discovery, extraction of knowledge, and informed decision making. Specifically, this short module will focus on computational methods and statistical techniques to reason about uncertainty, conduct hypothesis tests, infer causal relationships, and apply predictive models. This course will also discuss how sampling biases can impact fairness in decision making, and provide hands-on experience in how to make successful inferences from data in an increasingly data-rich and data-driven world.

**Course Outcomes:**

- Identify effects of sampling, fairness, and biases on claims made from data.
- Create and execute scripts in Python to conduct multiple hypothesis tests and apply predictive models.
- Calculate uncertainty estimates, visualize, and evaluate significance for small and medium scale data sets.
- Recognize causal vs. correlational claims and potential for error due to multiple comparisons.
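As a minimal sketch of why multiple comparisons raise the potential for error, the Python example below applies a Bonferroni correction to a handful of p-values. The correction is one common technique chosen here for illustration, not necessarily the one taught in the course, and the p-values and the `bonferroni` helper are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (threshold, flags); flags[i] is True if test i stays significant."""
    if not p_values:
        raise ValueError("need at least one p-value")
    m = len(p_values)
    # Each test must clear the family-wise level divided by the number of tests.
    threshold = alpha / m
    return threshold, [p <= threshold for p in p_values]


# Hypothetical p-values from five separate tests (illustrative numbers only).
p_values = [0.001, 0.012, 0.030, 0.047, 0.200]

naive = [p <= 0.05 for p in p_values]        # each test judged on its own
threshold, corrected = bonferroni(p_values)  # judged as a family of five tests

print(sum(naive), "tests look significant without correction")
print(sum(corrected), "tests remain significant after correction")
```

Judged one at a time, several of these tests pass the 0.05 level; once the threshold is tightened to account for running five tests, far fewer do.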

**Prerequisites:**

- Linear Algebra
- Probability and Statistics
- Python Programming

This course introduces students to the fundamentals of database management systems (DBMS) from a user's perspective. The principles of modeling an enterprise using Entity-Relationship diagrams and transforming the model into a relational or NoSQL database are illustrated through a range of examples. The SQL language is used to create, query, aggregate, and update a relational database. NoSQL databases and the related data models (column, graph, and document-based) are introduced.
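To give a flavor of the SQL operations named above (create, query, aggregate, update), here is a small sketch using Python's built-in `sqlite3` module. The `orders` table, its columns, and the sample rows are hypothetical and are not taken from the course materials.

```python
import sqlite3

# An in-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create: define the structure of a hypothetical table.
cur.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
)

# Populate: insert a few illustrative rows.
cur.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.0), ("alice", 15.0)],
)

# Update: change the rows that match a condition.
cur.execute("UPDATE orders SET amount = 40.0 WHERE customer = ?", ("bob",))

# Query and aggregate: total amount per customer.
totals = cur.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()

conn.commit()
conn.close()
print(totals)
```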

**Course Outcomes:**

- Carry out database design steps from conceptual to logical to physical design.
- Use SQL commands to define the structure of a relational database, populate, update and delete data in the database, retrieve data having specified characteristics, and specify access control.
- Explain the differences, advantages, and disadvantages of relational and NoSQL databases.
- Describe features of relational databases not needed in big data applications.
- Create a document-based NoSQL database such as MongoDB, and move data from a SQL database to a NoSQL database.
- Understand how creating index structures can help or hurt query performance in relational and NoSQL databases.
- Explain the difference between hash indices and B-tree indices.
- Analyze large data sets created from piecing together multiple data files through the application of SQL queries.

**Prerequisites:**

- Data Engineering I (CS 50023)
- Linear Algebra
- Probability and Statistics
- Python Programming

This course covers four major topics: core concepts in logic and discrete structures, basic tools for design and analysis of algorithms, data structures for important operations, and algorithmic paradigms for a diverse set of problems. The first part of the course introduces propositional and predicate logic, sets, functions, and operations on these structures. The second part of the course introduces techniques for proofs and analyses, including asymptotic analysis of algorithms. The third part of the course motivates data structures, ways of organizing and storing data, for efficiently performing operations on structures such as graphs and trees.

**Course Outcomes:**

- Students will be able to formulate problems in propositional and predicate logic.
- Students will be able to use discrete structures such as sets, functions, and relations, to model problems.
- Students will be able to use proof techniques for establishing the correctness of statements and programs.
- Students will be able to analyze the runtime of algorithms using asymptotic analysis.
- Students will be able to use data structures to efficiently organize data for various operations.
- Students will be able to apply algorithmic paradigms to solve various problems, reason about their optimality, and characterize their runtime.

**Prerequisites:**

- Familiarity with Calculus

This course provides an overview of data science methods used for data-driven discovery, extraction of knowledge, and informed decision making. Specifically, this short module will focus on computational methods and statistical techniques to reason about uncertainty, conduct hypothesis tests, infer causal relationships, and apply predictive models. This course will also discuss how sampling biases can impact fairness in decision making, and provide hands-on experience in how to make successful inferences from data in an increasingly data-rich and data-driven world.

**Course Outcomes:**

- Identify effects of sampling, fairness, and biases on claims made from data.
- Create and execute scripts in Python to conduct multiple hypothesis tests and apply predictive models.
- Calculate uncertainty estimates, visualize, and evaluate significance for small and medium scale data sets.
- Recognize causal vs. correlational claims and potential for error due to multiple comparisons.

**Prerequisites:**

- Working knowledge of Python.
- Data Engineering I (DSC Module: CS 50023)
- Probability and Statistics (DSC Module: STAT 59800)
- Foundations of Computer Science (DSC Module: CS 59000 FCS)
- Linear Algebra for Data Science (DSC Module: MA 59800)

This is the second of two courses. This course will explain how to use R (a free, open-source program) for statistical analysis, computer-generated descriptive statistics, and the probability necessary for understanding inferential statistics. This course will also cover a derivation of likelihood methods.

**Course Outcomes:**

- Modify existing computer code to analyze data, calculate probabilities and percentiles, and simulate statistical concepts.
- Analyze specific probability problems, including the methods involved, using both hand and computer calculations.
- Analyze data to generate an appropriate point estimate using likelihood functions.
- Determine confidence intervals based on the appropriate likelihood functions.

**Prerequisites:**

- Calculus III
- Programming experience

This course discusses topics including modeling of data; applications and methods of linear systems and eigenvalues on networks; massive matrix methods for data analysis, including singular value decomposition, principal components, and regression; and numerical optimization, including linear programming and data.

**Course Outcomes:**

- Use the programming language Julia to manipulate and create numerical computational algorithms, methods, and routines in the context of data science problems (throughout the whole course).
- Identify the role numerical computing plays in Data Science through applications, models, and examples. (Unit 1)
- Discriminate between floating-point arithmetic and “exact/discrete” computations. (Unit 2)
- Recognize common numerical algorithms, including the power method, singular value decomposition, and randomized matrix computations, in common data science methods including PageRank, spectral clustering, semi-supervised learning, principal components analysis, discriminant analysis, and support vector machines. (Units 3 & 4)
- Recognize the relationship between graph data and numerical computing via the adjacency matrix. (Unit 4)
- Identify key types of optimization problems including linear programs, convex and quadratic problems, and also non-convex problems that arise in non-negative matrix factorization. (Unit 5)
- Apply optimization algorithms including Newton’s method and alternating algorithms, as well as software tools to solve optimization problems. (Unit 5)
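As a rough illustration of how the power method connects graph data to numerical computing through the adjacency matrix, the sketch below estimates the dominant eigenvalue of a small graph. It is written in plain Python rather than Julia, the language the course uses, and the `power_method` helper and the three-node example graph are hypothetical.

```python
import math


def power_method(matrix, iterations=100):
    """Estimate the dominant eigenvalue and eigenvector of a square matrix.

    Assumes the matrix has a single eigenvalue of largest magnitude and that
    the starting vector is not orthogonal to its eigenvector.
    """
    n = len(matrix)
    x = [1.0] + [0.0] * (n - 1)  # simple starting vector
    for _ in range(iterations):
        # Multiply by the matrix, then rescale to unit length.
        y = [sum(row[j] * x[j] for j in range(n)) for row in matrix]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    # The Rayleigh quotient x^T A x gives the eigenvalue estimate (x has unit length).
    ax = [sum(row[j] * x[j] for j in range(n)) for row in matrix]
    eigenvalue = sum(x[i] * ax[i] for i in range(n))
    return eigenvalue, x


# Adjacency matrix of a triangle: three nodes, each linked to the other two.
adjacency = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]

eigenvalue, vector = power_method(adjacency)
print(round(eigenvalue, 6), [round(v, 6) for v in vector])
```

For this graph the exact dominant eigenvalue is 2, with all nodes weighted equally in the eigenvector, which matches the intuition that every node in a triangle is equally central.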

**Prerequisites:**

- Probability and Statistics (STAT 5900PS)
- Linear Algebra for Data Science (MA 59800)
- Programming experience in an object-oriented language (Java, C++) and basic understanding of common data structures

**Course Outcomes:**

- Students will learn the meaning and significance of eigenvalues and eigenvectors.
- Students will learn how to find eigenvalues and eigenvectors by hand for small matrices, and how to use MATLAB to find them for large matrices.
- Students will learn the QR decomposition and its mathematical context and meaning, and will use it to solve least-squares problems.
- Students will learn about orthogonal matrices and their geometrical properties, and about unitary matrices.
- Students will learn the spectral theorem.

The course first introduces a conceptual framework for understanding professional and ethical responsibility, then focuses on applying this framework through repeated practice with case studies. The capstone of the course is an original case study analysis focusing on a case from each student's own area of research.

**Course Outcomes:**

- Identify ethical issues associated with applications of data science in a variety of professional settings.
- Assess and critique the actions of individuals, corporations, governments and other organizations as ethical or unethical.
- Apply general ethics principles to the specific, concrete actions of individuals, corporations, governments and other organizations.
- Formulate sound, well-reasoned arguments, and communicate them clearly, by writing reports implementing the case study procedure developed in the course.
- Generate a case study of their own, by submitting a final case study report implementing the case study procedure developed in the course.

**Prerequisites:** None

#### At a Glance

**Modality:** 100% online

**Access:** For the duration of each course

**To find the next available course:** visit the course schedule webpage.

**Weekly commitment:** Approximately 3-5 study hours per week

**Who should enroll:** This program is open to all students with a STEM background but is specifically designed for technical professionals, working engineers, and all professionals who use or aspire to use data science in their jobs.

**For more information:** contact noncredit@purdue.edu.