# Data Science Bootcamp Curriculum

## Curriculum Highlights

###### Best Practices With a Business-First Approach

###### R, Python, and Advanced Cloud-Based Tools

###### Interaction With Practitioners and Experts

###### Additional Tutorials and Example Exercises

## Curriculum Overview

###### Introduction to R Programming

##### Topics covered

- R basics
- R data types
- R language features
- R visualization
- Recommended R packages

###### Fundamentals of Data Mining

##### Topics covered

- Data attribute types
- Data pre-processing
- Similarity measures
- Data exploration

###### Introduction to Azure Machine Learning

##### Topics covered

- Azure ML basics
- Azure ML preprocessing
- Azure ML visualization

###### Introduction to Big Data, Data Science, and Predictive Analytics

##### Topics covered

- Big Data
- ETL Pipelines
- Data Mining
- Predictive Analytics

###### Importance of 'Data' in Data Science

##### Topics covered

- Understanding how and why “data beats algorithms”
- Importance of data cleaning, data pre-processing, and business domain knowledge

###### Data Exploration and Visualization

##### Topics covered

- Various data visualization and exploration techniques and packages
- Interpreting boxplots
- Histograms
- Density plots
- Scatterplots
- Segmentation and Simpson’s paradox

###### Feature Engineering

##### Topics covered

- Calculating features from numeric features
- Binning
- Grouping
- Quantizing
- Ratios and mathematical transforms for features in different applications

###### Storytelling with Data

##### Topics covered

- Understanding that the goal of data visualization is communicating insights
- Interactive discussion on various interpretations of plots
- Learning how to identify data visualizations most appropriate to answer business questions

###### Predictive Modeling for Real World Problems

##### Topics covered

- Face detection
- Adversarial machine learning
- Spam detection
- Translating a real-world problem to a machine learning problem

###### Supervised Learning and Classification

Supervised learning is about learning from historical data. We will examine some of the key assumptions in predictive modeling and discuss the scenarios in which the distribution of future data no longer matches that of the historical data.

##### Topics covered

- Supervised learning vs. unsupervised learning
- Features
- Predictors
- Labels
- Target values
- Training
- Testing
- Evaluation

###### Decision Tree Classification

We begin building predictive models by studying decision tree classification in depth. We start with how nodes are split in a decision tree and with impurity measures such as entropy and the Gini index. We then look at how to vary the complexity of a decision tree by changing parameters such as the maximum depth, the number of observations on a leaf node, and the complexity parameter.

##### Topics covered

- Decision tree learning
- Impurity measures: Entropy and Gini index
- Varying decision tree complexity by varying model parameters
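
As an illustration of the impurity measures named above, here is a minimal sketch in plain Python. It is not taken from the course materials; the function names (`entropy`, `gini`, `weighted_impurity`) and the toy labels are assumptions made for this example. A tree learner would typically prefer the split with the lowest weighted impurity.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def gini(labels):
    """Gini index: 1 - sum(p^2) over the classes present."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def weighted_impurity(left, right, measure):
    """Impurity of a split: child impurities weighted by child size."""
    total = len(left) + len(right)
    return (len(left) / total) * measure(left) + (len(right) / total) * measure(right)


# Hypothetical node with three "yes" and three "no" labels, and one candidate split.
parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes", "no"], ["no", "no"]

print("parent entropy:", entropy(parent))
print("parent Gini:", gini(parent))
print("Gini after split:", weighted_impurity(left, right, gini))
```

Both measures are largest when the classes are evenly mixed and zero when a node is pure, which is why a split that lowers the weighted impurity is considered an improvement.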

###### Building and Evaluating a Classification Model

We will build a classification model using decision tree learning. We will learn how to create train/test datasets, train the model, evaluate it, and vary its hyperparameters.

##### Topics covered

- Train/test split
- Training, prediction, and evaluation
- Varying model hyperparameters such as maximum depth, number of observations on leaf nodes, and minimum number of observations for splitting
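
The workflow of splitting, training, and evaluating can be sketched as follows. This is an illustrative example rather than the course's lab code: to stay short it fits a one-split decision stump (a depth-1 tree) instead of a full decision tree, and the names `train_test_split`, `fit_stump`, `accuracy`, and `min_leaf`, as well as the synthetic data, are assumptions.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into train and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_stump(train, min_leaf=1):
    """Choose the threshold on one feature that maximizes training accuracy.

    Each row is (feature_value, label) with label 0 or 1. The stump predicts 1
    when the feature is above the threshold. min_leaf is the minimum number of
    training rows required on each side of the split.
    """
    values = sorted({x for x, _ in train})
    best_threshold, best_correct = None, -1
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        n_left = sum(1 for x, _ in train if x <= threshold)
        if n_left < min_leaf or len(train) - n_left < min_leaf:
            continue  # split would leave too few rows on one side
        correct = sum(1 for x, y in train if (x > threshold) == (y == 1))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def accuracy(threshold, rows):
    """Fraction of rows whose label matches the stump's prediction."""
    return sum(1 for x, y in rows if (x > threshold) == (y == 1)) / len(rows)


# Synthetic data: the label is 1 exactly when the feature exceeds 5.0.
rows = [(i / 2, int(i / 2 > 5.0)) for i in range(20)]
train, test = train_test_split(rows)
threshold = fit_stump(train)

print("threshold:", threshold)
print("train accuracy:", accuracy(threshold, train))
print("test accuracy:", accuracy(threshold, test))
```

The test rows are never seen during fitting, so the test accuracy is the fairer estimate of how the model behaves on new data.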

###### Evaluation Metrics for Classification Models

Once we have understood how to build a predictive model, we will discuss the importance of defining the correct evaluation metrics. We will use real-world anecdotes to examine the circumstances under which one metric is a better choice than another.

##### Topics covered

- Confusion matrix
- False/true positives and false/true negatives
- Accuracy
- Precision
- Recall
- F1-score
- ROC curve and area under the ROC curve
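
To make the relationship between these metrics concrete, here is a small sketch that derives accuracy, precision, recall, and F1-score from the confusion-matrix counts. The function names and the toy labels are assumptions chosen for illustration; the ROC curve is left out to keep the example short.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (true positives, false positives, true negatives, false negatives)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(actual, predicted, positive=1):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(actual),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Imbalanced toy example: 3 positives out of 10, and the model finds only one.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

metrics = classification_metrics(actual, predicted)
print(metrics)
```

In this toy case the accuracy looks respectable while recall is low, which is the kind of situation where accuracy alone can be a misleading metric.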

###### Generalization and Overfitting

Building a model that generalizes well requires a solid understanding of the fundamentals. We will clarify what generalization and overfitting mean. We will also discuss bias and variance and how the complexity of a model affects both.

##### Topics covered

- Generalization
- Overfitting
- Bias and variance
- Repeatability
- Bootstrap sampling

###### Tuning of Model Hyperparameters

How do we build a model that generalizes well and does not overfit? The answer is to adjust the complexity of the machine learning model to the right level. This process, known as hyperparameter tuning, is one of the most important skills you will learn. Using the decision tree learning parameters as an example, we will observe how a model changes when we grow a deeper or a shallower tree. We will do practical hyperparameter tuning exercises using cross-validation.

##### Topics covered

- Model complexity
- Bias and variance
- K-fold cross-validation
- Leave-one-out cross-validation
- Time series cross-validation
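
The three cross-validation schemes listed above differ mainly in how row indices are divided. The sketch below generates those index splits in plain Python; the function names are hypothetical, and the expanding-window scheme shown for time series is one common variant, assumed here for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def time_series_splits(n_samples, n_splits):
    """Yield expanding-window splits where validation always follows training.

    Rows left over after the last full block are not used.
    """
    block = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        yield list(range(0, i * block)), list(range(i * block, (i + 1) * block))


for train, validation in k_fold_indices(10, 3):
    print("k-fold:", train, validation)

# Leave-one-out is k-fold with k equal to the number of samples.
for train, validation in k_fold_indices(4, 4):
    print("leave-one-out:", train, validation)

for train, validation in time_series_splits(10, 4):
    print("time series:", train, validation)
```

For hyperparameter tuning, each candidate setting would be trained on the training indices and scored on the validation indices of every split, and the average score would be compared across settings. In practice the rows are usually shuffled before k-fold splitting unless their order carries meaning.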

###### Bagging

##### Topics covered

- Binomial distribution
- Review of bias/variance
- Overfitting and generalization
- Sampling with/without replacement
- Bootstrapped sampling
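
Bootstrapped sampling, the building block of bagging, can be sketched in a few lines. This is an illustrative example with hypothetical helper names (`bootstrap_sample`, `majority_vote`); it shows sampling with replacement, the rows left out of the sample ("out-of-bag" rows), and the vote that aggregates predictions from several models.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) rows with replacement; also return the rows never drawn."""
    n = len(data)
    chosen = [rng.randrange(n) for _ in range(n)]
    in_bag = set(chosen)
    sample = [data[i] for i in chosen]
    out_of_bag = [data[i] for i in range(n) if i not in in_bag]
    return sample, out_of_bag


def majority_vote(predictions):
    """Aggregate class predictions from several models by taking the most common."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
data = list(range(1000))
sample, out_of_bag = bootstrap_sample(data, rng)

print("sample size:", len(sample))
print("out-of-bag fraction:", len(out_of_bag) / len(data))
print("vote:", majority_vote(["spam", "ham", "spam"]))
```

For large datasets the out-of-bag fraction tends toward 1/e, roughly 37%, so each model in a bagged ensemble sees a noticeably different subset of the data.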

###### Random Forest

##### Topics covered

- A quick review of decision tree splits
- Column randomization trick and why it is helpful in building more generalized models

###### Random Forest Hyperparameter Tuning

##### Topics covered

- Tuning parameters such as depth, number of trees, and number of random features selected
- Using R/Python libraries and Azure ML Studio to tune a model

###### Boosting Introduction

##### Topics covered

- Strength of weak learners
- Boosting intuition
- Altering a sampling distribution

###### Mechanics of Boosting and its Pitfalls

##### Topics covered

- AdaBoost
- Update of weights of training data points and models in the ensemble
- Penalty function
- Strengths and weaknesses of boosting
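
The weight updates mentioned above can be illustrated with a single round of the classic discrete AdaBoost rule for labels in {-1, +1}. This is a sketch under that assumption, not the course's own code; `adaboost_round` is a hypothetical name, and the weak learner's predictions are given as input rather than trained.

```python
import math


def adaboost_round(weights, labels, predictions):
    """Run one AdaBoost weight update.

    weights: current weights of the training points (summing to 1)
    labels, predictions: values in {-1, +1}
    Returns the model weight alpha and the new normalized point weights.
    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified points (y * p == -1) are scaled up, correct ones scaled down.
    updated = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


labels = [1, 1, -1, -1, 1]
predictions = [1, 1, -1, -1, -1]  # only the last point is misclassified
weights = [0.2] * 5

alpha, new_weights = adaboost_round(weights, labels, predictions)
print("alpha:", alpha)
print("new weights:", new_weights)
```

After the update, the misclassified point carries half of the total weight, so the next weak learner is pushed to get it right. The same mechanism also explains a known weakness: noisy or mislabeled points can accumulate large weights.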

###### Online Experimentation

##### Topics covered

- A/B testing
- Multivariate tests
- Some interesting online experiments that defy intuition
- Online vs. offline metrics

###### Hypothesis Testing Fundamentals

##### Topics covered

- Control, treatment, and hypothesis testing
- Type I and Type II errors, and interactions
- Confidence interval and p-values
- Z-table and t-table
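
As a worked illustration of comparing control and treatment, the sketch below runs a two-sided, two-proportion z-test using the pooled standard error. The function names and the conversion counts are made up for this example, and the normal approximation is assumed to be adequate for the sample sizes shown.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def two_proportion_z_test(conv_control, n_control, conv_treatment, n_treatment):
    """Return (z statistic, two-sided p-value) for a difference in conversion rates."""
    p_control = conv_control / n_control
    p_treatment = conv_treatment / n_treatment
    pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    z = (p_treatment - p_control) / std_error
    p_value = 2 * (1 - normal_cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 10.0% conversion in control, 12.5% in treatment.
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)
print("z:", round(z, 3))
print("p-value:", round(p_value, 4))
```

A small p-value means a difference this large would be unlikely if the treatment had no effect. Rejecting the null hypothesis when it is actually true is a Type I error, and the chosen significance level bounds how often that happens.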

###### Running Experiments in the Real World

##### Topics covered

- Steps in online experimentation: choosing treatment, control, and factors
- Sample size selection
- Effect size
- A/A tests
- Logging and instrumentation
- Segmentation and interpretation

###### Deploying a Predictive Model as a Service

A user interface to a model makes it easier to see how the model would work in the real world, where a new customer enters the system and data is collected on their age, gender, and so on. We teach you direct and simple processes for setting up real-time prediction endpoints in the cloud, so that you can access your trained model from anywhere in the world. We walk you through constructing your own endpoints and show a few practical demos of how an endpoint exposes a predictive model to anyone you would like to use it, taking in new data and returning a prediction.

##### Topics covered

- Machine learning in the cloud
- Azure ML Studio
- Machine learning model management with Azure ML Studio

###### Introduction to Text Analytics

##### Topics covered

- Structured versus semi-structured versus unstructured data
- Structuring raw text
- Tokenization
- Stemming and lemmatization
- Stop words removal
- Treating punctuation, casing, and numbers in the text
- Creating a term dictionary
- Drawbacks of simple word frequency counts
- Term frequency – inverse document frequency
- Document similarity measure
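
The text-structuring steps above can be sketched end to end in plain Python. This is an illustrative example, not course code: the stop-word list is deliberately tiny, stemming and lemmatization are omitted, and the weighting uses the plain `tf * log(N / df)` form, which is only one of several common TF-IDF variants (libraries often apply smoothing).

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "on", "and"}  # tiny illustrative list


def tokenize(text):
    """Lowercase the text, replace punctuation with spaces, and drop stop words."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return [token for token in cleaned.split() if token not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The bird sat.",
]
vectors = tf_idf(documents)
for vector in vectors:
    print(vector)
```

The word "sat" appears in every document, so its weight is zero, while words unique to one document receive the highest weights. This addresses the drawback of raw frequency counts, where common words dominate.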

###### Unsupervised Learning and k-means Clustering

##### Topics covered

- Real-world problems that unsupervised learning algorithms solve
- The k-means clustering algorithm
- Euclidean distance measure
- Defining k
- The Elbow Method
- Strengths and limitations of k-means clustering
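
Here is a compact sketch of k-means with Euclidean distance, plus the within-cluster sum of squares that the elbow method plots against k. The function names, the random initialization from the data points, and the toy coordinates are assumptions made for this illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, centroids):
    """Group each point with its nearest centroid."""
    clusters = [[] for _ in centroids]
    for point in points:
        nearest = min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))
        clusters[nearest].append(point)
    return clusters


def k_means(points, k, max_iterations=100, seed=0):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


def within_cluster_ss(centroids, clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(
        euclidean(point, centroid) ** 2
        for centroid, cluster in zip(centroids, clusters)
        for point in cluster
    )


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for k in (1, 2, 3):
    centroids, clusters = k_means(points, k)
    print(k, within_cluster_ss(centroids, clusters))
```

The sum of squares drops sharply from k = 1 to k = 2 and only slightly afterwards, and that bend is the "elbow" used to pick k. Results depend on the initial centroids, which is one of the limitations of k-means.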

###### Math Fundamentals

##### Topics covered

- Introduction
- Derivatives and gradients
- Minima/maxima
- Convexity of functions and why convexity matters

###### Optimizing the Cost Function

##### Topics covered

- Gradient descent
- Batch gradient descent
- Stochastic gradient descent
- Mini-batch gradient descent
- Global vs. local minima
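
The three gradient descent variants differ only in how many rows feed each update. The sketch below fits a line `y = w * x + b` by minimizing mean squared error; the function name, learning rates, epoch counts, and data are assumptions chosen so the example converges, not recommended settings.

```python
import random


def gradient_descent(xs, ys, learning_rate, n_epochs, batch_size=None, seed=0):
    """Fit y ~ w * x + b by minimizing mean squared error.

    batch_size=None uses every row per update (batch gradient descent),
    batch_size=1 is stochastic, and anything in between is mini-batch.
    """
    rng = random.Random(seed)
    n = len(xs)
    batch_size = batch_size or n
    w = b = 0.0
    order = list(range(n))
    for _ in range(n_epochs):
        rng.shuffle(order)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # Gradients of the mean squared error with respect to w and b.
            grad_w = 2 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2 * sum(errors) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]  # noise-free line, so the minimum is at w=2, b=1

batch = gradient_descent(xs, ys, learning_rate=0.05, n_epochs=1000)
mini = gradient_descent(xs, ys, learning_rate=0.02, n_epochs=1000, batch_size=2)
stochastic = gradient_descent(xs, ys, learning_rate=0.02, n_epochs=1000, batch_size=1)
print(batch, mini, stochastic)
```

Mean squared error for a linear model is convex, so all three variants head toward the same global minimum here. On non-convex cost functions, the noise in stochastic and mini-batch updates changes which local minimum is reached.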

###### Evaluation of Regression Models

##### Topics covered

- Mean absolute error
- Root mean square error
- R-squared and adjusted R-squared measures
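
The regression metrics listed above follow directly from their definitions. The sketch below is illustrative, with made-up values and hypothetical function names; `n_features` is the number of predictors in the model being evaluated.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_square_error(actual, predicted):
    """Square root of the average squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(actual, predicted, n_features):
    """R-squared penalized for the number of predictors in the model."""
    n = len(actual)
    return 1 - (1 - r_squared(actual, predicted)) * (n - 1) / (n - n_features - 1)


actual = [3, 5, 7, 9]
predicted = [2, 5, 8, 9]

print("MAE:", mean_absolute_error(actual, predicted))
print("RMSE:", root_mean_square_error(actual, predicted))
print("R-squared:", r_squared(actual, predicted))
print("Adjusted R-squared:", adjusted_r_squared(actual, predicted, n_features=1))
```

RMSE squares the errors before averaging, so it penalizes large errors more heavily than MAE does. Adjusted R-squared only rises when an added feature improves the fit by more than chance would suggest.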

###### Predicting Prices of Real Estate using a Linear Regression Model

##### Topics covered

- Data cleaning
- Dropping low-quality features
- Selecting the strongest features using Pearson correlation
- Adjusting the learning rate, number of training epochs, L2 regularization weight

###### Regularization

##### Topics covered

- Regularization intuition
- L1 regularization or LASSO
- L2 regularization or Ridge regression
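
To build intuition for how the two penalties differ, the sketch below solves a deliberately simplified problem: one feature, no intercept. Under the assumed objectives `0.5 * sum((y - w*x)^2) + 0.5 * penalty * w^2` for ridge and `0.5 * sum((y - w*x)^2) + penalty * |w|` for LASSO, the optimal weight has a closed form. The function names and data are hypothetical.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty, returning 0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def ridge_weight(xs, ys, penalty):
    """Closed-form L2-regularized weight for a one-feature, no-intercept model."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + penalty)


def lasso_weight(xs, ys, penalty):
    """Closed-form L1-regularized weight for a one-feature, no-intercept model."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return soft_threshold(sum_xy, penalty) / sum_xx


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # the unregularized least-squares weight is 2

for penalty in (0.0, 14.0, 28.0):
    print(penalty, ridge_weight(xs, ys, penalty), lasso_weight(xs, ys, penalty))
```

Both penalties shrink the weight, but only the L1 penalty drives it exactly to zero once the penalty is large enough. That behavior is why LASSO is often described as performing feature selection, while ridge only shrinks.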

###### Collaborative and Content-based Recommendations

##### Topics covered

- Collaborative versus content-based recommenders
- The data structure of collaborative versus content-based recommenders
- Building user profiles and item profiles

###### Measures of Similarity

##### Topics covered

- Pearson’s correlation
- Cosine similarity
- N nearest neighbors
- Weighted and centered metrics
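
The similarity measures above can be written in a few lines, and doing so shows that Pearson's correlation is cosine similarity applied to mean-centered vectors. The names and the rating vectors below are made up for illustration, and the example assumes every user has rated every item.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Cosine similarity after subtracting each vector's mean (a centered metric)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])


def nearest_neighbors(target, candidates, n, similarity=cosine_similarity):
    """Return the names of the n candidates most similar to the target vector."""
    ranked = sorted(candidates, key=lambda name: similarity(target, candidates[name]), reverse=True)
    return ranked[:n]


# Hypothetical ratings of the same three items by four users.
alice = [5, 4, 1]
others = {"bob": [4, 5, 1], "carol": [1, 2, 5], "dave": [5, 5, 2]}

print(nearest_neighbors(alice, others, 2))
print(cosine_similarity([1, 2, 3], [3, 2, 1]))
print(pearson_correlation([1, 2, 3], [3, 2, 1]))
```

The last two lines show why centering matters: two users with opposite preferences still have a fairly high cosine similarity on raw ratings, whereas the centered Pearson measure reports them as perfectly opposed.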

###### Evaluation Metrics for Recommender Systems

##### Topics covered

- Mean absolute error
- Root mean square error
- Discounted cumulative gain (DCG) and normalized discounted cumulative gain (nDCG) for ranking evaluation
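
For ranking evaluation, DCG and nDCG can be computed as sketched below. This example uses the linear-gain form `rel / log2(position + 1)`; another common variant uses `2^rel - 1` as the gain, so treat the formula choice as an assumption. The relevance grades are made up.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance grades listed in ranked order."""
    return sum(rel / math.log2(position + 1) for position, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering of the same grades."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0


# Relevance grades of six recommended items, in the order they were shown.
ranking = [3, 2, 3, 0, 1, 2]

print("DCG:", round(dcg(ranking), 3))
print("nDCG:", round(ndcg(ranking), 3))
```

Unlike MAE and RMSE, which score predicted ratings, these metrics reward placing the most relevant items near the top of the list, and nDCG rescales the score so that a perfect ordering scores 1.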

###### Big Data Engineering

##### Topics covered

- Distributed computing and cloud infrastructure
- Hadoop
- Hadoop Distributed File System
- MapReduce
- Hive
- Mahout
- Spark

###### Real-Time/IoT

##### Topics covered

- Extract, transform, and load pipelines
- Data ingestion
- Event brokers
- Stream storage
- Azure Event Hub
- Stream Processing
- Event processors
- Access rights and access policies
- Querying streaming data and analysis

###### Kaggle Capstone

##### Topics covered

- Data pre-processing
- Data cleaning
- Feature Engineering
- Model Training
- Model Tuning

###### Self-Directed Labs

##### Topics covered

- Azure SQL Database
- HBase
- Hadoop
- HDInsight
- Azure PowerShell
- Mahout
- Spark
- Live Twitter Sentiment Analysis

###### Supplementary Topics

##### Topics covered

- Naïve Bayes classifier
- Logistic regression classifier
- Time series forecasting
- R Shiny interactive dashboards
- Advanced feature engineering
- Advanced model validation techniques
- Support vector machines
- Acing data science interviews
