# Data Science Bootcamp Curriculum

A comprehensive curriculum designed by experienced practitioners.

## Curriculum Highlights

### Best Practices With a Business-First Approach

### R, Python, and Advanced Cloud-Based Tools

### Interaction With Practitioners and Experts

### Additional Tutorials and Example Exercises

## Curriculum Overview

##### Topics covered

- R basics
- R data types
- R language features
- R visualization
- Recommended R packages

##### Topics covered

- Data attribute types
- Data pre-processing
- Similarity measures
- Data exploration

##### Topics covered

- Azure ML basics
- Azure ML preprocessing
- Azure ML visualization

We introduce you to the wide world of big data, pulling back the curtain on the diversity and ubiquity of data science in the modern world. We also give you a bird’s-eye view of the subfields of predictive analytics and the pieces of a big data pipeline.

##### Topics covered

- Big data
- ETL pipelines
- Data mining
- Predictive analytics

##### Topics covered

- Understanding how and why “data beats algorithms”
- Importance of data cleaning, data pre-processing, and business domain knowledge

##### Topics covered

- Various data visualization and exploration techniques and packages
- Interpreting boxplots
- Histograms
- Density plots
- Scatterplots
- Segmentation and Simpson’s paradox

##### Topics covered

- Calculating features from numeric features
- Binning
- Grouping
- Quantizing
- Ratios and mathematical transforms for features in different applications

##### Topics covered

- Understanding that the goal of data visualization is communicating insights
- Interactive discussion on various interpretations of plots
- Learning how to identify data visualizations most appropriate to answer business questions

Translating a real-world business problem into a machine learning problem takes a lot of practice. We will take some common applications of predictive analytics around us and discuss the process of turning each one into a predictive analytics problem.

##### Topics covered

- Face detection
- Adversarial machine learning
- Spam detection
- Translating a real-world problem to a machine learning problem

Supervised learning is about learning from historical data. We will examine some of the key assumptions in predictive modeling and discuss the scenarios in which future data will no longer follow the same distribution as the historical data.

##### Topics covered

- Supervised learning vs. unsupervised learning
- Features
- Predictors
- Labels
- Target values
- Training
- Testing
- Evaluation

We will start learning to build predictive models by studying decision tree classification in depth. We will begin with how nodes are split in a decision tree and with impurity measures such as entropy and the Gini index. We will also see how to vary the complexity of a decision tree by changing parameters such as maximum depth, number of observations on the leaf node, and the complexity parameter.

##### Topics covered

- Decision tree learning
- Impurity measures: Entropy and Gini index
- Varying decision tree complexity by varying model parameters
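
The impurity measures named above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than course material: the function names (`entropy`, `gini`, `weighted_impurity`) and the toy labels are assumptions made for the example. It scores one candidate split by how much it reduces entropy (the information gain).

```python
import math
from collections import Counter


def class_proportions(labels):
    """Fraction of observations belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def entropy(labels):
    """Shannon entropy in bits: 0 for a pure node, 1 for a 50/50 binary node."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def gini(labels):
    """Gini index: 0 for a pure node, 0.5 for a 50/50 binary node."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def weighted_impurity(left, right, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


# Toy node with 5 "yes" and 5 "no" labels, and one candidate split of it.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

information_gain = entropy(parent) - weighted_impurity(left, right, entropy)
gini_reduction = gini(parent) - weighted_impurity(left, right, gini)

print(entropy(parent))  # 1.0
print(gini(parent))     # 0.5
print(information_gain, gini_reduction)
```

A tree learner would evaluate many candidate splits this way and keep the one with the largest reduction in impurity.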

We will build a classification model using decision tree learning. We will learn how to create train/test datasets, train the model, evaluate the model and vary model hyperparameters.

##### Topics covered

- Train/test split
- Training, prediction, and evaluation
- Varying model hyperparameters such as maximum depth, number of observations on leaf nodes, and minimum number of observations for splitting
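
The split, train, and evaluate workflow can be sketched without any library. The example below is a simplified stand-in, not the bootcamp's exercise: it uses a decision stump (a depth-1 tree on a single numeric feature) instead of a full decision tree, and the helper names and synthetic data are assumptions made for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_stump(train):
    """Pick the threshold with the fewest training errors.

    The stump predicts class 1 when x > threshold, otherwise class 0.
    """
    best_threshold, best_errors = None, len(train) + 1
    for threshold, _ in train:
        errors = sum((x > threshold) != bool(y) for x, y in train)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def accuracy(threshold, rows):
    """Fraction of rows the stump classifies correctly."""
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)


# Synthetic data: one numeric feature x, label 1 when x > 50.
rows = [(x, int(x > 50)) for x in range(100)]

train, test = train_test_split(rows)
threshold = fit_stump(train)

print(len(train), len(test))
print("train accuracy:", accuracy(threshold, train))
print("test accuracy:", accuracy(threshold, test))
```

The test accuracy is computed only on rows the stump never saw during fitting, which is the point of holding data out.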

Once we have understood how to build a predictive model, we will discuss the importance of defining the correct evaluation metrics. We will use real-world anecdotes to show under what circumstances one metric might be a better choice than another.

##### Topics covered

- Confusion matrix
- False/true positives and false/true negatives
- Accuracy
- Precision
- Recall
- F1-score
- ROC curve and area under the ROC curve
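
As a small illustration of why the choice of metric matters, the sketch below computes the confusion-matrix counts and the derived metrics for two classifiers on an imbalanced toy dataset. The function names and the data are assumptions made for the example. Both classifiers reach the same accuracy, but the one that always predicts the negative class has zero recall.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1, with 0.0 when a ratio is undefined."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Imbalanced toy labels: 2 positives out of 10 observations.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
model_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # finds one of the two positives
always_negative = [0] * 10                   # never predicts the positive class

model_metrics = classification_metrics(y_true, model_pred)
baseline_metrics = classification_metrics(y_true, always_negative)

print(model_metrics)     # accuracy 0.8, precision 0.5, recall 0.5, f1 0.5
print(baseline_metrics)  # accuracy 0.8, precision 0.0, recall 0.0, f1 0.0
```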

Building a model that generalizes well requires a solid understanding of the fundamentals. We will clarify what we mean by generalization and overfitting. We will also discuss bias and variance, and how the complexity of a model affects both.

##### Topics covered

- Generalization
- Overfitting
- Bias and variance
- Repeatability
- Bootstrap sampling
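
Bootstrap sampling, the last topic above, is one way to see how much an estimate would vary if the data had been slightly different. The sketch below is illustrative only: the function names, the number of rounds, and the toy data are assumptions. It resamples a dataset with replacement and looks at the spread of the resampled means.

```python
import random
import statistics


def bootstrap_sample(data, rng):
    """Draw len(data) observations from data, with replacement."""
    return [rng.choice(data) for _ in data]


def bootstrap_means(data, n_rounds=1000, seed=0):
    """Mean of each of n_rounds bootstrap samples."""
    rng = random.Random(seed)
    return [statistics.mean(bootstrap_sample(data, rng)) for _ in range(n_rounds)]


data = [2, 4, 4, 5, 7, 9, 10, 12]
means = bootstrap_means(data)

# The spread of the bootstrap means approximates the variability of the estimate.
print("mean of bootstrap means:", statistics.mean(means))
print("std of bootstrap means:", statistics.stdev(means))
```

The same resampling idea underlies bagging, where each model in an ensemble is trained on a different bootstrap sample.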

How do we build a model that generalizes well and does not overfit? The answer is to adjust the complexity of the machine learning model to the right level. This process, known as hyperparameter tuning, is one of the most important skills you will learn. Using the decision tree learning parameters as an example, we will observe how a model changes when we build a deeper or a shallower tree. We will do practical hyperparameter tuning exercises using cross-validation.

##### Topics covered

- Model complexity
- Bias and variance
- K-fold cross-validation
- Leave-one-out cross-validation
- Time series cross-validation
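
The mechanics of k-fold cross-validation can be sketched as an index generator. This is a minimal sketch under stated assumptions: `k_fold_indices` and `cross_validate` are illustrative names, `fit_and_score` is a hypothetical callable supplied by the caller, and the data is assumed to be shuffled already (which would not be appropriate for time series).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def cross_validate(n_samples, k, fit_and_score):
    """Average the validation score returned by fit_and_score over k folds.

    fit_and_score(train_idx, val_idx) is a caller-supplied function that
    trains on the training rows and returns a score on the validation rows.
    """
    scores = [fit_and_score(train_idx, val_idx)
              for train_idx, val_idx in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Each observation is used for validation exactly once.
for train_idx, val_idx in k_fold_indices(6, 3):
    print("train:", train_idx, "validate:", val_idx)
```

For hyperparameter tuning, `cross_validate` would be called once per candidate value (for example, each maximum depth), and the value with the best average score kept. Setting `k` equal to `n_samples` gives leave-one-out cross-validation.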

##### Topics covered

- Binomial distribution
- Review of bias/variance
- Overfitting and generalization
- Sampling with/without replacement
- Bootstrapped sampling

##### Topics covered

- A quick review of decision tree splits
- Column randomization trick and why it is helpful in building more generalized models

##### Topics covered

- Tuning parameters such as depth, number of trees, and number of random features selected
- Using R/Python libraries and Azure ML Studio to tune a model

Boosting is an immensely powerful and understandably popular technique. We discuss the fundamental ideas behind boosting. We also get an intuitive understanding of how one can alter the sampling distribution while sampling for each round of boosting.

##### Topics covered

- Strength of weak learners
- Boosting intuition
- Altering a sampling distribution
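
One common way to alter the sampling distribution between boosting rounds is the AdaBoost-style reweighting sketched below. It is an illustration under assumptions, not the course's implementation: the function name `reweight` and the toy inputs are made up for the example, and the weighted error is assumed to lie strictly between 0 and 1.

```python
import math
import random


def reweight(weights, correct):
    """One AdaBoost-style round: return (alpha, new_weights).

    weights: current sampling weights, summing to 1.
    correct: whether the current weak learner classified each point correctly.
    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - error) / error)  # vote of this weak learner
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Five points with uniform weights; the weak learner gets the last one wrong.
weights = [0.2] * 5
correct = [True, True, True, True, False]
alpha, new_weights = reweight(weights, correct)
print(alpha, new_weights)

# Next round's training set is drawn according to the new weights,
# so the misclassified point is far more likely to be selected.
rng = random.Random(0)
next_round = rng.choices(range(5), weights=new_weights, k=5)
print(next_round)
```

After the update, the misclassified point carries half of the total weight, which forces the next weak learner to focus on it.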

##### Topics covered

- AdaBoost
- Update of weights of training data points and models in the ensemble
- Penalty function
- Strengths and weaknesses of boosting

##### Topics covered

- A/B testing
- Multivariate tests
- Some interesting online experiments that defy intuition
- Online vs. offline metrics

##### Topics covered

- Control and treatment
- Hypothesis testing
- Type I and Type II errors, and interactions
- Confidence intervals and p-values
- Z-table and t-table

##### Topics covered

- Steps in online experimentation: choosing treatment, control, and factors
- Sample size selection
- Effect size
- A/A tests
- Logging and instrumentation
- Segmentation and interpretation

A user interface to a model makes it easier to see how the model would work in the real world, where a new customer enters the system and data is collected on their age, gender, and so on. We teach you direct, simple processes for setting up real-time prediction endpoints in the cloud, allowing you to access your trained model from anywhere in the world. We walk you through constructing your own endpoints and show a few practical demos of how an endpoint exposes a predictive model to anyone you’d like to use it, so they can see how it takes new data and makes a prediction.

##### Topics covered

- Machine learning in the cloud
- Azure ML Studio
- Machine learning model management with Azure ML Studio

##### Topics covered

- Structured versus semi-structured versus unstructured data
- Structuring raw text
- Tokenization
- Stemming and lemmatization
- Stop words removal
- Treating punctuation, casing, and numbers in the text
- Creating a terms dictionary
- Drawbacks of simple word frequency counts
- Term frequency-inverse document frequency (TF-IDF)
- Document similarity measure

##### Topics covered

- Real-world problems that unsupervised learning algorithms solve
- The k-means clustering algorithm
- Euclidean distance measure
- Defining k
- The elbow method
- Strengths and limitations of k-means clustering

##### Topics covered

- Introduction
- Derivatives and gradients
- Minima/maxima
- Convexity of functions and why convexity matters

##### Topics covered

- Gradient descent
- Batch gradient descent
- Stochastic gradient descent
- Mini-batch gradient descent
- Global vs. local minima

We discuss the different evaluation metrics for a regression model and in what scenarios each of them might be a good choice.

##### Topics covered

- Mean absolute error
- Root mean square error
- R-squared and adjusted R-squared measure
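
The regression metrics above can be written out directly from their definitions. The sketch below is illustrative, with function names and toy values chosen for the example. One prediction is off by 4 while the others are off by at most 1, which shows how root mean square error penalizes a large miss more heavily than mean absolute error does.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    """Square root of the average squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """R-squared penalized for the number of features in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 13.0]  # absolute errors: 1, 0, 1, 4

mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_square_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)

print(mae)   # 1.5
print(rmse)  # about 2.12, pulled up by the single error of 4
print(r2)    # about 0.1
print(adjusted_r_squared(r2, n_samples=4, n_features=1))
```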

##### Topics covered

- Data cleaning
- Dropping low-quality features
- Selecting the strongest features using Pearson correlation
- Adjusting the learning rate, number of training epochs, L2 regularization weight

##### Topics covered

- Regularization intuition
- L1 regularization or LASSO
- L2 regularization or Ridge regression

##### Topics covered

- Collaborative versus content-based recommenders
- The data structure of collaborative versus content-based recommenders
- Building user profiles and item profiles

Both collaborative and content-based recommenders rely on similarity, but how do we measure similarity between vectors? We discuss several approaches to measuring similarity and when to use each one.

##### Topics covered

- Pearson’s correlation
- Cosine similarity
- N nearest neighbors
- Weighted and centered metrics
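
Two of the similarity measures above can be compared on a pair of rating vectors. The sketch below is illustrative: the function names and the ratings for the two hypothetical users are assumptions made for the example. It also shows that Pearson's correlation is the cosine similarity of mean-centered vectors, which is why it ignores a user's tendency to rate everything higher or lower.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Cosine similarity after subtracting each vector's own mean."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_similarity(centered_a, centered_b)


# Hypothetical ratings of the same four items by two users.
# Bob ranks the items in the same order as Alice but rates each one point lower.
alice = [5, 4, 3, 2]
bob = [4, 3, 2, 1]

print(cosine_similarity(alice, bob))    # close to, but below, 1
print(pearson_correlation(alice, bob))  # 1 up to floating-point rounding
```

In a neighborhood-based recommender, scores like these would be used to pick the N most similar users or items.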

##### Topics covered

- Mean absolute error
- Root mean square error
- Discounted cumulative gain (DCG) and normalized discounted cumulative gain (nDCG) for ranking evaluation

##### Topics covered

- Distributed computing and cloud infrastructure
- Hadoop
- Hadoop Distributed File System
- MapReduce
- Hive
- Mahout
- Spark

##### Topics covered

- Extract, transform, and load pipelines
- Data ingestion
- Event brokers
- Stream storage
- Azure Event Hub
- Stream processing
- Event processors
- Access rights and access policies
- Querying streaming data and analysis

##### Topics covered

- Data pre-processing
- Data cleaning
- Feature Engineering
- Model Training
- Model Tuning

##### Topics covered

- Azure SQL Database
- HBase
- Hadoop
- HDInsight
- Azure PowerShell
- Mahout
- Spark
- Live Twitter sentiment analysis

##### Topics covered

- Naïve Bayes classifier
- Logistic regression classifier
- Time series forecasting
- R Shiny interactive dashboards
- Advanced feature engineering
- Advanced model validation techniques
- Support vector machines
- Acing data science interviews
