Data Science Bootcamp Curriculum
Curriculum Highlights
Best Practices With a Business-First Approach
R, Python, and Advanced Cloud-Based Tools
Interaction With Practitioners and Experts
Additional Tutorials and Example Exercises
Curriculum Overview
Introduction to R Programming
Topics covered
- R basics
- R data types
- R language features
- R visualization
- Recommended R packages
Fundamentals of Data Mining
Topics covered
- Data attribute types
- Data pre-processing
- Similarity measures
- Data exploration
Introduction to Azure Machine Learning
Topics covered
- Azure ML basics
- Azure ML preprocessing
- Azure ML visualization
Introduction to Big Data, Data Science, and Predictive Analytics
Topics covered
- Big Data
- ETL Pipelines
- Data Mining
- Predictive Analytics
Importance of 'Data' in Data Science
Topics covered
- Understanding how and why “data beats algorithms”
- Importance of data cleaning, data pre-processing, and business domain knowledge
Data Exploration and Visualization
Topics covered
- Various data visualization and exploration techniques and packages
- Interpreting boxplots
- Histograms
- Density plots
- Scatterplots
- Segmentation and Simpson’s paradox
Feature Engineering
Topics covered
- Calculating features from numeric features
- Binning
- Grouping
- Quantizing
- Ratios and mathematical transforms for features in different applications
Storytelling with Data
Topics covered
- Understanding that the goal of data visualization is to communicate insights
- Interactive discussion on various interpretations of plots
- Learning how to identify data visualizations most appropriate to answer business questions
Predictive Modeling for Real World Problems
Topics covered
- Face detection
- Adversarial machine learning
- Spam detection
- Translating a real world problem to a machine learning problem
Supervised Learning and Classification
Supervised learning is about learning from historical data. We will examine some of the key assumptions in predictive modeling and discuss the scenarios in which the distribution of future data differs from that of the historical data.
Topics covered
- Supervised learning vs. Unsupervised learning
- Features
- Predictors
- Labels
- Target values
- Training
- Testing
- Evaluation
Decision Tree Classification
We will begin building predictive models by studying decision tree classification in depth. We will start with how nodes are split in a decision tree and with impurity measures such as entropy and the Gini index. We will also see how to vary the complexity of a decision tree by changing parameters such as the maximum depth, the number of observations on a leaf node, and the complexity parameter.
Topics covered
- Decision tree learning
- Impurity measures: Entropy and Gini index
- Varying decision tree complexity by varying model parameters
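As a minimal sketch of the impurity measures listed above (the function names and toy labels are illustrative assumptions, not course material), entropy and the Gini index can be computed from class counts, and a candidate split can be scored by the size-weighted impurity of its child nodes:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_impurity(left, right, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)

# Hypothetical node with a 50/50 class mix and one candidate split.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]
right = ["yes"] + ["no"] * 4

parent_gini = gini(parent)
split_gini = weighted_impurity(left, right, gini)
gini_reduction = parent_gini - split_gini  # larger reduction = better split
```

A tree learner would evaluate many candidate splits this way and keep the one with the largest impurity reduction.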
Building and Evaluating a Classification Model
We will build a classification model using decision tree learning. We will learn how to create train/test datasets, train the model, evaluate the model, and vary its hyperparameters.
Topics covered
- Train/test split
- Training, prediction, and evaluation
- Varying model hyperparameters such as maximum depth, number of observations on leaf nodes, and minimum number of observations for splitting
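The train, predict, and evaluate workflow can be sketched in plain Python. This is only an illustration under stated assumptions: the dataset is synthetic, the helper names are hypothetical, and the "tree" is a depth-1 decision stump on a single numeric feature rather than a full decision tree learner.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_stump(train):
    """Depth-1 tree: pick the threshold with the fewest training errors.

    The stump predicts class 1 when x > threshold, else class 0.
    """
    candidates = sorted({x for x, _ in train})

    def errors(threshold):
        return sum((x > threshold) != bool(y) for x, y in train)

    return min(candidates, key=errors)

def accuracy(threshold, rows):
    """Fraction of rows whose predicted class matches the label."""
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)

# Synthetic data (an assumption): the label is 1 exactly when x > 50.
rows = [(x, int(x > 50)) for x in range(100)]
train, test = train_test_split(rows)
threshold = fit_stump(train)
train_accuracy = accuracy(threshold, train)
test_accuracy = accuracy(threshold, test)
```

Comparing `train_accuracy` with `test_accuracy` is the basic check that the model was evaluated on data it did not see during training.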
Evaluation Metrics for Classification Models
Once we understand how to build a predictive model, we will discuss the importance of defining the correct evaluation metrics. We will use real-world anecdotes to show the circumstances under which one metric is a better choice than another.
Topics covered
- Confusion matrix
- False/true positives and false/true negatives
- Accuracy
- Precision
- Recall
- F1-score
- ROC curve and area under the ROC curve
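The metrics above all derive from the four confusion-matrix counts. The following sketch assumes binary labels and hypothetical predictions; it shows how accuracy can look acceptable on imbalanced data while recall stays low:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical imbalanced example: 3 positives out of 10 observations.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```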
Generalization and Overfitting
Building a model that generalizes well requires a solid understanding of the fundamentals. We will clarify what we mean by generalization and overfitting. We will also discuss bias and variance, and how the complexity of a model affects both.
Topics covered
- Generalization
- Overfitting
- Bias and variance
- Repeatability
- Bootstrap sampling
Tuning of Model Hyperparameters
How do we build a model that generalizes well and does not overfit? The answer is to adjust the complexity of the machine learning model to the right level. This process, known as hyperparameter tuning, is one of the most important skills you will learn. Using the decision tree learning parameters as an example, we will observe how a model changes as the tree becomes deeper or shallower. We will do practical hyperparameter tuning exercises using cross-validation.
Topics covered
- Model complexity
- Bias and variance
- K-fold cross-validation
- Leave one out cross-validation
- Time series cross-validation
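The cross-validation schemes listed above differ only in how row indices are divided into training and test sets. A minimal sketch, assuming contiguous folds with no shuffling and hypothetical function names:

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.

    Setting k == n_samples gives leave-one-out cross-validation.
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def time_series_splits(n_samples, n_splits):
    """Expanding-window splits: train on the past, test on the next block."""
    block = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        yield list(range(i * block)), list(range(i * block, (i + 1) * block))

folds = list(k_fold_indices(10, 3))
loo_folds = list(k_fold_indices(4, 4))
ts_folds = list(time_series_splits(10, 4))
```

In the time series variant, every test index is later than every training index, so the model is never evaluated on data that precedes what it was trained on.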
Bagging
Topics covered
- Binomial distribution
- Review of bias/variance
- Overfitting and generalization
- Sampling with/without replacement
- Bootstrapped sampling
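Bootstrapped sampling draws as many rows as the original dataset, with replacement. The sketch below uses made-up data and hypothetical names; it also computes the expected share of distinct rows in one bootstrap sample, 1 - (1 - 1/n)^n, which approaches roughly 63% for large n:

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

rng = random.Random(0)
data = list(range(1000))

sample = bootstrap_sample(data, rng)
unique_fraction = len(set(sample)) / len(data)

# Probability that a given row appears at least once in the sample.
expected_fraction = 1 - (1 - 1 / len(data)) ** len(data)
```

Bagging trains one model per bootstrap sample and combines their predictions by averaging or majority vote; the rows left out of a sample can serve as a built-in validation set for that model.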
Random Forest
Topics covered
- A quick review of decision tree splits
- Column randomization trick and why it is helpful in building more generalized models
Random Forest Hyperparameter Tuning
Topics covered
- Tuning parameters such as depth, number of trees, and number of random features selected
- Using R/Python libraries and Azure ML Studio to tune a model
Boosting Introduction
Topics covered
- Strength of weak learners
- Boosting intuition
- Altering a sampling distribution
Mechanics of Boosting and its Pitfalls
Topics covered
- AdaBoost
- Update of weights of training data points and models in the ensemble
- Penalty function
- Strengths and weaknesses of boosting
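One round of the AdaBoost weight update can be sketched as follows. This assumes the common formulation with labels in {-1, +1} and an exponential penalty; the function name and toy predictions are illustrative, and the weighted error is assumed to lie strictly between 0 and 1.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """Return (alpha, new_weights) after one boosting round.

    alpha is the weight of the weak learner in the ensemble; the data
    point weights are scaled up for mistakes and down for correct
    predictions, then renormalized to sum to 1.
    """
    total = sum(weights)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    alpha = 0.5 * math.log((1 - err) / err)
    scaled = [w * math.exp(-alpha * t * p)
              for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(scaled)
    return alpha, [w / norm for w in scaled]

# Five equally weighted points; the weak learner misclassifies the last one.
weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
```

After the update, the misclassified point carries half of the total weight, which is what forces the next weak learner to focus on it. The same mechanism explains a known weakness: noisy or mislabeled points keep gaining weight.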
Online Experimentation
Topics covered
- A/B Testing
- Multivariate tests
- Some interesting online experiments that defy intuition
- Online vs. offline metrics
Hypothesis Testing Fundamentals
Topics covered
- Control and treatment groups
- Hypothesis testing
- Type I and Type II errors, and interactions
- Confidence interval and p-values
- Z-table and t-table
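As an illustration of these ideas, the sketch below runs a two-sided, two-proportion z-test on hypothetical conversion counts for a control and a treatment group. It assumes samples large enough for the normal approximation, and it replaces a z-table lookup with the standard library's error function.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_proportion_z_test(conv_c, n_c, conv_t, n_t):
    """Return (z, two-sided p-value) for treatment vs. control rates."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - normal_cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: 10.0% vs. 12.5% conversion, 2000 users per arm.
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)
reject_null = p_value < 0.05  # the 5% threshold is the Type I error rate
```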
Running Experiments in the Real World
Topics covered
- Steps in online experimentation: Choosing treatment, control, and factors
- Sample size selection
- Effect size
- A/A tests
- Logging and instrumentation
- Segmentation and interpretation
Deploying a Predictive Model as a Service
A user interface to a model makes it easier to see how the model would work in the real world, where a new customer enters the system and data is collected on their age, gender, and so on. We teach you direct and simple processes for setting up real-time prediction endpoints in the cloud, allowing you to access your trained model from anywhere in the world. We walk you through constructing your own endpoints and show a few practical demos of how to expose a predictive model to anyone you would like to use it, so they can see how it takes new data and makes a prediction.
Topics covered
- Machine learning in the cloud
- Azure ML Studio
- Machine learning model management with Azure ML Studio
Introduction to Text Analytics
Topics covered
- Structured versus semi-structured versus unstructured data
- Structuring raw text
- Tokenization
- Stemming and lemmatization
- Stop words removal
- Treating punctuation, casing, and numbers in the text, creating a terms dictionary
- Drawbacks of simple word frequency counts
- Term frequency – inverse document frequency
- Document similarity measure
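The text-structuring steps above can be sketched end to end on three toy documents. The assumptions here are a tiny illustrative stop-word list, whitespace tokenization, no stemming or lemmatization, and the plain log(N / df) form of inverse document frequency (other variants add smoothing).

```python
import math
import string
from collections import Counter

STOP_WORDS = {"the", "a", "is", "and", "of"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: tf-idf weight} dictionary per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    return [
        {term: (count / len(toks)) * math.log(n_docs / doc_freq[term])
         for term, count in Counter(toks).items()}
        for toks in tokenized
    ]

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    "The cat sat on the mat.",
    "The cat chased a mouse.",
    "Stock prices fell sharply.",
]
vectors = tf_idf(docs)
sim_cat_docs = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
```

The two documents about a cat share a weighted term and get a positive similarity, while the unrelated document scores zero.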
Unsupervised Learning and k-means Clustering
Topics covered
- Real-world problems that unsupervised learning algorithms solve
- The K-means clustering algorithm
- Euclidean distance measure
- Defining k
- The Elbow Method
- Strengths and limitations of k-means clustering
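A compact sketch of k-means with Euclidean distance follows, along with the within-cluster sum of squares used by the elbow method. The points, the seed, and the function names are assumptions for illustration, and `math.dist` requires Python 3.8 or later.

```python
import math
import random

def k_means(points, k, n_iter=100, seed=0):
    """Alternate assignment and centroid updates until nothing changes."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

def inertia(centroids, clusters):
    """Within-cluster sum of squared distances (the elbow-method metric)."""
    return sum(math.dist(p, c) ** 2
               for c, cluster in zip(centroids, clusters) for p in cluster)

# Two well-separated groups of hypothetical 2-D points.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (9, 9), (9, 10), (10, 9), (10, 10)]
centroids, clusters = k_means(points, 2)
inertia_by_k = {k: inertia(*k_means(points, k)) for k in (1, 2, 3)}
```

Plotting `inertia_by_k` against k and looking for the point where the curve flattens is the elbow method. The result depends on the random initial centroids, which is one of the limitations of k-means.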
Math Fundamentals
Topics covered
- Introduction
- Derivatives and gradients
- Minima/maxima
- Convexity of functions and why convexity matters
Optimizing the Cost Function
Topics covered
- Gradient descent
- Batch gradient descent
- Stochastic gradient descent
- Mini-batch gradient descent
- Global vs. local minima
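The three gradient descent variants differ only in how many observations contribute to each update. The sketch below fits y = w*x + b by minimizing mean squared error; the data, learning rate, and epoch count are assumptions chosen so that this small example converges.

```python
import random

def gradient_descent(xs, ys, lr=0.05, epochs=2000, batch_size=None, seed=0):
    """Fit y = w*x + b by minimizing mean squared error.

    batch_size=None uses all rows per update (batch gradient descent),
    batch_size=1 is stochastic gradient descent, and anything in
    between is mini-batch gradient descent.
    """
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    size = batch_size or len(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), size):
            batch = data[start:start + size]
            # Gradients of the mean squared error over the current batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1, no noise

w_batch, b_batch = gradient_descent(xs, ys)
w_mini, b_mini = gradient_descent(xs, ys, batch_size=2)
```

Because the squared-error cost of a linear model is convex, every variant heads toward the same global minimum here; with non-convex cost functions the starting point and step size can lead to different local minima.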
Evaluation of Regression Models
Topics covered
- Mean absolute error
- Root mean square error
- R-squared and adjusted R-squared measures
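These regression metrics can be computed directly from actual and predicted values. A minimal sketch with hypothetical numbers and a hypothetical function name follows; adjusted R-squared also needs the number of features in the model.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, RMSE, R-squared, and adjusted R-squared."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_y = sum(y_true) / n
    ss_res = sum(r ** 2 for r in residuals)          # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total variation
    r2 = 1 - ss_res / ss_tot
    return {
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": r2,
        "adjusted_r2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
    }

y_true = [3, 5, 7, 9, 11]
y_pred = [2, 5, 8, 9, 12]
scores = regression_metrics(y_true, y_pred, n_features=1)
```

Adjusted R-squared is never higher than R-squared and falls when a feature is added without a sufficient improvement in fit.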
Predicting Prices of Real Estate using a Linear Regression Model
Topics covered
- Data cleaning
- Dropping low-quality features
- Selecting the strongest features using Pearson correlation
- Adjusting the learning rate, number of training epochs, and L2 regularization weight
Regularization
Topics covered
- Regularization intuition
- L1 regularization or LASSO
- L2 regularization or Ridge regression
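The different effects of the two penalties are easiest to see with a single feature and no intercept, where both solutions have a closed form. This sketch assumes the objectives (1/2)·Σ(y − w·x)² + (λ/2)·w² for ridge and (1/2)·Σ(y − w·x)² + λ·|w| for LASSO; libraries scale these terms differently, so the λ values are not comparable across tools.

```python
def ols_coefficient(xs, ys):
    """Unregularized least-squares slope for y = w*x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def ridge_coefficient(xs, ys, lam):
    """L2 penalty: shrinks the slope toward zero but never exactly to zero."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def lasso_coefficient(xs, ys, lam):
    """L1 penalty: soft-thresholding, which can set the slope exactly to zero."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * max(abs(sxy) - lam, 0.0) / sxx

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

w_ols = ols_coefficient(xs, ys)
w_ridge = ridge_coefficient(xs, ys, lam=14.0)
w_lasso = lasso_coefficient(xs, ys, lam=14.0)
w_lasso_strong = lasso_coefficient(xs, ys, lam=28.0)
```

With a strong enough penalty, the LASSO coefficient becomes exactly zero, which is why L1 regularization also acts as a feature selector.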
Collaborative and Content-based Recommendations
Topics covered
- Collaborative versus content recommenders
- The data structure of collaborative versus content-based recommenders
- Building user-profiles and item profiles
Measures of Similarity
Topics covered
- Pearson’s correlation
- Cosine similarity
- N nearest neighbors
- Weighted and centered metrics
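The relationship between these measures can be shown in a few lines: Pearson's correlation is cosine similarity applied to mean-centered vectors. The rating vectors below are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson(u, v):
    """Pearson's correlation: cosine similarity after centering each vector."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine_similarity([a - mean_u for a in u], [b - mean_v for b in v])

# Hypothetical ratings of the same three items by two users.
user_a = [1, 2, 3]
user_b = [3, 2, 1]  # opposite taste

raw_cosine = cosine_similarity(user_a, user_b)
centered = pearson(user_a, user_b)
```

The raw cosine is still fairly high because all ratings are positive, while the centered measure correctly reports the two users as opposites.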
Evaluation Metrics for Recommender Systems
Topics covered
- Mean absolute error
- Root mean square error
- Discounted cumulative gain (DCG) and normalized discounted cumulative gain (nDCG) for ranking evaluation
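For the ranking metrics, a short sketch helps show how position is discounted. It assumes the linear-gain form of DCG, rel / log2(rank + 1), with graded relevance scores listed in the order the recommender ranked the items; another common variant uses 2^rel − 1 as the gain.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for relevances in ranked order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Hypothetical graded relevance (0-3) of six recommended items, as ranked.
ranked_relevances = [3, 2, 3, 0, 1, 2]
dcg_score = dcg(ranked_relevances)
ndcg_score = ndcg(ranked_relevances)
```

Because nDCG is normalized by the best achievable ordering, it lies between 0 and 1 and can be compared across users with different numbers of relevant items.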
Big Data Engineering
Topics covered
- Distributed computing and cloud infrastructure
- Hadoop
- Hadoop Distributed File System
- MapReduce
- Hive
- Mahout
- Spark
Real-Time/IoT
Topics covered
- Extract, transform, and load pipelines
- Data ingestion
- Event brokers
- Stream storage
- Azure Event Hub
- Stream Processing
- Event processors
- Access rights and access policies
- Querying streaming data and analysis
Kaggle Capstone
Topics covered
- Data pre-processing
- Data cleaning
- Feature Engineering
- Model Training
- Model Tuning
Self-Directed Labs
Topics covered
- Azure SQL Database
- HBase
- Hadoop
- HDInsight
- Azure PowerShell
- Mahout
- Spark
- Live Twitter Sentiment Analysis
Supplementary Topics
Topics covered
- Naïve Bayes classifier
- Logistic regression classifier
- Time series forecasting
- R Shiny interactive dashboards
- Advanced feature engineering
- Advanced model validation techniques
- Support vector machines
- Acing data science interviews