
Data Science Bootcamp Curriculum

A comprehensive curriculum designed by experienced practitioners.

Curriculum Overview

Introduction to R Programming

Here we introduce the basics of the R programming language. R is a free, open-source statistical programming platform. It is designed to make many of the most common data processing tasks as simple as possible. With this knowledge, you’ll be able to engage fully with the hands-on exercises in the class.

Topics covered

R basics

R data types
R language features
R visualization
Recommended R packages

Fundamentals of Data Mining

Here we introduce the fundamentals of data mining, the process of identifying patterns and relationships that help businesses make informed decisions. We cover commonly encountered data attribute types and techniques for pre-processing data. We also cover similarity measures, which quantify how alike two data points are, and data exploration, which uses visualization and statistical techniques to find patterns in the data.

Topics covered

Data attribute types

Data pre-processing
Similarity measures
Data exploration

Introduction to Azure Machine Learning

Azure Machine Learning Studio is a fully featured graphical data science tool in the cloud. You will learn how to upload, analyze, visualize, manipulate, and clean data using the intuitive interface of Azure ML.

Topics covered

Azure ML basics

Azure ML preprocessing
Azure ML visualization

Introduction to Big Data, Data Science, and Predictive Analytics

We introduce you to the wide world of Big Data, pulling back the curtain on the diversity and ubiquity of data science in the modern world. We also give you a bird’s-eye view of the subfields of predictive analytics and the pieces of a big data pipeline.

Topics covered

Big Data

ETL Pipelines
Data Mining
Predictive Analytics

Importance of 'Data' in Data Science

Beginners in data science often put too much emphasis on machine learning algorithms while ignoring the fact that garbage data will only produce garbage insights. Data quality is one of the most overlooked issues in data science. We discuss challenges and best practices in data acquisition, processing, transformation, cleaning and loading.

Topics covered

Understanding how and why “data beats algorithms”
Importance of data cleaning, data pre-processing, and business domain knowledge

Data Exploration and Visualization

Through a series of hands-on exercises and interactive discussions, we will learn how to dissect and explore data. We take different datasets and discuss the best way to explore and visualize each one. We form hypotheses and test their validity using various data exploration and visualization techniques.

Topics covered

Various data visualization and exploration techniques and packages
Interpreting boxplots
Histograms
Density plots
Scatterplots
Segmentation and Simpson’s paradox

Feature Engineering

Feature engineering is one of the most important aspects of building machine learning models. We will practice engineering new features and cleaning data before reporting or modeling.

Topics covered

Calculating features from numeric features

Binning
Grouping
Quantizing
Ratios and mathematical transforms for features in different applications
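
As a rough illustration of the techniques listed above, the following Python sketch bins a numeric feature, derives a ratio feature, and applies a log transform. The function name `bin_value`, the bin edges, and all the numbers are made-up assumptions for illustration, not course material.

```python
import math


def bin_value(value, upper_edges, labels):
    """Map a numeric value to a labeled bin.

    `labels` has one more entry than `upper_edges`; the last label
    catches every value above the final edge.
    """
    for upper, label in zip(upper_edges, labels):
        if value <= upper:
            return label
    return labels[-1]


# Binning: turn a raw age into a coarse age group (hypothetical edges).
ages = [5, 17, 34, 70]
age_groups = [
    bin_value(age, [12, 19, 64], ["child", "teen", "adult", "senior"])
    for age in ages
]

# Ratio feature: debt relative to income (made-up values).
incomes = [40_000.0, 85_000.0]
debts = [10_000.0, 17_000.0]
debt_to_income = [debt / income for debt, income in zip(debts, incomes)]

# Mathematical transform: log1p compresses a skewed feature such as income.
log_income = [math.log1p(income) for income in incomes]

print(age_groups)
print(debt_to_income)
```

Which edges, ratios, and transforms are useful depends on the application, so treat these choices as placeholders.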

Storytelling with Data

Experienced data professionals will tell you that storytelling is one of the most important skills for communicating insights. We will practice the skill of storytelling while presenting analysis.

Topics covered

Understanding that the goal of data visualization is communicating insights
Interactive discussion on various interpretations of plots
Learning how to identify the data visualizations most appropriate for answering business questions

Predictive Modeling for Real World Problems

Taking a real-world business problem and translating it into a machine learning problem takes a lot of practice. We will take some common applications of predictive analytics around us and discuss the process of turning each one into a predictive analytics problem.

Topics covered

Face detection

Adversarial machine learning
Spam detection
Translating a real-world problem to a machine learning problem

Supervised Learning and Classification

Supervised learning is about learning from historical data. We will examine some of the key assumptions in predictive modeling and discuss the scenarios in which the distribution of future data does not remain the same as that of the historical data.

Topics covered

Supervised learning vs. Unsupervised learning

Features
Predictors
Labels
Target values
Training
Testing
Evaluation

Decision Tree Classification

We will start learning to build predictive models by understanding decision tree classification in depth. We begin with how nodes are split in a decision tree and with impurity measures such as entropy and the Gini index. We will also see how to vary the complexity of a decision tree by changing parameters such as maximum depth, the number of observations in a leaf node, and the complexity parameter.

Topics covered

Decision tree learning

Impurity measures: Entropy and Gini index
Varying decision tree complexity by varying model parameters
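
To make the impurity measures above concrete, here is a minimal Python sketch of entropy, the Gini index, and the information gain of a candidate split. The function names and the toy labels are illustrative assumptions; libraries used in class may compute these differently in detail.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class labels at a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index of the class labels at a node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Toy node with two classes in equal proportion (made-up labels).
parent = ["yes", "yes", "no", "no"]
print(entropy(parent), gini(parent))

# A split that separates the classes perfectly removes all impurity.
print(information_gain(parent, ["yes", "yes"], ["no", "no"]))
```

A tree learner evaluates many candidate splits this way and keeps the one with the largest impurity reduction.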

Building and Evaluating a Classification Model

We will build a classification model using decision tree learning. We will learn how to create train/test datasets, train the model, evaluate the model and vary model hyperparameters.

Topics covered

Train/test split

Training, prediction, and evaluation
Varying model hyperparameters such as maximum depth, number of observations on leaf nodes, and minimum number of observations for splitting

Evaluation Metrics for Classification Models

Once we have understood how to build a predictive model, we will discuss the importance of defining the correct evaluation metrics. We will use real-world anecdotes to examine the circumstances under which one metric is a better choice than another.

Topics covered

Confusion matrix

False/true positives and false/true negatives
Accuracy
Precision
Recall
F1-score
ROC curve and area under the ROC curve
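
The metrics above all derive from the four cells of the confusion matrix. The sketch below computes them for a binary problem; the function name `classification_metrics` and the toy predictions are assumptions made for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix counts and derived metrics for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 3 positives and 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

On imbalanced data, accuracy can look good even when precision or recall is poor, which is one reason the choice of metric matters.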

Generalization and Overfitting

Building a model that generalizes well requires a solid understanding of the fundamentals. We will clarify what we mean by generalization and overfitting. We will also discuss the ideas of bias and variance and how the complexity of a model affects both.

Topics covered

Generalization

Overfitting
Bias and variance
Repeatability
Bootstrap sampling

Tuning of Model Hyperparameters

How do we build a model that generalizes well and does not overfit? The answer is to adjust the complexity of the machine learning model to the right level. This process, known as hyperparameter tuning, is one of the most important skills you will learn. Using the decision tree learning parameters as an example, we will observe how a model is affected by growing a deeper or shallower tree. We will do practical hyperparameter tuning exercises using cross-validation.

Topics covered

Model complexity

Bias and variance
K-fold cross-validation
Leave-one-out cross-validation
Time series cross-validation
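
As a sketch of how k-fold cross-validation partitions data, the following Python generator yields train and test indices for each fold. The name `k_fold_indices` is an assumption, and the folds are contiguous with no shuffling to keep the example short; in practice the rows are usually shuffled first, except for time series.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Each of the 10 samples lands in the test set of exactly one fold.
for train, test in k_fold_indices(10, 3):
    print("train:", train, "test:", test)

# Leave-one-out cross-validation is the special case k == n_samples.
loo_folds = list(k_fold_indices(4, 4))
```

A hyperparameter value is then scored by averaging the evaluation metric over the k test folds.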

Bagging

Mathematical understanding of concepts is easier when we start with developing an intuition for the (maybe not so) complex math behind an apparently complex topic. Having built a solid understanding of the concepts of bias, variance, and generalization, we explain why building a committee of models improves generalization. We also review math topics such as bootstrap sampling and binomial distribution that are key to understanding why ensembles work so well.

Topics covered

Binomial distribution

Review of bias/variance
Overfitting and generalization
Sampling with/without replacement
Bootstrapped sampling
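
The link between the binomial distribution and ensembles can be sketched in a few lines of Python. Under the simplifying assumption that every model is correct independently with the same probability, the accuracy of a majority vote is a binomial tail probability. Real ensemble members are correlated, so this is an idealized upper-level intuition rather than a guarantee; the function names are illustrative.

```python
import random
from math import comb


def majority_vote_accuracy(n_models, p):
    """P(majority of n independent models is correct), each correct with prob. p.

    Assumes an odd number of models so that ties cannot occur.
    """
    k_min = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


def bootstrap_sample(data, rng):
    """Sample len(data) items with replacement."""
    return rng.choices(data, k=len(data))


# A committee of modestly accurate models beats any single member
# (under the independence assumption).
for n in (1, 3, 25):
    print(n, majority_vote_accuracy(n, 0.6))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = bootstrap_sample(list(range(10)), rng)
print(sample)
```

Bagging trains each model on a different bootstrap sample, which is one way to make the models less correlated with each other.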

Random Forest

With a solid understanding of bagging in place, we move on to the idea of feature/column randomization. We explain how feature randomization helps overcome the greediness of decision tree learning and make the case for Random Forest.

Topics covered

A quick review of decision tree splits

Column randomization trick and why it is helpful in building more generalized models

Random Forest Hyperparameter Tuning

A hands-on exercise in selecting the appropriate number of trees, the number of random features, and other tuning parameters in a Random Forest and variants of the technique.

Topics covered

Tuning parameters such as depth, number of trees, and number of random features selected
Using R/Python libraries and Azure ML Studio to tune a model

Boosting Introduction

Boosting is an immensely powerful and understandably popular technique. We discuss the fundamental ideas behind boosting. We also get an intuitive understanding of how one can alter the sampling distribution while sampling for each round of boosting.

Topics covered

Strength of weak learners

Boosting intuition
Altering a sampling distribution

Mechanics of Boosting and its Pitfalls

Armed with an intuitive understanding of boosting, we pick AdaBoost as an example. We explain the mechanics of AdaBoost: the weight update for training data, altering the sampling distribution, and the weight update for the models in an ensemble. We also discuss the strengths and weaknesses of boosting and its potential pitfalls.

Topics covered

AdaBoost

Update of weights of training data points and models in the ensemble
Penalty function
Strengths and weaknesses of boosting
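
One round of the weight update can be sketched as follows. This follows the standard discrete AdaBoost formulation with labels in {-1, +1} and an exponential penalty; the function name `adaboost_round` and the toy predictions are assumptions, and the variant taught in class may differ in detail.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One weight update of discrete AdaBoost with labels in {-1, +1}.

    Returns the model weight (alpha) and the renormalized data weights.
    """
    # Weighted error of the current weak learner.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    if not 0.0 < error < 1.0:
        raise ValueError("weighted error must be strictly between 0 and 1")

    # Model weight: accurate learners get a larger say in the ensemble.
    alpha = 0.5 * math.log((1.0 - error) / error)

    # t * p is +1 for correct points and -1 for misclassified points,
    # so misclassified points are scaled up and correct ones scaled down.
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Five points with uniform weights; the weak learner gets the last one wrong.
weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the update, the misclassified point carries half of the total weight, so the next weak learner is pushed to get it right. The same mechanism is what makes boosting sensitive to noisy labels and outliers.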

Online Experimentation

Design of experiments and hypothesis testing are among the most useful tools in data science. We kick off by discussing why online experimentation is needed in the first place. We also discuss the difference between online and offline metrics. We will have a group activity to discuss hypothetical ‘Facebook’, ‘Amazon’, and ‘Google’ examples of online metrics.

Topics covered

A/B Testing

Multivariate tests
Some interesting online experiments that defy intuition
Online vs. offline metrics

Hypothesis Testing Fundamentals

Designing and running experiments depend upon a good understanding of hypothesis testing fundamentals. We offer a quick overview of hypothesis testing with all the necessary concepts. We take a practical example and calculate confidence intervals at varying confidence levels for both small and large sample sizes. We explain the fundamentals intuitively, without getting too involved in the mathematical details.

Topics covered

Control, treatment, and hypothesis testing
Type I and Type II errors and interactions
Confidence interval and p-values
Z-table and t-table
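
As a small worked example of a confidence interval, the sketch below uses the normal (z) approximation from Python's standard library. The sample values and function names are made up for illustration. For small samples a t-table value would normally replace the z value, which the standard library does not provide, so that case is left out here.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_value(confidence):
    """Two-sided critical value from the standard normal distribution."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def z_confidence_interval(sample, confidence=0.95):
    """Confidence interval for the mean using the normal approximation."""
    center = mean(sample)
    half_width = z_value(confidence) * stdev(sample) / sqrt(len(sample))
    return center - half_width, center + half_width


# Hypothetical measurements of a metric in the treatment group.
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]

low95, high95 = z_confidence_interval(sample, 0.95)
low99, high99 = z_confidence_interval(sample, 0.99)
print((low95, high95))
print((low99, high99))
```

Raising the confidence level widens the interval, while a larger sample narrows it because the standard error shrinks with the square root of the sample size.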

Running Experiments in the Real World

Running online experiments in the real world is both a science and an art. We discuss the various steps in an experiment and emphasize the importance of each step. We also discuss the potential pitfalls in an online experimentation pipeline.

Topics covered

Steps in online experimentation: choosing treatment, control, and factors
Sample size selection
Effect size
A/A tests
Logging and instrumentation
Segmentation and interpretation

Deploying a Predictive Model as a Service

A user interface into a model makes it easier to see how it would work in the real world, where a new customer enters the system and data is collected on their age, gender, and so on. We teach you direct and simple processes for setting up real-time prediction endpoints in the cloud, allowing you to access your trained model from anywhere in the world. We walk you through constructing your own endpoints and show a few practical demos of how to expose a predictive model to anyone you would like to use it, so they can see how it takes new data and makes a prediction.

Topics covered

Machine learning in cloud
Azure ML Studio
Machine learning model management with Azure ML Studio

Introduction to Text Analytics

You will not always work with fully structured data. Many applications of data science require analysis of unstructured data such as text. We will teach you the basics of converting text into structured data, and how to model documents to find their similarities and recommend similar documents. We cover the important steps in pre-processing text in order to create textual features and prepare the text for modeling or analysis. This includes stemming and lemmatization, treating punctuation and other textual components, stop words removal, and more. We also demonstrate how to model documents using term frequency-inverse document frequency (TF-IDF) and how to find similar documents. The hands-on exercise looks at an example of analyzing text and introduces additional problems to solve in pre-processing text and documents.

Topics covered

Structured versus semi-structured versus unstructured data

Structuring raw text
Tokenization
Stemming and lemmatization
Stop words removal
Treating punctuation, casing, and numbers in the text
Creating a terms dictionary
Drawbacks of simple word frequency counts
Term frequency – inverse document frequency
Document similarity measure
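
The pipeline above (tokenization, stop words removal, TF-IDF weighting, document similarity) can be sketched in plain Python. The stop word list, the three toy documents, and the particular TF-IDF variant (relative term frequency times the log of the inverse document frequency) are assumptions for illustration; libraries typically use smoothed variants, and stemming is omitted here.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "on", "a", "an", "and", "of"}  # tiny illustrative list


def tokenize(text):
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    words = (w.strip(".,!?;:").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def tf_idf_vectors(documents):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Stock prices fell sharply.",
]
vectors = tf_idf_vectors(documents)
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

With this weighting, a term that appears in every document gets a weight of zero, which addresses the main drawback of raw word counts: very common words dominate without telling documents apart.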

Unsupervised Learning and k-means Clustering

Unsupervised learning at its core is about revealing the hidden structure of a dataset. You will not always be working with labeled data, that is, records tagged with an outcome label. For example, data collected on customers’ purchasing habits does not come with a label of ‘high-value customer’ or ‘low-value customer’; that label needs to be created. We teach the underpinnings of the k-means clustering algorithm, which finds the common attributes that separate one cluster from another. We can then use the clusters to categorize our data, for example grouping high-value customers who all have similar spending habits. Through a hands-on exercise, you will also learn how to approach an unsupervised learning challenge and how to define your cluster groups.

Topics covered

Real-world problems that unsupervised learning algorithms solve

The k-means clustering algorithm
Euclidean distance measure
Defining k
The Elbow Method
Strengths and limitations of k-means clustering
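
A minimal sketch of the k-means loop (assign each point to its nearest centroid by Euclidean distance, then move each centroid to the mean of its points) might look like the following. The starting centroids are passed in explicitly to keep the example deterministic; the function names, the fixed iteration count, and the toy points are all assumptions for illustration.

```python
from math import dist


def k_means(points, centroids, n_iter=10):
    """Run a fixed number of assign/update iterations of k-means."""
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


def inertia(centroids, clusters):
    """Within-cluster sum of squared distances, the quantity plotted in the elbow method."""
    return sum(
        dist(p, c) ** 2 for c, cluster in zip(centroids, clusters) for p in cluster
    )


# Two visually obvious groups of made-up 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(inertia(centroids, clusters))
```

Repeating the run for several values of k and plotting the inertia against k gives the curve used by the elbow method. The result depends on the starting centroids, which is one of the limitations of k-means.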

Math Fundamentals

Before talking about linear models, we set up the mathematical foundations of regression models. We start with some calculus fundamentals so that we can later transition seamlessly into the math behind finding the minimum of the cost function.

Topics covered

Introduction

Derivatives and gradients
Minima/maxima
Convexity of functions and why convexity matters

Optimizing the Cost Function

With the mathematical background in place, we build an intuition for what the cost function of a linear regression model should be. We frame our cost function and discuss how gradient descent finds its minimum. We also emphasize that this particular choice of cost function makes the problem a convex optimization problem, which eliminates the risk of getting stuck in a local minimum. We compare the batch, stochastic, and mini-batch approaches to minimizing the cost function.

Topics covered

Gradient descent
Batch gradient descent
Stochastic gradient descent
Mini-batch gradient descent
Global vs. local minima
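
The batch variant can be sketched for a one-feature linear model with a mean squared error cost. The function name, learning rate, step count, and toy data are assumptions chosen so that the example converges; they are not recommended defaults.

```python
def batch_gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y = w * x + b by minimizing mean squared error on the full dataset."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of (1/n) * sum(residual^2) with respect to w and b.
        grad_w = (2.0 / n) * sum(r * x for r, x in zip(residuals, xs))
        grad_b = (2.0 / n) * sum(residuals)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up data generated from y = 2x + 1 with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = batch_gradient_descent(xs, ys)
print(w, b)
```

Stochastic gradient descent would compute the same gradients from one randomly chosen example per step, and mini-batch gradient descent from a small random subset. Because this cost function is convex, all three head toward the same global minimum, although the stochastic variants take a noisier path.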

Evaluation of Regression Models

We discuss the different evaluation metrics for a regression model and in what scenarios each of them might be a good choice.

Topics covered

Mean absolute error

Root mean square error
R-squared and adjusted R-squared measure
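
For reference, the metrics above can be computed in a few lines of Python. The function names and the toy predictions are assumptions for illustration.

```python
from math import sqrt


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(y_true, y_pred, n_features):
    """R-squared penalized for the number of features in the model."""
    n = len(y_true)
    return 1.0 - (1.0 - r_squared(y_true, y_pred)) * (n - 1) / (n - n_features - 1)


# Made-up targets and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]
print(mean_absolute_error(y_true, y_pred))
print(root_mean_square_error(y_true, y_pred))
print(r_squared(y_true, y_pred))
print(adjusted_r_squared(y_true, y_pred, n_features=1))
```

Root mean square error penalizes large errors more heavily than mean absolute error, which matters when occasional large misses are costly.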

Predicting Prices of Real Estate using a Linear Regression Model

We will build a linear regression model to predict real-estate prices. We will see how adjusting the regularization penalty and the number of rounds of parameter updates can result in a substantial improvement in both the mean absolute error and its standard deviation in 10-fold cross-validation.

Topics covered

Data cleaning
Dropping low-quality features
Selecting the strongest features using Pearson correlation
Adjusting the learning rate, number of training epochs, L2 regularization weight

Regularization

Modern compute resources incentivize overfitting, and even practitioners fall for it. We discuss the intuition behind regularization and the penalty parameter. We discuss the L1 and L2 penalties and give a quick overview of LASSO and ridge regression.

Topics covered

Regularization intuition

L1 regularization or LASSO
L2 regularization or Ridge regression
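
The different behavior of the two penalties is easiest to see in a deliberately simplified setting: one feature, no intercept, and a squared-error cost plus either penalty * w^2 (ridge) or penalty * |w| (LASSO). Under those assumptions both have closed-form solutions, sketched below with illustrative function names and made-up data.

```python
import math


def ridge_weight(xs, ys, penalty):
    """Minimizer of sum((y - w*x)^2) + penalty * w^2 for a single feature."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


def lasso_weight(xs, ys, penalty):
    """Minimizer of sum((y - w*x)^2) + penalty * |w| for a single feature.

    The solution is a soft-thresholding of the unpenalized estimate.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - penalty / 2.0, 0.0)
    return math.copysign(shrunk, sxy) / sxx


# Made-up data where the unpenalized least-squares weight is exactly 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for penalty in (0.0, 14.0, 28.0, 56.0):
    print(penalty, ridge_weight(xs, ys, penalty), lasso_weight(xs, ys, penalty))
```

Ridge shrinks the weight toward zero without ever reaching it, while a large enough L1 penalty sets the weight exactly to zero, which is why LASSO is often used for feature selection.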

Collaborative and Content-based Recommendations

Recommender systems are all around us. We discuss collaborative and content-based recommenders at a high level, including how items are recommended in each case. We also discuss various strategies for building item and user profiles.

Topics covered

Collaborative versus content recommenders

The data structure of collaborative versus content-based recommenders
Building user profiles and item profiles

Measures of Similarity

Both collaborative and content-based recommenders rely on similarity, but how do we measure the similarity between two vectors? We discuss several approaches to measuring similarity and when to use each one.

Topics covered

Pearson’s correlation

Cosine similarity
N nearest neighbors
Weighted and centered metrics
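
The relationship between the two main measures above can be sketched directly: Pearson's correlation is the cosine similarity of the mean-centered vectors. The function names and the toy rating vectors are assumptions for illustration.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pearson_correlation(u, v):
    """Cosine similarity after subtracting each vector's own mean."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine_similarity([a - mean_u for a in u], [b - mean_v for b in v])


# Made-up ratings from two users on the same four items.
# The second user rates everything one point higher but has the same tastes.
harsh_rater = [1.0, 2.0, 1.0, 4.0]
generous_rater = [2.0, 3.0, 2.0, 5.0]

print(cosine_similarity(harsh_rater, generous_rater))
print(pearson_correlation(harsh_rater, generous_rater))
```

Centering removes each user's rating bias, so two users with the same preferences but different rating scales come out as perfectly correlated.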

Evaluation Metrics for Recommender Systems

We discuss the different scenarios in which a recommender system may be used. We explain the difference between a ranking problem and a regression problem and discuss which metrics are appropriate for each.

Topics covered

Mean absolute error

Root mean square error
Discounted Cumulative Gain (DCG) and normalized discounted cumulative gain (nDCG) for ranking evaluation
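
The ranking metrics above can be sketched as follows, using one common formulation in which the relevance at rank i is discounted by log2(i + 1). The function names and the relevance scores are assumptions for illustration, and other formulations (for example with exponential gain) exist.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance scores listed in ranked order."""
    return sum(
        rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0


# Made-up graded relevance of six recommended items, in the order shown to the user.
ranked_relevances = [3, 2, 3, 0, 1, 2]
print(dcg(ranked_relevances))
print(ndcg(ranked_relevances))
```

Unlike mean absolute error or root mean square error, these metrics reward placing the most relevant items near the top of the list, which is what matters in a ranking problem.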

Big Data Engineering

The first challenge of big data isn’t one of analysis, but rather of volume and velocity. How do you process terabytes of data in a reliable, relatively rapid way? We teach you the basics of MapReduce and the Hadoop Distributed File System (HDFS), the technologies that underlie Hadoop, a widely used distributed computing platform. We also introduce you to Hive, Mahout, and Spark, the next wave of distributed analysis platforms. You will learn how distributed computing works so that you can scale machine learning training to terabytes of data. The hands-on lab will take you through the process of setting up a Hadoop cluster to handle processing big data.

Topics covered

Distributed computing and cloud infrastructure

Hadoop
Hadoop Distributed File System
MapReduce
Hive
Mahout
Spark
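
To give a feel for the MapReduce programming model, here is a single-process Python sketch of the classic word count, split into map, shuffle, and reduce phases. It only imitates the data flow in memory; it does not use Hadoop or any of its APIs, and the function names and input lines are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word, 1) for word in document.lower().split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Made-up input records; on a cluster each could live on a different node.
documents = ["big data big ideas", "data beats algorithms"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

On a real cluster the map and reduce functions run in parallel on many machines, and the framework handles the shuffle, data placement, and failures.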

Real-Time/IoT

Often the data that we are working with is not sitting in a database or in files; it is being continuously streamed from a source. Network systems, sensor devices, 24-hour monitoring devices, and the like are constantly streaming and recording data. Learn how to handle the end-to-end process, from extracting the data, to processing it, to filtering out the important data and analyzing it on the fly in near real time. We take you through building your own end-to-end ETL (extract, transform, load) pipeline in the cloud. You will stream data from a source such as Twitter, credit card transactions, or a smartphone to an event ingestor, which processes the data and writes it out to cloud storage. You will then be able to read the data into Azure for analysis and processing.

Topics covered

Extract, transform, and load pipelines

Data ingestion
Event brokers
Stream storage
Azure Event Hubs
Stream Processing
Event processors
Access rights and access policies
Querying streaming data and analysis

Kaggle Capstone

You will apply your learning, knowledge, and skills of data science throughout each day of the bootcamp. We coach you throughout the week to put those new skills to the test with a real problem. Kaggle’s Titanic survival prediction competition is the perfect testing ground to cut your teeth on. You’ll compete against your fellow students, with the top 2-3 contenders receiving a special prize.

Topics covered

Data pre-processing
Data cleaning
Feature Engineering
Model Training
Model Tuning

Self-Directed Labs

The world of data science and data engineering is larger than we have time to cover in the bootcamp. We want you to be as equipped to tackle this world as possible, so we have written a 350+ page textbook filled with step-by-step tutorials introducing you to many different tools. You will get a copy of this book at the bootcamp, allowing you to learn this additional information at your own pace.

Topics covered

Azure SQL Database
HBase

Hadoop
HDInsight
Azure PowerShell
Mahout
Spark
Live Twitter Sentiment Analysis

Supplementary Topics

Your data science learning journey with us does not end with your bootcamp. We offer supplementary topics that build on what you learned in the bootcamp. These include many interesting topics with recorded video lectures, slide decks, and practical exercises that you can go through at your own pace.

Topics covered

Naïve Bayes classifier
Logistic regression classifier

Time series forecasting
R Shiny interactive dashboards
Advanced feature engineering
Advanced model validation techniques
Support vector machines
Acing data science interviews
