Data Science Bootcamp
Comprehensive bootcamp to get you started with practical data science and data engineering
Next Kickoff:
August 16
Online
Every Tuesday, 9AM – 12PM PDT
Select a start date that fits your schedule.
Trusted by Leading Companies
Learn Practical Data Science
We have carefully designed our data science bootcamp to bring you the best practical exposure in the world of data science, programming, and machine learning. With our comprehensive curriculum, interactive learning environment, and challenging real-world exercises, you’ll learn through a practical approach.
Our curriculum includes the right mix of lectures and hands-on exercises, along with office hours and mentoring. Our data science training employs a business-first approach to help you stand out in the market.
Key Features
- Instructor-led training
- Dedicated office hours
- Strong alumni network
Hands-on Coding Environment
- Code from your browser
- Programming tutorials
- Additional exercises
Continued Learning
Your journey doesn’t end with the bootcamp! We have a rich repository with tons of resources to keep you going. Our tutorials, demos, and exercises will be available even after the program to help you practice your newfound data science skills.
- Post-bootcamp tutorials
- Publicly available datasets
- Blogs and learning material
Our Bootcamp Curriculum
Data Exploration, Visualization, and Feature Engineering
We begin this module by discussing how the performance of algorithms depends directly on the quality of the data, and we cover the challenges and best practices in data acquisition, processing, transformation, cleaning, and loading.
After that, through a series of hands-on exercises and interactive discussions, we learn how to dissect and explore data to improve our understanding of it.
We form hypotheses and test their validity using data exploration and visualization. In the end, we discuss how feature engineering techniques are used to extract the most relevant features for building models.
The module provides hands-on practice with in-class and supplementary exercises in R/Python and quizzes to validate your knowledge.
Topics:
- Understanding how and why “data beats algorithms”
- Importance of data cleaning, data pre-processing, and business domain knowledge
- Interpretation of boxplots, histograms, density plots, scatterplots, and more
- Data visualization with ggplot2 in R, and the Matplotlib and seaborn libraries in Python
- Feature engineering on categorical/numerical features
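To give a flavor of these exercises, here is a minimal Python sketch of exploration and simple feature engineering; the CSV file and column names are made up for illustration:

```python
# A sketch of exploratory analysis and feature engineering in pandas/seaborn.
# The CSV file and column names here are made up for illustration.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")      # hypothetical dataset

# Inspect summary statistics and missing values before any modeling
print(df.describe())
print(df.isna().sum())

# Boxplot of a numeric feature split by a categorical one
sns.boxplot(data=df, x="segment", y="annual_spend")
plt.show()

# Simple feature engineering: one-hot encode the categorical column
df = pd.get_dummies(df, columns=["segment"], drop_first=True)
```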
Storytelling with Data
The end goal of a data exploration and visualization exercise is to communicate insights to others, most often a business audience. In this module, we discuss how to steer our data exploration and visualization so that it answers business questions and communicates insights to everyone, regardless of where they lie on the data literacy scale.
We begin this module with an interactive discussion on simple data visualizations like boxplots and scatterplots, after wrapping hypothetical business contexts around them. We interpret the visualizations within those business contexts, ask business questions, and discuss what business insights we can draw.
We practice data visualization and learn the art and skill of storytelling while presenting analysis.
The module provides hands-on practice with in-class and supplementary exercises in R/Python and quizzes to validate your knowledge.
Topics:
- Understanding that the goal of data visualization is to communicate insights
- Interactive discussion on various interpretations of plots
- Learning how to identify data visualizations most appropriate to answer business questions
Predictive Modeling for Real World Problems
Taking a real-world business problem and translating it into a machine learning problem takes a lot of practice. We begin this module by discussing some of the typical business applications of predictive analytics around us and the process of turning the underlying business problem into a predictive analytics problem.
Once we have given an overview of how predictive modeling works, we discuss the key difference between supervised and unsupervised machine learning and take a deep dive into supervised learning. We discuss how predictive modeling rests on the key assumption that future data will resemble historical data, and we also cover scenarios where this assumption no longer holds.
The module provides hands-on practice with in-class and supplementary exercises in R/Python/Azure and quizzes to validate your knowledge.
Topics:
- Real-world applications of predictive modeling in action
- Steps involved in translating a real-world problem to a machine learning problem
- Difference between supervised learning and unsupervised learning
- Introduction to classification as a type of supervised learning
- Understanding features, labels, target values, training, testing, and evaluation
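As a taste of the workflow, here is a minimal Python sketch of the features/labels/train/test split, using a toy dataset bundled with scikit-learn; the bootcamp exercises use their own datasets:

```python
# Features, labels, train/test split, and evaluation on a bundled toy dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # features and labels

# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```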
Decision Tree Learning
We continue our discussion on classification as one of the most common types of supervised learning. We introduce the decision tree algorithm and discuss how it is used for building classification models.
We discuss the decision tree algorithm for classification in detail, understanding how a node gets split in a decision tree and how entropy and the Gini index are calculated to measure node impurity. We also cover the idea of varying the complexity of a decision tree by changing algorithm parameters such as maximum depth, the number of observations on a leaf node, the complexity parameter, etc.
The module provides hands-on practice with in-class and supplementary exercises in R/Python/Azure and quizzes to validate your knowledge.
Topics:
- Intuitive understanding of decision tree algorithm
- Understanding terms like root node, splitting node, splitting criteria, levels, leaf node, number of observations on a leaf node, the minimum number of observations for splitting
- Calculating node impurity measures: entropy, misclassification error, and Gini index
- Understanding the greedy algorithm and how a decision tree is grown
- Varying decision tree complexity by varying model parameters
- Tuning model hyperparameters such as maximum depth and minimum samples per leaf node
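For illustration, here is a toy Python calculation of the impurity measures above, for a node containing 8 observations of one class and 2 of the other:

```python
# Toy calculation of node impurity for class counts of 8 and 2.
import math

counts = [8, 2]
total = sum(counts)
p = [c / total for c in counts]                    # class proportions

entropy = -sum(pi * math.log2(pi) for pi in p if pi > 0)
gini = 1 - sum(pi ** 2 for pi in p)
misclassification = 1 - max(p)

print(f"entropy: {entropy:.3f}")                   # 0.722
print(f"gini: {gini:.3f}")                         # 0.320
print(f"misclassification error: {misclassification:.3f}")  # 0.200
```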
Evaluation of Classification Models
As data scientists, our job does not end with model building. We also need to know how well the model will perform on new, unseen data.
In this module, we discuss different metrics to evaluate classification models. We explain a confusion matrix and how to calculate evaluation metrics like accuracy, precision, and recall. We discuss real-world examples of why accuracy may not be the best metric to look at and under what circumstances precision or recall might be better.
Once we have developed a good understanding of evaluation metrics, we discuss the concepts of generalization and overfitting. We discuss the ideas of bias and variance and how the complexity of a model impacts the bias-variance trade-off. We introduce the concept of cross-validation and discuss k-fold cross-validation in detail.
The module provides hands-on practice with in-class and supplementary exercises in R/Python/Azure and quizzes to validate your knowledge.
Topics:
- Limitations of accuracy as a metric for model evaluation
- Introduction to confusion matrix
- Understanding true positive, false positive, true negative, and false negative
- Deriving accuracy, precision, recall, and F1 score
- When to use accuracy, precision, and recall for evaluation
- Generalization and overfitting of models
- Bias-variance trade-off
- K-fold cross-validation
- Understanding the ROC curve and its use for model comparison
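As a quick illustration, here is how the metrics above can be derived from made-up confusion-matrix counts in Python:

```python
# Deriving the evaluation metrics above from made-up confusion-matrix counts.
tp, fp, tn, fn = 40, 10, 45, 5   # true/false positives and negatives

accuracy  = (tp + tn) / (tp + fp + tn + fn)               # 0.85
precision = tp / (tp + fp)                                # 0.80
recall    = tp / (tp + fn)                                # ~0.89
f1        = 2 * precision * recall / (precision + recall) # ~0.84

print(accuracy, precision, recall, f1)
```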
Tuning of Model Hyperparameters
How do we build a model that generalizes well and does not overfit? The answer is by fine-tuning the model complexity to the right level. This process is formally known as “hyperparameter tuning,” and it is one of the essential skills for a data scientist. We introduce hyperparameter tuning using classification models as an example.
We discuss decision tree hyperparameters like max depth, minimum samples per split, and minimum samples per leaf node. We adjust these parameters to obtain different decision tree models representing deeper or shallower trees and compare their performance on evaluation metrics. We also discuss grid search, an automated hyperparameter tuning method that searches for optimal hyperparameters by testing all possible combinations.
The module provides hands-on practice on hyperparameter tuning using cross-validation with R/Python/Azure exercises.
Topics:
- Understanding model complexity, bias, and variance
- Difference between model parameters and hyperparameters
- Decision tree hyperparameters: max depth, minimum samples per split, minimum samples per leaf node
- K-fold, leave–one–out, and nested cross-validation
- Cross-validation with time-series data
- Grid search for auto hyperparameter tuning
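For a flavor of grid search, here is a minimal sketch using scikit-learn's GridSearchCV over the decision tree hyperparameters above; the grid values are arbitrary:

```python
# A grid-search sketch over decision tree hyperparameters,
# scored with 5-fold cross-validation; the grid values are arbitrary.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "max_depth": [2, 4, 6, 8],
    "min_samples_split": [2, 10, 50],
    "min_samples_leaf": [1, 5, 20],
}

# Tries every combination in the grid and cross-validates each one
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```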
Ensemble Methods, Bagging and Random Forest
In the previous modules, we have built a solid understanding of bias, variance, and generalization. In this module, we introduce the idea that using predictions from not just one but multiple models, formally called an ensemble, improves generalization.
Before discussing this seemingly counter-intuitive idea, we explain bootstrap sampling and the binomial distribution, which are key to understanding why ensemble learning performs well.
We discuss bootstrap aggregation, or bagging, and explain how a bagged decision tree model is built using multiple random samples of rows from the training data. We discuss the idea of feature/column randomization, explain how feature randomization helps overcome the greediness of decision tree learning, and make the case for Random Forest.
The module provides hands-on practice in R/Python/Azure on selecting the appropriate number of trees, the number of random features, and other tuning parameters to build a Random Forest.
Topics:
- Understanding the idea of ensemble learning
- Binomial distribution and explaining why ensemble learning performs well
- Review of bias/variance, overfitting, and generalization
- Sampling with/without replacement and bootstrap sampling
- A quick review of decision tree splits
- Bagging and bagged decision trees
- Random forest model and column randomization trick
- Explaining why random forest gives more generalized models
- Random forest hyperparameter tuning
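As an illustration, here is a minimal scikit-learn sketch showing the two knobs discussed above, the number of trees and the number of random features per split; the values shown are arbitrary:

```python
# A random forest sketch showing the main tuning knobs:
# the number of trees and the number of random features tried per split.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=200,      # number of bagged trees in the ensemble
    max_features="sqrt",   # column randomization: features considered per split
    random_state=42,
)
print(cross_val_score(rf, X, y, cv=5).mean())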
Boosting
Boosting is an immensely powerful and understandably popular technique. Boosting involves an iterative process to adaptively change training data distribution by focusing more on previously misclassified records. We discuss the fundamental ideas behind boosting.
We also understand how one can alter the sampling distribution for each round of boosting. After building an intuitive understanding of boosting, we introduce AdaBoost as an example. We explain the mechanics of AdaBoost: updating the weights of the training data, altering the sampling distribution, and updating the weights of the models in the ensemble.
We also discuss the strengths, weaknesses, and potential pitfalls of boosting. The module provides hands-on practice in R/Python to build boosted decision tree classifiers.
Topics:
- Boosting intuition
- Altering a sampling distribution
- Update of weights of training data points and models in the ensemble
- Adaptive boosting or AdaBoost
- Penalty function
- Strengths and weaknesses of boosting
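For illustration, here is a minimal AdaBoost sketch in scikit-learn (assuming version 1.2+, where the weak learner is passed as `estimator`); the ensemble size is arbitrary:

```python
# An AdaBoost sketch: an ensemble of decision stumps, where each round
# re-weights previously misclassified training points.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # the weak learner (a stump)
    n_estimators=100,
    random_state=42,
)
print(cross_val_score(ada, X, y, cv=5).mean())
```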
Online Experimentation and A/B Testing
Online experimentation provides a scientific, evidence-based process, rather than an intuitive reaction, for assessing new ideas for a website, a marketing campaign, or an email.
We kick off this module with a group activity to discuss which of two given creatives, newsletters, call-to-action buttons, personas, or keywords performed better. We introduce A/B testing and multivariate testing. We introduce online experiments, explain the concept of test and control, and discuss hypothesis testing fundamentals without getting into too much math.
In this context, we discuss the various steps in an online experiment emphasizing the importance of each step. We also discuss the potential pitfalls in an online experimentation pipeline.
The module provides hands-on practice with in-class and supplementary exercises in R/Python and quizzes to validate your knowledge.
Topics:
- Some interesting online experiments that defy intuition
- Introduction to A/B testing and multivariate testing
- Difference between online and offline metrics
- Introduction to online experiments, tests, and control
- Fundamentals of hypothesis testing
- Type I and Type II errors and interactions
- Understanding Z-test, t-test, confidence interval, and p-value
- Steps in online experimentation
- Pitfalls of online experimentation
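As a taste of the statistics involved, here is a hedged Python sketch of a two-sample proportion z-test on made-up conversion counts, using the statsmodels library:

```python
# A two-sample proportion z-test on made-up A/B conversion counts.
from statsmodels.stats.proportion import proportions_ztest

conversions = [120, 150]    # control, treatment successes
visitors    = [2400, 2380]  # control, treatment sample sizes

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# Reject the null hypothesis of equal conversion rates if p < 0.05
```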
Deploying a Model as a Service
Understanding machine learning concepts is essential for model development. However, we must also learn how to make a machine learning model available to its end-users. Deployment is how we integrate a machine learning model into an existing production environment and make it available for decision-making based on input data.
In this module, we discuss how machine learning works in the cloud. We walk you through a step-by-step process of setting up real-time prediction endpoints in Azure Machine Learning Studio and give a demo of how the deployed model offers predictions for the input data.
The module is primarily a hands-on lab where you will learn how to run the model, deploy it, and test the web service in Azure ML Studio.
Topics:
- Introduction to machine learning in the cloud with practical demos
- Creating a training experiment and converting it to a predictive experiment in Azure ML Studio
- Deploying a predictive experiment as an Azure ML web service
- Connecting to an Azure ML web service with an API key
- Hands-on experiment to run the model, deploy, and test the web service
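For a flavor of what calling a deployed model looks like, here is a hedged Python sketch of posting input data to a prediction endpoint; the URL, API key, and input schema are placeholders, and the exact request format depends on how the web service was configured:

```python
# Posting input data to a deployed prediction endpoint over REST.
# The URL, API key, and input schema below are placeholders.
import requests

url = "https://<region>.azureml.net/<workspace>/score"  # placeholder endpoint
api_key = "<your-api-key>"                              # placeholder key

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}
payload = {"data": [[5.1, 3.5, 1.4, 0.2]]}  # one input row (assumed schema)

response = requests.post(url, json=payload, headers=headers)
print(response.json())                      # the model's predictions
```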
Text Analytics Fundamentals
We do not always work with fully structured data. Many applications of data science require analysis of unstructured data such as text.
In this module, we will discuss the basics of converting text into structured data and how to model documents to find their similarities and recommend similar documents. We discuss the crucial steps in pre-processing text to create text features and prepare the text for modeling or analysis, including stemming and lemmatization, treating punctuation and other text components, stop-word removal, and more.
We also demonstrate how to model documents using the term frequency-inverse document frequency (TF-IDF) and find similar documents.
The hands-on exercise in R/Python/Azure looks at an example of analyzing text and introduces additional problems to solve in pre-processing text/documents.
Topics:
- Understanding structured, semi-structured, and unstructured data
- Converting text into structured data
- Tokenization, stemming, lemmatization, stop words removal, treating punctuation, casing, and numbers in the text
- Terms dictionary, word embedding, and count vectorization
- Drawbacks of simple word frequency counts
- Term frequency-inverse document frequency (TF-IDF)
- Document similarity measure
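As an illustration, here is a minimal scikit-learn sketch of TF-IDF vectorization and document similarity on three toy documents:

```python
# TF-IDF vectorization and pairwise document similarity on toy documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "quarterly revenue grew ten percent",
]

# Tokenizes, removes English stop words, and weights terms by TF-IDF
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# The first two documents share a term, so they score closer to each
# other than either does to the third
print(cosine_similarity(tfidf).round(2))
```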
Unsupervised Learning with K-means Clustering
We don’t always get to work with labeled data. For example, data on customers’ purchasing habits do not come with labels such as ‘high-value customer’ or ‘low-value customer’; that label needs to be created.
Rather than supervised learning, which requires labeled datasets, we use unsupervised learning to reveal the hidden structure in a dataset and use it to create clusters or segments.
In this module, we discuss the underpinnings of the k-means clustering algorithm to solve the problem of finding the common attributes that separate one cluster or segment within the data from others.
We discuss how to approach an unsupervised learning challenge through a hands-on exercise and how to define your cluster groups.
The module provides hands-on practice with in-class and supplementary exercises in R/Python and quizzes to validate your knowledge.
Topics:
- Real-world problems that unsupervised learning algorithms solve
- Deep dive into K-means clustering algorithm
- Understanding Euclidean distance, intra-cluster, and inter-cluster distances
- Defining K, the estimated number of clusters to begin with
- Using the elbow method to find the optimal number of clusters
- Strengths and limitations of k-means clustering
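For a flavor of the elbow method, here is a minimal scikit-learn sketch that fits K-means for several values of K on synthetic data and prints the inertia (total intra-cluster distance):

```python
# The elbow method: fit K-means for several values of K on synthetic data
# and watch where the inertia stops dropping sharply.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1))  # the "elbow" should appear near k=4
```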
Linear Models for Regression
We begin the module with a discussion of calculus fundamentals that help in understanding the math behind finding the minimum of the cost function. We discuss the cost function for a linear regression model and how gradient descent finds its minimum. We compare the batch, stochastic, and mini-batch approaches to minimizing the cost function.
We discuss the different evaluation metrics for a regression model and appropriate scenarios for using them.
We discuss the intuition behind regularization and the penalty parameter. We discuss the L1 and L2 penalty and give a quick overview of lasso and ridge regression.
The module provides hands-on practice with in-class and supplementary exercises in R/Python/Azure and quizzes to validate your knowledge.
Topics:
- Review of calculus fundamentals, derivatives, gradients, minima, and maxima
- Convexity of functions and why convexity matters
- The cost function for a linear regression
- Batch, stochastic, and mini-batch gradient descent
- Global vs. local minima
- Evaluation of regression models: MAE, RMSE, R2, and adjusted R2
- Understanding the intuition behind regularization
- L1 penalty and lasso regression
- L2 penalty and ridge regression
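As an illustration of the math, here is a minimal NumPy sketch of batch gradient descent minimizing the mean squared error cost for a one-feature linear regression; the data is synthetic:

```python
# Batch gradient descent minimizing the mean squared error cost of a
# one-feature linear regression on synthetic data.
import numpy as np

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, 100)
y = 3 * X + 5 + rng.normal(0, 1, 100)      # true slope 3, intercept 5

w, b, lr = 0.0, 0.0, 0.01                  # parameters and learning rate
for _ in range(2000):
    y_hat = w * X + b
    grad_w = 2 * np.mean((y_hat - y) * X)  # dCost/dw
    grad_b = 2 * np.mean(y_hat - y)        # dCost/db
    w -= lr * grad_w                       # step against the gradient
    b -= lr * grad_b

print(round(w, 2), round(b, 2))            # should land close to 3 and 5
```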
Ranking and Recommendation Systems
From the next show on Netflix to suggestions for a new friend on Facebook, recommender systems are all around us. In this module, we discuss collaborative and content-based recommenders at a high level and discuss how items are recommended in each case.
We discuss various strategies for building item and user profiles. Recommender systems work on the concept of similarity, so we discuss some approaches to measuring similarity and when to use which similarity measure.
We discuss the different scenarios in which a recommender system may be used. We discuss the difference between a ranking problem and a regression problem and which metrics would be right for a given problem.
The module provides hands-on practice with in-class and supplementary exercises in R/Python/Azure and quizzes to validate your knowledge.
Topics:
- Collaborative versus content-based recommenders
- The data structure of collaborative versus content-based recommenders
- Building user profiles and item profiles
- User and item similarity, Pearson's correlation, and cosine similarity
- Mean absolute error (MAE), root mean square error (RMSE) for recommender evaluation
- Discounted Cumulative Gain (DCG) and nDCG for ranking evaluation
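For illustration, here is a minimal collaborative-filtering sketch computing cosine similarity between users on a tiny made-up ratings matrix:

```python
# Cosine similarity between users on a made-up ratings matrix
# (rows = users, columns = items; 0 means "not rated").
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

ratings = np.array([
    [5, 4, 0, 1],   # user A
    [4, 5, 1, 0],   # user B: tastes similar to A
    [1, 0, 5, 4],   # user C: tastes unlike A and B
])

# A and B come out highly similar, so items B liked but A has not seen
# become candidate recommendations for A
print(cosine_similarity(ratings).round(2))
```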
Big Data Engineering with Distributed Systems
A good data scientist should also possess good data engineering skills. This module introduces you to big data engineering and machine learning at scale.
We discuss the difference between scaling up and scaling out. We discuss distributed system architecture and what a typical modern enterprise’s big data architecture looks like.
We discuss the basics of MapReduce and the Hadoop Distributed File System (HDFS), the technologies that underlie Hadoop, the most popular distributed computing platform. We also introduce you to Hive, Mahout, and Spark, the next wave of distributed analysis platforms. You will learn how distributed computing works to scale machine learning training on terabytes of data.
The module includes a hands-on lab on setting up, step by step, a Hadoop cluster to handle big data processing with Microsoft Azure.
Topics:
- Data engineering for data scientists
- Machine learning at scale
- SaaS, PaaS, and IaaS
- Scaling up versus scaling out
- Distributed computing, cloud infrastructure, distributed computing frameworks
- Understanding Hadoop, MapReduce, HDFS, Hive, Mahout, Spark
- A modern enterprise big data architecture
- Data warehouses versus data lakes
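For a taste of the MapReduce model, here is a hedged PySpark sketch of the classic word count; the input path is a placeholder, and a running Spark environment is assumed:

```python
# Word count in PySpark, the classic "hello world" of the MapReduce model.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (spark.sparkContext.textFile("hdfs:///data/input.txt")  # placeholder
          .flatMap(lambda line: line.split())  # map: one token per word
          .map(lambda word: (word, 1))         # map: (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))    # reduce: sum counts per word

print(counts.take(10))
```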
Real-time Analytics and Internet of Things
Often the data we are working with is not sitting in a database or files but is continuously streamed from a source such as network systems, sensor devices, or 24-hour monitoring devices.
In this module, we will discuss how to manage the end-to-end process of streaming data: extracting the data, processing it, filtering it down to the essential records, and analyzing it on the fly in near real-time.
We will discuss what a typical event processing pipeline looks like. We will introduce the concepts of data ingestion and stream processing and explain terms like event hub, event brokers, and stream processor.
The module includes a hands-on lab to build an end-to-end ETL (extract, transform, load) pipeline in the cloud. Once you have the end-to-end pipeline, you will stream data from a source such as Twitter, credit card transactions, or a smartphone to an event ingestor that processes the data and writes it out to cloud storage. You will then be able to read the data into Azure for analysis and processing.
Topics:
- Intuitive understanding of real-time analytics and streaming data
- Understanding the ETL pipelines for real-time analytics
- Data ingestion, event hub, event brokers, and stream processor
- Working with Azure Event Hub for streaming data
- Access rights and access policies
- Querying streaming data and analysis
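As an illustration, here is a hedged sketch of sending events with the azure-eventhub Python SDK; the connection string, hub name, and sensor payload are placeholders:

```python
# Sending events to Azure Event Hubs with the azure-eventhub Python SDK.
# The connection string, hub name, and payload are placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<your-connection-string>",  # placeholder
    eventhub_name="<your-event-hub>",     # placeholder
)

with producer:
    batch = producer.create_batch()
    batch.add(EventData(json.dumps({"sensor": "temp-01", "reading": 21.7})))
    producer.send_batch(batch)  # downstream stream processors read from here
```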
Earn a Verified Certificate of Completion
Recommended by Practitioners
At the end of the bootcamp I think all of us are at the same place, so that’s the beauty of this program. You could come from any background because we are covering some diverse topics here, and making sure it’s a level playing field and again, going back to the motto of, hey, this is for everyone. Kapil Pandey, Analytics Manager at Samsung
It was a great experience for increasing my expertise in data science. The abstract concepts were explained well and always focused on real applications and business cases. The pace was adjusted as needed to let everyone follow the topics. The week was intense as there are many topics to cover, but the schedule was well managed to optimize everyone’s attention. Harris Thamby, Manager at Microsoft
What I enjoyed most about the Data Science Dojo bootcamp was the enthusiasm for data science from the instructors. Eldon Prince, Senior Principal Data Scientist at DELL
Highly valuable course condensed into a single week. Enough background is given to allow one to continue their learning and training on their own. Good energy from the instructors. It is clear that they have real industry experience working on problems. Ben Gawiser, Software Engineer at Amazon
I’m really impressed by the quality of the bootcamp. I came with high expectations and Data Science Dojo exceeded them. I highly recommend the bootcamp to anyone interested in Data Science! Marcello Azambuja, Engineering Manager at Uber
With the knowledge I’ve gained from this bootcamp I can further add value to my clients. Data Science Dojo is the only training that provides a lot of useful content, and now I can confidently make a predictive model in a few minutes. Iyinola Abosede-Brown, Senior Technology Consultant at KPMG
Taught by Practitioners
Learning Plans and Schedule
Late Summer 2023
Fall 2023
Winter 2024
Late Summer 2023
Only 1 seat remaining at this price.
30% OFF
Dojo
$2659
$3799
- Instructor-led training
- Course material
- Restricted access to learning platform