Data Science and Data Engineering

Bootcamp Curriculum

The best data science bootcamp curriculum, hands down. It is designed by data science practitioners to get you started with real-world data science in just one week.

Raja’s delivery is impressive, to say the least, and it makes the long days seem short. This is the best training I’ve received in my professional career.

Mitchell Browning
Application Developer at TransCanada

…Most important, we discussed several examples of PRACTICAL applications, and I have several great ideas of how to apply these learnings at my company. I feel confident that I can put some of these concepts to use right away.

Michelle (Garrett) Scarbrough
BCA Core Estimating Data Scientist at Boeing

I thought 10 hour days would go slow, but they actually went fast (and am glad they set the tone on day one that we would be there the entire time).

Mike Dwyer
Sr SQL Developer at Société Générale

I have never learned so much practical knowledge in a week. I can go back to work and begin adding value with these new skills immediately.

Matt Digel
Product Manager - Analytic Capabilities & Algorithms at Nike

It was a great 5 day workshop with getting some hands on experience and understanding the roots of data science. It made me work towards how data can be applied to solve real world problems.

Lesha Bhansali
Program Manager at Microsoft

The instructors made the bootcamp the most enjoyable and informative class I’ve ever taken.

Jirui Qin
Quant Research at Federal Reserve Bank of New York

This was drastically different than a lot of my education, and it’s exactly why I signed up for the course.

Sofia Auer
Data Scientist at National Research Council Canada

This is the greatest gift I could’ve given myself to jump forward in a field I’m passionate about.

Erika Rasmussen
Business Operations Manager at Cisco

I feel like I am walking out with a solid understanding of data science as well as the resources to go further. The bootcamp transformed me from not knowing where to start to being able to build my own predictive models!

Tiffany Li
Senior Data Scientist at Criteo

I would highly recommend this course to anyone seeking a very solid foundation in data science. It is intense and fast paced, but worthwhile and enjoyable.

Jenna Butler
Senior Software Engineer at Microsoft

This kind of information is not available in online tutorials or courses because it needs a deeper engagement to understand the intricacies involved.

Manish Kumar Gupta
Senior Software Engineer at Microsoft

The bootcamp is very well organized and you leave with plenty of material to continue learning data science on your own.

Eunice Yang
UX Researcher at PROS

Fantastic boot camp!!!

I was particularly impressed with Raja’s grasp of the subject matter as well as the passion he has.

Raji Easwaran
Principal Group Program Manager at Microsoft

I really enjoyed being able to learn data science concepts in a hands-on environment.

Melia Mearns
Workforce Analyst at Zappos Family of Companies

I like that it gave me the tools to go develop skills in specific areas (R, ML modeling in particular) on my own.

Meghna Suresh
Head of Product at Replicant

I enjoyed the mix of hands on coding with R and high level data science principles.

Alan Benson
Web data analyst at Apple

Module | Lesson | Topics | Description | Timeline | Format | Sample / Sample Video | Duration
Data Science Fundamentals | Importance of 'Data' in Data Science | Sampling. Quantity, quality, and variety of data. Privacy, access control, legal, ethical and security issues in data acquisition. | Beginners in data science often put too much emphasis on machine learning algorithms while ignoring the fact that garbage data will only produce garbage insights. Data quality is one of the most overlooked issues in data science. We discuss challenges and best practices in data acquisition, processing, transformation, cleaning and loading. | Day 1 | Interactive discussion | | hour
Data Science Fundamentals | Data Exploration and Visualization | Various data visualization and exploration techniques and packages. Interpreting boxplots, histograms, density plots, scatterplots and more. Segmentation and Simpson's paradox. | Through a series of hands-on exercises and a lot of interactive discussions, we will learn how to dissect and explore data. We take different datasets and discuss the best way to explore and visualize data. We form hypotheses and test their validity using various data exploration and visualization techniques. | Day 1 | Interactive discussion, R, Python | | hours
Data Science Fundamentals | Feature Engineering | Calculating features from numeric features. Binning, grouping, quantizing, ratios and mathematical transforms for features in different applications. | Feature engineering is one of the most important aspects of building machine learning models. We will practice engineering new features and cleaning data before reporting or modeling. | Day 1 | Interactive discussion, R, Python | |
Data Science Fundamentals | Storytelling with Data | Communicating actionable insights. Various possible interpretations of plots. Storytelling with data. Bias in data acquisition, transformation, cleaning, modeling and interpretation. | Experienced data professionals will tell you that storytelling is one of the most important skills for communicating insights. We will practice the skill of storytelling while presenting analysis. | Day 1 | Interactive discussion, R, Python | |
Predictive Analytics | Modeling a Real World Predictive Analytics Problem | Face detection. Adversarial machine learning. Spam detection. Translating a real world problem to a machine learning problem. | Taking a real world business problem and translating it into a machine learning problem takes a lot of practice. We will take some common applications of predictive analytics around us and discuss the process of turning each one into a predictive analytics problem. | Day 1 | Interactive discussion | | hour
Predictive Analytics | Supervised Learning and Classification | Supervised learning vs. unsupervised learning. Features, predictors, labels, target values. Training, testing, evaluation. | Supervised learning is about learning from historical data. We will understand some of the key assumptions in predictive modeling. We will discuss the scenarios in which the distribution of future data will not remain the same as that of the historical data. | Day 2 | Interactive discussion | None | 2 hours
Predictive Analytics | Decision Tree Classification | Decision tree learning. Impurity measures: entropy and Gini index. Varying decision tree complexity by varying model parameters. | We will start building predictive models by understanding decision tree classification in depth. We will begin with how nodes are split in a decision tree and with impurity measures such as entropy and the Gini index. We will also see how to vary the complexity of a decision tree by changing parameters such as maximum depth, number of observations on the leaf node, and the complexity parameter. | Day 2 | Interactive discussion, R, Python | None | 2 hours
Predictive Analytics | Building and Evaluating a Classification Model | Train/test split. Training, prediction and evaluation. Varying model hyperparameters such as maximum depth, number of observations on the leaf node, and minimum number of observations for splitting. | We will build a classification model using decision tree learning. We will learn how to create train/test datasets, train the model, evaluate the model and vary model hyperparameters. | Day 2 | R, Python | None | 1 hour
Model Evaluation and Selection | Evaluation Metrics for Classification Models | Confusion matrix, false/true positives and false/true negatives. Accuracy, precision, recall, F1-score. ROC curve and area under the curve. | Once we have understood how to build a predictive model, we will discuss the importance of defining the correct evaluation metrics. We will use real-world anecdotes to discuss the circumstances under which one metric is a better choice than another. | Day 2 | Interactive discussion | None | 2 hours
Model Evaluation and Selection | Generalization and Overfitting | Generalization. Overfitting. Bias and variance. Repeatability. Bootstrap sampling. | Building a model that generalizes well requires a solid understanding of the fundamentals. We will understand what we mean by generalization and overfitting. We will also discuss bias and variance and how the complexity of a model affects both. | Day 2 | Interactive discussion | None | 2 hours
Model Evaluation and Selection | Tuning of Model Hyperparameters | Model complexity. Bias and variance. K-fold cross validation. Leave one out cross validation. Time series cross validation. | How do we build a model that generalizes well and is not overfit? The answer is by adjusting the complexity of the machine learning model to the right level. This process, known as hyperparameter tuning, is one of the most important skills you will learn at the bootcamp. Using the decision tree learning parameters as an example, we will observe how a model changes when we grow a deeper or a shallower tree. We will do practical hyperparameter tuning exercises using cross validation. | Day 2 | Azure ML, R, Python | None | 1 hour
Ensemble Methods | Bagging, Boosting and Random Forest | Binomial distribution, The importance of randomization and generalization in modeling, Sampling with replacement, Sampling without replacement, Bootstrapped sampling, Bagging, Boosting, Random Forests, AdaBoost | After building a predictive model and understanding the pitfalls of choosing the wrong evaluation metrics, we move to somewhat advanced learning techniques. We discuss the importance of ensemble techniques in machine learning and how they help us get machine learning models that are more generalized. The module goes in-depth into sampling with and without replacement, bootstrapped sampling, bagging, random forest and boosting. We discuss how ensemble methods use many different random subsets of data and combine the strengths of many models to learn from many varied examples. The hands-on exercise will make you think about how to choose an appropriate number of trees and the sampling techniques appropriate for the given problem. | Day 2, Day 3 | R, Python, Azure ML | | hours
Unstructured Data | Introduction to Text Analytics | Structured versus semi-structured versus unstructured data, Structuring raw text, Tokenization, Stemming and lemmatization, Stop word removal, Treating punctuation, casing, and numbers in text, Creating a terms dictionary, Drawbacks of simple word frequency counts, Term frequency-inverse document frequency, Document similarity measure | You will not always work with fully structured data. Many applications of data science require analysis of unstructured data such as text. We will teach you the basics of converting text into structured data, and how to model documents to find their similarities and recommend similar documents. We cover the important steps in pre-processing text in order to create textual features and prepare text for modeling or analysis. This includes stemming and lemmatization, treating punctuation and other textual components, stop word removal, and more. We also demonstrate how to model documents using term frequency-inverse document frequency and how to find similar documents. The hands-on exercise looks at an example of analyzing text and introduces additional problems to solve in pre-processing text and documents. | Day 3 | R, Python, Azure ML | | hours
Unsupervised Learning | Unsupervised Learning and Clustering | Real-world problems that unsupervised learning algorithms solve, The k-means clustering algorithm, Euclidean distance measure, Defining k, The elbow method, Strengths and limitations of k-means clustering | As one of the oldest branches of machine learning, unsupervised learning at its core is about revealing the hidden structure of any dataset. You will not always be working with labeled data or records tagged with a label outcome. For example, data collected on customers’ purchasing habits does not come with a label outcome of ‘high value customer’ or ‘low value customer’; that label needs to be created. We teach the underpinnings of the k-means clustering algorithm to solve this problem of finding the common attributes that separate one cluster group from another. We can then use this to categorize our data based on clusters, such as high value customers who all have similar spending habits. You will also learn how to approach an unsupervised learning challenge through a hands-on exercise and how to define your cluster groups. | Day 3 | R, Python | | hours
Recommenders | Recommender Systems and Ranking | Collaborative versus content recommenders, Data structure of collaborative versus content, Text recommenders, Search and recommenders, Pearson’s correlation, Cosine similarity, N nearest neighbors, Mean absolute error for recommenders, Root mean square error for recommenders, Discounted cumulative gain | In many ways, recommenders are the first and greatest problem of modern machine learning, and they are the engines which drive modern commerce. You will learn about the two types of recommenders, collaborative and content, and how to blend them to get the best of both worlds. We cover different types of recommenders, such as text recommendation and search ranking. You’ll also learn the similarity measures for ranking recommended items and the prediction methods. We teach you how to evaluate a recommender and which metrics are appropriate for this. You will then build and deploy a recommendation engine in Azure Machine Learning. | Day 3, Day 4 | Azure ML | | hours
A/B Testing | Online Experimentation and A/B Testing | A/B and multivariate tests, A/B metrics, Hypothesis testing in A/B tests, Type one and two errors, Confidence intervals, Conducting a t-test, Pitfalls in online experimentation | Online experimentation is perhaps the most misused of data science techniques. Many errors can creep into an experiment if it is not set up and conducted properly. We will walk through the best practices for designing and evaluating A/B and multivariate tests. We discuss how to choose the appropriate metrics, how to detect and avoid errors, and how to properly interpret test results. Learn diverse examples of A/B and multivariate tests, hypothesis testing in A/B tests, type one and two errors, confidence intervals, t-tests, and more. Take part in an interactive game in class to illustrate different A/B and multivariate tests. | Day 4 | R, Python | | hours
Regression | Regression and Predictive Analytics | Linear regression and basic math notation, Cost function, Gradient descent, Batch gradient descent, Stochastic gradient descent, Regularizing regression models, Mean absolute error, Root mean square error | Regression and classification are the two sides of the supervised learning coin. You will learn how to adapt the techniques you have learned in predictive analytics and classification to the challenge of predicting numbers such as prices, revenues, click rates, and so on. We give you an overview of how regression models learn, teach you how to evaluate them, and demonstrate the use of regularization to prevent overfitting. Learn regression methods such as gradient descent, batch gradient descent, stochastic gradient descent, and the differences between these. Learn about the cost function in gradient descent and how it is used to find the optimal model fit. We end with a hands-on exercise to solidify how regression works and use it to predict house prices. | Day 4 | R, Python, Azure ML | | hours
Data Engineering | Big Data Engineering | Distributed computing and cloud infrastructure, Hadoop, Hadoop Distributed File System, MapReduce, Hive, Mahout, Spark | The first challenge of big data isn’t one of analysis, but rather of volume and velocity. How do you process terabytes of data in a reliable, relatively rapid way? We teach you the basics of MapReduce and Hadoop Distributed File System, the technologies which underlie Hadoop, the most popular distributed computing platform. We also introduce you to Hive, Mahout and Spark, the next wave of distributed analysis platforms. Learn how distributed computing works so that you can scale machine learning training to terabytes of data. The hands-on lab takes you step by step through setting up a Hadoop cluster to process big data. | Day 5 | Azure | | hours
Data Engineering | Real-time/IoT | Extract, transform, and load pipelines, Data ingestion, Event brokers, Stream storage, Azure Event Hub, Stream processing, Event processors, Access rights and access policies, Querying streaming data and analysis | Often the data that we are working with is not sitting in a database or files; it is being continuously streamed from a source. Network systems, sensor devices, 24-hour monitoring devices, and the like are constantly streaming and recording data. Learn the end-to-end process of handling this data, from extracting it, to processing it, to filtering out important data and analyzing it on the fly in near real-time. We take you through building your own end-to-end ETL (extract, transform, load) pipeline in the cloud. You will stream data from a source such as Twitter, credit card transactions, or a smartphone to an event ingestor. This processes the data and writes it out to cloud storage. You will then be able to read the data into Azure for analysis and processing. | Day 5 | Azure ML, Azure Stream Analytics | | hours
Data Engineering | Deploying a Predictive Model as a Service | REST Endpoints, APIs | A user interface into a model makes it easier to see how it would work in the real world, where a new customer enters the system and data is collected on their age, gender, and so on. We teach you direct and simple processes for setting up real-time prediction endpoints in the cloud, allowing you to access your trained model from anywhere in the world. We walk you through constructing your own endpoints and show a few practical demos of how this can be used to expose a predictive model to anyone you’d like to use it and see how it takes new data and makes a prediction. | Day 5 | Azure ML | None | 1 hour
Bootcamp Preparation | Introduction to Big Data, Data Science and Predictive Analytics | Big Data, ETL Pipelines, Data Mining, Predictive Analytics | We introduce you to the wide world of Big Data, throwing back the curtain on the diversity and ubiquity of data science in the modern world. We also give you a bird's eye view of the subfields of predictive analytics and the pieces of a big data pipeline. | Pre-Bootcamp | | None | hours
Bootcamp Preparation | Fundamentals of Data Mining | Dataset types, Data preprocessing, Similarity, Data exploration | All great learning opportunities are built on a solid foundation. This session is jam-packed with all the background information, technical terminology, and basic knowledge that you will need to hit the ground running on the first day of the bootcamp. | Pre-Bootcamp | | None | hours
Bootcamp Preparation | Introduction to R Programming | R basics, R data types, R language features, R visualization | Here we introduce the basics of the R programming language. R is a free, open-source statistical programming platform. It is designed to make many of the most common data processing tasks as simple as possible. With this knowledge, you'll be able to engage fully with the hands-on exercises in the class. | Pre-Bootcamp | R | | 2 hours
Bootcamp Preparation | Introduction to Azure Machine Learning | Azure ML basics, Azure ML preprocessing, Azure ML visualization | Azure Machine Learning Studio is a fully featured graphical data science tool in the cloud. You will learn how to upload, analyze, visualize, manipulate, and clean data using the clean and intuitive interface of Azure ML. | Pre-Bootcamp | Azure ML | | 1.5 hours
Continued Learning | Kaggle Capstone | Feature Engineering, Model Training, Model Evaluation, Model Tuning | You will apply your learning, knowledge and skills of data science throughout each day of the bootcamp. We coach you throughout the week to put those new skills to the test with a real problem. Kaggle's Titanic survival prediction competition is the perfect testing ground to cut your teeth on. You'll compete against your fellow students, with the top 2-3 contenders receiving a special prize. | Day 1 | R, Python, Azure ML | None | 5 days
Continued Learning | Naive Bayes | Conditional Probability, Bayes' Rule, Independence, Naive Bayes | Naive Bayes is one of the most popular and widely used classification algorithms, particularly in text analysis. It is also a simple, fast, and small algorithm suitable for use on datasets of any size. We teach you how Naive Bayes works, why it works, and when it is likely to break down. | Post-Bootcamp | R, Python | | hour
Continued Learning | Logistic Regression | Cost Functions, Logit Function, Decision Boundaries | Logistic Regression is one of the oldest and best understood classification algorithms. While not suitable for every application, it is fast to run and cheap to store. We will teach you how logistic regression fits a dataset to make predictions, as well as when and why to use it. | Post-Bootcamp | R, Python, Amazon ML | | hour
Continued Learning | Introduction to NoSQL Databases | CAP theorem, NoSQL, HBase | With the massive increase in velocity and volume of data, even the largest and fastest SQL database lags under the load of millions of requests per second. We teach you how NoSQL databases solve this problem, sacrificing a small amount of consistency for a massive increase in durability. | Post-Bootcamp | Azure | | hour
Continued Learning | Self Directed Labs | Azure SQL Database, HBase, Hadoop, HDInsight, Azure PowerShell, Mahout, Spark, Live Twitter Sentiment Analysis | The world of data science and data engineering is larger than we have time to cover in the bootcamp. We want you to be as equipped to tackle this world as possible, so we have written a 350+ page textbook filled with step by step tutorials introducing you to many different tools. You will get a copy of this book at the bootcamp, allowing you to learn this additional information at your own pace. | Post-Bootcamp | Azure, Amazon, Hadoop, Spark | | - 4 weeks
Continued Learning | Live Practice Webinars | Numerous data science topics from Time Series Forecasting, to Churn Prediction, to Resume Preparation, and more. | Your learning does not stop after the bootcamp. You’ll be able to tune into a live webinar and keep practicing your skills with a walk-through example or exercise on a new topic every two weeks. Master your art and strengthen your skills with regular practice. The webinars will also be recorded to view at a more convenient time. | | R, Python | None | 1-1.5 hours every 2 weeks
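
The Decision Tree Classification lesson listed above names entropy and the Gini index as impurity measures for splitting nodes. As a rough illustration of what those measures compute, here is a minimal Python sketch. It is not taken from the course materials: the function names and the spam/ham toy labels are assumptions made up for this example.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the fraction of observations that fall in each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def weighted_impurity(left, right, measure):
    """Impurity after a split, weighting each child node by its size."""
    total = len(left) + len(right)
    return (len(left) / total) * measure(left) + (len(right) / total) * measure(right)


# Hypothetical parent node: 5 "spam" and 5 "ham" observations.
parent = ["spam"] * 5 + ["ham"] * 5

# Hypothetical candidate split of that node into two children.
left = ["spam"] * 4 + ["ham"] * 1
right = ["spam"] * 1 + ["ham"] * 4

for name, measure in (("entropy", entropy), ("gini", gini)):
    before = measure(parent)
    after = weighted_impurity(left, right, measure)
    print(f"{name}: before split = {before:.3f}, after split = {after:.3f}")
```

A split is useful when the weighted impurity of the child nodes is lower than the impurity of the parent. A tree learner would compare many candidate splits this way and keep the one with the largest reduction.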
