Data Science Bootcamp

Remote training designed to enhance learning and give students a practical
understanding of data science and machine learning.

Live Instructor
16 weeks
5-6 hours per week

Upcoming

Winter Cohort

Start Date

Jan 13, 2021

Timings (PST)

5 pm – 8 pm

Overview

Our remote data science training is taught by the same top-rated instructors who previously taught our 5-day, in-person bootcamp. The curriculum, exercises, collaboration tools, data science competition, learning tools, and projects are the same in both formats.

The bootcamp covers the machine learning and data science subjects you need to implement predictive models and data pipelines end-to-end: defining metrics, evaluating and tuning models, storytelling, understanding the underpinning theory and concepts, and ingraining the critical thinking process behind data science.

We focus on what matters – not the hype.


You will earn a data science certificate with 7 Continuing Education Units through
The University of New Mexico.

Scheduled Cohorts

Winter Cohort

Date

Jan 13, 2021
Timings (PST)
5 pm – 8 pm
Wednesday

Spring Cohort

Date

Mar 23, 2021
Timings (PST)
9 am – 12 pm
Tuesday

Summer Cohort

Date

June 9, 2021
Timings (PST)
5 pm – 8 pm
Wednesday

What you will learn

Beginners in data science often put too much emphasis on machine learning algorithms while ignoring the fact that garbage data will only produce garbage insights. Data quality is one of the most overlooked issues in data science. We discuss challenges and best practices in data acquisition, processing, transformation, cleaning and loading.
Topics covered
  • Challenges and best practices in data acquisition
  • Data processing and transformation
  • Data cleaning
  • Data loading
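To make this concrete, here is a minimal pandas sketch of the kind of cleaning and loading steps discussed in this module (the file and column names are hypothetical, not from the course materials):

```python
import pandas as pd

# Hypothetical raw file; column names are illustrative only.
df = pd.read_csv("customers_raw.csv")

# Inspect data quality before doing anything else.
print(df.isna().sum())    # missing values per column
print(df.dtypes)          # unexpected types often signal dirty data

# Typical cleaning steps.
df["age"] = pd.to_numeric(df["age"], errors="coerce")   # coerce bad entries to NaN
df = df.drop_duplicates()
df = df.dropna(subset=["age", "income"])                # drop rows missing key fields
df["signup_date"] = pd.to_datetime(df["signup_date"])   # normalize date formats
```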
Through a series of hands-on exercises and interactive discussions, we learn how to dissect and explore data. We take different datasets and discuss the best ways to explore and visualize them. We form hypotheses and test their validity using various data exploration and visualization techniques.
Topics covered
  • Various data visualization and exploration techniques and packages
  • Interpreting boxplots
  • Histograms
  • Density plots
  • Scatterplots
  • Segmentation and Simpson’s paradox
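For a taste of these exercises, a minimal sketch using matplotlib and seaborn's bundled "tips" dataset (the in-class datasets differ):

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")   # small demo dataset shipped with seaborn

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].boxplot(tips["total_bill"])                 # spot outliers and spread
axes[0].set_title("Boxplot: total_bill")
axes[1].hist(tips["total_bill"], bins=20)           # shape of the distribution
axes[1].set_title("Histogram")
axes[2].scatter(tips["total_bill"], tips["tip"])    # relationship between variables
axes[2].set_title("Scatterplot")
plt.tight_layout()
plt.show()
```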
Feature engineering is one of the most important aspects of building machine learning models. We will practice engineering new features and cleaning data before reporting or modeling.
Topics covered
  • Calculating features from numeric features
  • Binning
  • Grouping
  • Quantizing
  • Ratios and mathematical transforms for features in different applications
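A small illustrative pandas/NumPy sketch of the feature types listed above (the data is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [23000, 58000, 91000, 140000],
                   "debt":   [12000,  9000, 30000,  15000]})

# Ratio feature: debt-to-income is often more predictive than either column alone.
df["dti"] = df["debt"] / df["income"]

# Binning / quantizing a numeric feature into categories.
df["income_band"] = pd.cut(df["income"], bins=[0, 50_000, 100_000, np.inf],
                           labels=["low", "mid", "high"])

# Mathematical transform: log compresses heavy right tails.
df["log_income"] = np.log1p(df["income"])
print(df)
```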
Experienced data professionals will tell you that storytelling is one of the most important skills for communicating insights. We will practice the skill of storytelling while presenting analysis.
Topics covered
  • Communicating actionable insights
  • Various possible interpretations of plots
  • Storytelling with data
  • Bias in data acquisition, transformation, cleaning, modeling, and interpretation
Taking a real-world business problem and translating it into a machine learning problem takes a lot of practice. We take some common applications of predictive analytics around us and discuss the process of turning each into a predictive analytics problem.
Topics covered
  • Face detection
  • Adversarial machine learning
  • Spam detection
  • Translating a real-world problem to a machine learning problem
Supervised learning is about learning from historical data. We will understand some of the key assumptions in predictive modeling and discuss the scenarios in which the distribution of future data will not match the historical data.
Topics covered
  • Supervised learning vs. Unsupervised learning
  • Features
  • Predictors
  • Labels
  • Target values
  • Training
  • Testing
  • Evaluation
We will start learning to build predictive models by understanding decision tree classification in depth. We will begin with how nodes are split in a decision tree and impurity measures such as entropy and the Gini index. We will also understand how to vary the complexity of a decision tree by changing parameters such as maximum depth, the number of observations on a leaf node, and the complexity parameter.
Topics covered
  • Decision tree learning
  • Impurity measures: Entropy and Gini index
  • Varying decision tree complexity by varying model parameters
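As a worked example, both impurity measures can be computed directly from a node's labels (a minimal NumPy sketch):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini impurity of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

node = np.array([1, 1, 1, 0, 0, 0, 0, 0])   # 3 positives, 5 negatives
print(entropy(node))   # ~0.954 bits; 1.0 would be a 50/50 split
print(gini(node))      # ~0.469; 0.5 would be a 50/50 split
```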
We will build a classification model using decision tree learning. We will learn how to create train/test datasets, train the model, evaluate the model and vary model hyperparameters.
Topics covered
  • Train/test split
  • Training, prediction and evaluation
  • Varying model hyperparameters such as maximum depth
  • Number of observations on a leaf node
  • Minimum number of observations for splitting
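A minimal scikit-learn sketch of this exercise, using an off-the-shelf dataset rather than the bootcamp's own (illustrative only):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# The hyperparameters named above: depth, leaf size, and split size.
tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                              min_samples_split=20, random_state=42)
tree.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```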
Once we have understood how to build a predictive model, we will discuss the importance of defining the correct evaluation metrics. We will use real-world anecdotes to discuss the circumstances under which one metric might be better than another.
Topics covered
  • Confusion matrix
  • False/true positives and false/true negatives
  • Accuracy
  • Precision
  • Recall
  • F1-score
  • ROC curve and area under the curve
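A quick illustration of these metrics with scikit-learn (the labels and scores are made up):

```python
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true  = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred  = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.1, 0.2, 0.6, 0.3, 0.9, 0.8, 0.4, 0.2, 0.7, 0.1]  # predicted probabilities

print(confusion_matrix(y_true, y_pred))   # rows: actual, cols: predicted
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_score))  # uses scores, not hard labels
```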
Building a model that generalizes well requires a solid understanding of the fundamentals. We will understand what we mean by generalization and overfitting. We will also discuss the ideas of bias and variance and how a model's complexity impacts its bias and variance.
Topics covered
  • Generalization
  • Overfitting
  • Bias and variance
  • Repeatability
  • Bootstrap sampling
How do we build a model that generalizes well and is not overfit? The answer is by adjusting the complexity of the machine learning model to the right level. This process, known as hyperparameter tuning, is one of the most important skills you will learn at the bootcamp. Using the decision tree learning parameters as an example, we will observe how a model is impacted by growing a deeper or shallower tree. We will do practical hyperparameter tuning exercises using cross-validation.
Topics covered
  • Model complexity
  • Bias and variance
  • K-fold cross-validation
  • Leave-one-out cross-validation
  • Time series cross-validation
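A minimal sketch of hyperparameter tuning with cross-validated grid search in scikit-learn (the parameter grid is illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each candidate complexity setting is scored with 5-fold cross-validation.
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid={"max_depth": [2, 4, 8, None],
                                "min_samples_leaf": [1, 5, 20]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```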
Mathematical understanding of concepts is easier when we start by developing an intuition for the (maybe not so) complex math behind an apparently complex topic. Having built a solid understanding of bias, variance, and generalization, we explain why building a committee of models improves generalization. We also review math topics, such as bootstrap sampling and the binomial distribution, that are key to understanding why ensembles work so well.
Topics covered
  • Binomial distribution
  • Review of bias/variance
  • Overfitting and generalization
  • Sampling with/without replacement
  • Bootstrap sampling
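As a worked example of why ensembles help: if 25 weak learners are each right 60% of the time, the binomial distribution gives their majority vote much higher accuracy (independence is the idealized assumption behind the argument):

```python
from scipy.stats import binom

# 25 weak learners, each correct with probability 0.6, assumed independent.
n, p = 25, 0.6

# The majority vote is right when at least 13 members are right:
# P(X >= 13) = 1 - P(X <= 12), with X ~ Binomial(25, 0.6).
print(1 - binom.cdf(12, n, p))   # ~0.85, far better than any single learner
```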
Having understood bagging, we segue into the idea of feature/column randomization. We explain how feature randomization helps overcome the greediness of decision tree learning and make the case for Random Forest.
Topics covered
  • Quick review of decision tree splits
  • Column randomization trick and why it is helpful in building more generalized models
A hands-on exercise to select the appropriate number of trees, number of random features, and other tuning parameters in a Random Forest and variants of the technique.
Topics covered
  • Tuning parameters such as tree depth
  • Number of trees
  • Number of random features selected
  • Using R/Python libraries and Azure ML Studio to tune a model
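A minimal scikit-learn sketch of the tuning exercise (dataset and values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# The tuning knobs named above: number of trees, random features per split, depth.
for n_trees in [10, 100, 500]:
    rf = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt",
                                max_depth=None, random_state=42)
    score = cross_val_score(rf, X, y, cv=5).mean()
    print(n_trees, "trees ->", round(score, 4))
```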
Boosting is an immensely powerful and understandably popular technique. We discuss the fundamental ideas behind boosting. We also get an intuitive understanding of how one can alter the sampling distribution while sampling for each round of boosting.
Topics covered
  • Strength of weak learners
  • Boosting intuition
  • Altering a sampling distribution
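A tiny NumPy sketch of the sampling-distribution idea: up-weight the examples the current weak learner got wrong, then resample (the numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
weights = np.full(n, 1 / n)          # start with a uniform distribution
misclassified = np.array([2, 7])     # indices the current weak learner got wrong

weights[misclassified] *= 3.0        # up-weight the hard examples...
weights /= weights.sum()             # ...and renormalize to a distribution

# The next boosting round samples its training set from the altered distribution.
print(rng.choice(n, size=n, replace=True, p=weights))   # 2 and 7 appear more often
```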
Armed with an intuitive understanding of boosting, we pick AdaBoost as an example. We explain the mechanics of AdaBoost: the weight updates for the training data, altering the sampling distribution, and the weight updates for the models in the ensemble. We also discuss the strengths and weaknesses of boosting and its potential pitfalls.
Topics covered
  • AdaBoost
  • Update of weights of training data points and models in the ensemble
  • Penalty function
  • Strengths and weaknesses of boosting
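A minimal AdaBoost sketch with scikit-learn, using depth-1 trees ("stumps") as weak learners (note: recent scikit-learn versions use the `estimator` keyword; older ones used `base_estimator`):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Stumps are the classic weak learner for AdaBoost.
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, random_state=42)
print(cross_val_score(ada, X, y, cv=5).mean())
```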
Not always will you work with fully structured data. Many applications of data science require analysis of unstructured data such as text. We will teach you the basics of converting text into structured data, and how to model documents to find their similarities and recommend similar documents. We cover the important steps in pre-processing text in order to create textual features and prepare text for modeling or analysis. This includes stemming and lemmatization, treating punctuation and other textual components, stop word removal, and more. We also demonstrate how to model documents using term frequency-inverse document frequency and finding similar documents. The hands-on exercise looks at an example of analyzing text and introduces additional problems to solve in pre-processing text/documents.
Topics covered
  • Structured versus semi-structured versus unstructured data
  • Structuring raw text
  • Tokenization
  • Stemming and lemmatization
  • Stop word removal
  • Treating punctuation, casing, and numbers in text
  • Creating a terms dictionary
  • Drawbacks of simple word frequency counts
  • Term frequency – inverse document frequency
  • Document similarity measure
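A minimal TF-IDF and document-similarity sketch with scikit-learn (the documents are toy examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "a cat and a dog played",
        "stock markets fell sharply today"]

# Tokenization and stop word removal happen inside the vectorizer.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)   # rows: documents, cols: terms

print(cosine_similarity(X))     # the two pet documents score highest together
```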
Unsupervised learning at its core is about revealing the hidden structure of any dataset. You will not always work with labeled data or records tagged with a label outcome. For example, data on customers' purchasing habits does not come with a label of 'high value customer' or 'low value customer'; that label needs to be created. We teach the underpinnings of the k-means clustering algorithm to solve this problem of finding the common attributes that separate one cluster group from another. We can then use this to categorize our data based on clusters, such as high value customers who all have similar spending habits. Through a hands-on exercise, you will also learn how to approach an unsupervised learning challenge and how to define your cluster groups.
Topics covered
  • Real-world problems that unsupervised learning algorithms solve
  • The K-means clustering algorithm
  • Euclidean distance measure
  • Defining k
  • The Elbow Method
  • Strengths and limitations of k-means clustering
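A minimal k-means and elbow-method sketch with scikit-learn on synthetic blobs (illustrative only):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Elbow method: inertia (within-cluster sum of squares) versus k.
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

plt.plot(range(1, 9), inertias, marker="o")
plt.xlabel("k"); plt.ylabel("inertia")
plt.title("Elbow method: the bend suggests k = 4")
plt.show()
```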
Recommender systems are all around us. We discuss collaborative and content-based recommenders at a high level and how items are recommended in each case. We also discuss various strategies for building item and user profiles.
Topics covered
  • Collaborative versus content recommenders
  • Data structure of collaborative versus content-based recommenders
  • Building user profiles and item profiles
Both collaborative and content-based recommenders rely on similarity, but how do we measure the similarity between vectors? We discuss some approaches to measuring similarity and when to use which similarity measure.
Topics covered
  • Pearson’s correlation
  • Cosine similarity
  • N nearest neighbors
  • Weighted and centered metrics
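A small sketch of two of these measures with SciPy (the rating vectors are made up):

```python
import numpy as np
from scipy.spatial.distance import cosine
from scipy.stats import pearsonr

# Ratings of the same five items by two users.
u = np.array([5.0, 3.0, 4.0, 1.0, 2.0])
v = np.array([4.0, 3.0, 5.0, 2.0, 1.0])

print("cosine similarity  :", 1 - cosine(u, v))   # SciPy returns a distance
print("Pearson correlation:", pearsonr(u, v)[0])  # the mean-centered variant
```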
We discuss the different scenarios in which a recommender system may be used, the difference between a ranking problem and a regression problem, and which metrics are right for a given problem.
Topics covered
  • Mean absolute error
  • Root mean square error
  • Discounted Cumulative Gain (DCG) and nDCG for ranking evaluation
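A brief sketch of both framings with scikit-learn (the values are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, ndcg_score

# Regression framing: predicted vs. actual ratings.
actual    = np.array([4.0, 3.5, 5.0, 2.0])
predicted = np.array([3.8, 3.0, 4.5, 2.5])
print("MAE :", mean_absolute_error(actual, predicted))
print("RMSE:", np.sqrt(mean_squared_error(actual, predicted)))

# Ranking framing: how well does the predicted ordering match true relevance?
true_relevance = np.array([[3, 2, 3, 0, 1]])
scores         = np.array([[2.9, 0.4, 3.1, 0.0, 0.7]])
print("nDCG:", ndcg_score(true_relevance, scores))
```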
Design of experiments and hypothesis testing are among the most useful tools in data science. We kick off with a discussion of why online experimentation is needed in the first place and of the difference between online and offline metrics. We will have a group activity discussing hypothetical ‘Facebook’, ‘Amazon’, and ‘Google’ examples of online metrics.
Topics covered
  • A/B Testing
  • Multivariate tests
  • Some interesting online experiments that defy intuition
  • Online vs. offline metrics
Designing and running experiments depends on a good understanding of hypothesis testing fundamentals. We offer a quick overview of hypothesis testing with all the necessary concepts. We take a practical example and calculate confidence intervals with varying confidence levels for both small and large sample sizes. We explain the fundamentals intuitively without getting too involved in the mathematical details.
Topics covered
  • Control, treatment, and hypothesis testing
  • Type I and Type II errors; interactions
  • Confidence interval and p-values
  • Z-table and t-table
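A small SciPy sketch of t-based confidence intervals at varying confidence levels for a small sample (synthetic data; with a large sample a z-interval would be nearly identical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=15, size=20)   # small sample -> t-distribution
m, se = sample.mean(), stats.sem(sample)

for conf in (0.90, 0.95, 0.99):
    lo, hi = stats.t.interval(conf, df=len(sample) - 1, loc=m, scale=se)
    print(f"{int(conf*100)}% CI: ({lo:.1f}, {hi:.1f})")   # wider as confidence grows
```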
Running online experiments in the real world is both a science and an art. We discuss the various steps in an experiment and emphasize the importance of each step. We also discuss the potential pitfalls in an online experimentation pipeline.
Topics covered
  • Steps in online experimentation: Choosing treatment, control and factors
  • Sample size selection
  • Effect size
  • A/A tests
  • Logging and instrumentation
  • Segmentation and interpretation
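As one concrete step, sample size selection can be sketched with statsmodels' power-analysis tools (the conversion rates here are hypothetical):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Detecting a lift from a 10% to an 11% conversion rate.
effect = proportion_effectsize(0.10, 0.11)

n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                 power=0.8, alternative="two-sided")
print(f"~{n:.0f} users needed per arm")   # small effects demand large samples
```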
Before talking about linear models, we set up the mathematical foundations of regression models. We start with a discussion of some calculus fundamentals so that we can transition seamlessly into the math behind finding the minimum of the cost function.
Topics covered
  • Introduction
  • Derivatives and gradients
  • Minima/maxima
  • Convex functions and why convexity matters
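A tiny numeric sketch of these ideas: check a derivative by finite differences, and see it vanish at the minimum of a convex function:

```python
f  = lambda w: (w - 3) ** 2 + 2   # a convex cost function with minimum at w = 3
df = lambda w: 2 * (w - 3)        # its analytic derivative

# Numeric gradient check: (f(w+h) - f(w-h)) / 2h approximates the derivative.
w, h = 5.0, 1e-5
print((f(w + h) - f(w - h)) / (2 * h), df(w))   # both ~4.0

# At the minimum the derivative vanishes; convexity makes it the global minimum.
print(df(3.0))                                   # 0.0
```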
With the mathematical background in place, we intuitively work out what the cost function for a linear regression model should be. We frame our cost function and discuss how gradient descent finds its minimum. We also emphasize that this particular choice of cost function makes the problem a convex optimization and eliminates the risk of local minima. We compare the batch, stochastic, and mini-batch approaches to minimizing the cost function.
Topics covered
  • Gradient descent
  • Batch gradient descent
  • Stochastic gradient descent
  • Mini-batch gradient descent
  • Global vs. local minima
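A minimal NumPy sketch of batch gradient descent on a one-feature linear regression (synthetic data; the stochastic and mini-batch variants differ only in how many rows feed each update):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 200)
y = 3.0 * X + 4.0 + rng.normal(0, 1, 200)        # true slope 3, intercept 4

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):                            # batch: use all rows every step
    err = (w * X + b) - y                        # residuals under current parameters
    w -= lr * (2 / len(X)) * (err * X).sum()     # gradient of MSE w.r.t. w
    b -= lr * (2 / len(X)) * err.sum()           # gradient of MSE w.r.t. b
print(w, b)                                      # converges near (3, 4)
```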
We discuss the different evaluation metrics for a regression model and the scenarios in which each might be a good choice.
Topics covered
  • Mean absolute error
  • Root mean square error
  • R-squared and adjusted R-squared measure
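A quick scikit-learn sketch of these metrics, including a hand-computed adjusted R-squared (the numbers are made up):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([250_000, 310_000, 480_000, 195_000])   # e.g. house prices
y_pred = np.array([240_000, 330_000, 455_000, 210_000])

print("MAE :", mean_absolute_error(y_true, y_pred))          # robust to outliers
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))  # penalizes big errors

r2 = r2_score(y_true, y_pred)
n, p = len(y_true), 1                                        # observations, predictors
print("R^2 :", r2)
print("adj :", 1 - (1 - r2) * (n - 1) / (n - p - 1))         # adjusts for model size
```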
Modern compute resources make overfitting easy, and even practitioners fall for it. We discuss the intuition behind regularization and the penalty parameter, covering the L1 and L2 penalties with a quick overview of LASSO and Ridge regression.
Topics covered
  • Regularization intuition
  • L1 penalty and LASSO
  • L2 penalty and Ridge regression
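A minimal sketch contrasting the two penalties with scikit-learn on synthetic data (note the exact zeros LASSO produces on the uninformative features):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=100)   # only 2 informative features

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1: drives uninformative ones exactly to zero

print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))      # exact zeros on the noise features
```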
We will build a linear regression model for a real-estate price predictor. We will see how adjusting the regularization penalty and the number of rounds of parameter updates can substantially improve both the mean absolute error and its standard deviation on 10-fold cross-validation.
Topics covered
  • Linear regression model
  • Adjusting the regularization penalty and number of rounds to get a better model and improve the estimate (MAE and standard deviation)
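A sketch of this exercise using scikit-learn's California housing data as a stand-in for the real-estate dataset (the penalty values are illustrative):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = fetch_california_housing(return_X_y=True)   # stand-in real-estate dataset

for alpha in (0.1, 1.0, 10.0):
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    scores = -cross_val_score(model, X, y, cv=10,
                              scoring="neg_mean_absolute_error")
    print(f"alpha={alpha}: MAE {scores.mean():.3f} +/- {scores.std():.3f}")
```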
The first challenge of big data isn't one of analysis, but rather of volume and velocity. How do you process terabytes of data in a reliable, relatively rapid way? We teach you the basics of MapReduce and the Hadoop Distributed File System, the technologies which underlie Hadoop, the most popular distributed computing platform. We also introduce you to Hive, Mahout, and Spark, the next wave of distributed analysis platforms. Learn how distributed computing works so you can scale machine learning training to terabytes of data. The hands-on lab will take you step-by-step through setting up a Hadoop cluster to process big data.
Topics covered
  • Distributed computing and cloud infrastructure
  • Hadoop
  • Hadoop Distributed File System
  • MapReduce
  • Hive
  • Mahout
  • Spark
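The MapReduce idea itself can be sketched in a few lines of plain Python, word count being the canonical example (Hadoop distributes these same three phases across machines):

```python
from collections import defaultdict
from itertools import chain

docs = ["big data is big", "data pipelines move data"]

# Map: emit (word, 1) pairs -- in Hadoop each mapper sees one split of the input.
mapped = chain.from_iterable(((w, 1) for w in d.split()) for d in docs)

# Shuffle: group values by key across the cluster.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each key's values independently (hence parallelizable).
print({word: sum(counts) for word, counts in groups.items()})
# {'big': 2, 'data': 3, 'is': 1, 'pipelines': 1, 'move': 1}
```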
Often the data we are working with is not sitting in a database or files; it is being continuously streamed from a source. Network systems, sensor devices, 24-hour monitoring devices, and the like are constantly streaming and recording data. Learn how to handle the end-to-end process: extracting the data, processing it, filtering out the important data, and analyzing it on the fly, in near real time. We take you through building your own end-to-end ETL (extract, transform, load) pipeline in the cloud. You will stream data from a source such as Twitter, credit card transactions, or a smartphone to an event ingestor, which processes the data and writes it out to cloud storage. You will then be able to read the data into Azure for analysis and processing.
Topics covered
  • Extract, transform, and load pipelines
  • Data ingestion
  • Event brokers
  • Stream storage
  • Azure Event Hub
  • Stream processing
  • Event processors
  • Access rights and access policies
  • Querying and analyzing streaming data
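A minimal sketch of publishing events to Azure Event Hubs with the `azure-eventhub` Python SDK (the connection string and hub name are placeholders, not course values):

```python
from azure.eventhub import EventData, EventHubProducerClient

# Placeholders -- use your own namespace connection string and hub name.
producer = EventHubProducerClient.from_connection_string(
    conn_str="Endpoint=sb://<namespace>.servicebus.windows.net/;...",
    eventhub_name="<hub-name>")

with producer:
    batch = producer.create_batch()
    batch.add(EventData('{"user": 42, "amount": 18.50}'))  # e.g. a transaction event
    producer.send_batch(batch)   # downstream stream processors read from the hub
```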
A user interface into a model makes it easier to see how it would work in the real world, where a new customer enters the system and data is collected on their age, gender, and so on. We teach you direct, simple processes for setting up real-time prediction endpoints in the cloud, allowing you to access your trained model from anywhere in the world. We walk you through constructing your own endpoints and show a few practical demos of exposing a predictive model to anyone you'd like, so you can see how it takes new data and makes a prediction.
Topics covered
  • APIs
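The bootcamp publishes endpoints through the cloud tooling itself; as a generic illustration, a prediction API can be sketched with Flask (the model file and feature format are hypothetical):

```python
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:   # a previously trained model, e.g. scikit-learn
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]   # e.g. {"features": [34, 1, 52000]}
    prediction = model.predict([features]).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```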
We introduce you to the wide world of Big Data, pulling back the curtain on the diversity and ubiquity of data science in the modern world. We also give you a bird's-eye view of the subfields of predictive analytics and the pieces of a big data pipeline.
Topics covered
  • Big Data
  • ETL Pipelines
  • Data Mining
  • Predictive Analytics
All great learning opportunities are built on a solid foundation. This session is jam-packed with all the background information, technical terminology, and basic knowledge that you will need to hit the ground running on the first day of the bootcamp.
Topics covered
  • Dataset types
  • Data preprocessing
  • Similarity
  • Data exploration
Here we introduce the basics of the R programming language. R is a free, open-source statistical programming platform. It is designed to make many of the most common data processing tasks as simple as possible. With this knowledge, you’ll be able to engage fully with the hands-on exercises in the class.
Topics covered
  • R basics
  • R data types
  • R language features
  • R visualization
Azure Machine Learning Studio is a fully featured graphical data science tool in the cloud. You will learn how to upload, analyze, visualize, manipulate, and clean data using the clean and intuitive interface of Azure ML Studio.
Topics covered
  • Azure ML basics
  • Azure ML preprocessing
  • Azure ML visualization
You will apply your data science knowledge and skills throughout the bootcamp. We coach you to put those new skills to the test on a real problem: Kaggle's Titanic survival prediction competition is the perfect testing ground to cut your teeth on. You'll compete against your fellow students, with the top 2-3 contenders receiving a special prize.
Topics covered
  • Feature Engineering
  • Model Training
  • Model Tuning
Naive Bayes is one of the most popular and widely used classification algorithms, particularly in text analysis. It is also a simple, fast, and small algorithm suitable for use on datasets of any size. We teach you how Naive Bayes works, why it works, and when it is likely to break down.
Topics covered
  • Conditional Probability
  • Bayes’ Rule
  • Independence
  • Naive Bayes
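A minimal Naive Bayes spam-detection sketch with scikit-learn (the four "emails" are toy examples):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts  = ["win cash now", "meeting at noon", "free prize win", "lunch tomorrow?"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(texts)

# Multinomial NB estimates P(word | class) and combines them via Bayes' rule,
# assuming the words are conditionally independent given the class.
nb = MultinomialNB().fit(X, labels)
print(nb.predict_proba(vec.transform(["free cash prize"])))  # leans heavily spam
```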
Logistic Regression is one of the oldest and best understood classification algorithms. While not suitable for every application, it is fast to run and cheap to store. We will teach you how logistic regression fits a dataset to make predictions, as well as when and why to use it.
Topics covered
  • Cost Functions
  • Logit Function
  • Decision Boundaries
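A small sketch showing that scikit-learn's logistic regression really is a sigmoid over a linear score, which is what makes the decision boundary linear (synthetic data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

clf = LogisticRegression().fit(X, y)

# The model is sigmoid(w.x + b); the decision boundary is the line w.x + b = 0.
z = X @ clf.coef_.ravel() + clf.intercept_
manual = 1 / (1 + np.exp(-z))                            # the logit/sigmoid by hand
print(np.allclose(manual, clf.predict_proba(X)[:, 1]))   # True
```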
With the massive increase in the velocity and volume of data, even the largest and fastest SQL database lags under the load of millions of requests per second. We teach you how NoSQL databases solve this problem, sacrificing a small amount of consistency for a massive increase in availability and scale.
Topics covered
  • CAP theorem
  • NoSQL
  • HBase
The world of data science and data engineering is larger than we have time to cover in the bootcamp. We want you to be as equipped as possible to tackle this world, so we have written a 350+ page textbook filled with step-by-step tutorials introducing you to many different tools. You will get a copy of this book at the bootcamp, allowing you to learn this additional material at your own pace.
Topics covered
  • Azure SQL Database
  • HBase
  • Hadoop
  • HDInsight
  • Azure PowerShell
  • Mahout
  • Spark
  • Live Twitter Sentiment Analysis
Your learning does not stop after the bootcamp. You’ll be able to tune into a live webinar and keep practicing your skills with a walk-through example or exercise on a new topic every two weeks. Master your art and strengthen your skills with regular practice. The webinars will also be recorded to view at a more convenient time.
Topics covered
  • Numerous data science topics, from time series forecasting to resume preparation
Request detailed curriculum

Diversify your skillset

According to a Microsoft study, 91% of hiring managers agree that certification plays a major part in selecting candidates for data-related roles. Stand out from the competition and sign up now!

Are you a Singapore citizen or permanent resident?

Data Science Dojo’s bootcamp is endorsed by CITREP+. Any Singapore citizen or permanent resident who successfully completes the bootcamp may be eligible for a subsidy of up to 3,000 Singapore dollars from CITREP+.*

Why Data Science Dojo?

Comprehensive curriculum

Hands-down the best data science curriculum in the industry. Learn how to leverage data science and data engineering for business impact.

Experienced instructors

Our instructors are practicing data scientists. You will hear first-hand war stories from actual projects and learn best practices.

Verifiable certificate

Upon successful completion of the training, you will receive a verifiable data science certificate from The University of New Mexico (7 CEUs).

R, Python, Azure, AWS …

We are a vendor- and technology-neutral data science program. Training is conducted primarily in R and Python, with exercises in Azure and AWS.

Top-rated

Data Science Dojo is consistently ranked as a top data science bootcamp globally. Check out our video reviews and our reviews on SwitchUp and Course Report.

Trusted by industry

We have trained more than 5000 working professionals from 1500+ companies. Leading companies trust us for their data science training needs.

Instructor-led

We believe learning data science requires an attentive approach to teaching, so our immersive courses are taught live to give students the advantage of being coached through their experience in a dynamic and interactive environment.

Office hours

We are invested in your success. For our 16-week, remote data science training, we offer daily office hours. We ensure that no attendee is left behind.

Online learning

Our online learning platform contains a large number of video tutorials, R/Python Jupyter notebooks, and other resources so you can continue to learn.

Our alumni work at these companies

What our students say

Harris Thamby

Manager

“It was a great experience for increasing the expertise on data science. The abstract concepts were explained well and always focused on real applications and business cases.”
It just seems like I don’t have the whole picture and being here this week has really helped me understand what’s out there and what modern methods exist. I think...
Elizabeth Burke
Media Research Analyst
Credit Karma
Wonderful lectures and materials. Working on Kaggle competition together and competing with each other is definitely interesting and helpful in practicing the new knowledge.
Steven (Tong) Sun
Data Scientist
Accenture
At the end of the fifth day I think all of us are at the same place, so that’s the beauty of this program. You could come from any background...
Kapil Pandey
Analytics Manager
Samsung
A lot of great info and solid quick start on the latest on the field. Raja is a great teacher and obviously has a lot of passion and care for...
Jonathan Mathews
Silicon Architecture Engineer
Intel

Pricing

Dojo
Guru
16-Week Training
Immersive, hands-on learning experience
In-class exercises
R/Python hands-on exercises associated with all training modules
Instructor Support
Assistant instructors and online chat support throughout the bootcamp
50+ additional exercises
R/Python exercises to solidify key concepts covered in-class
Learning Platform Access
Single point access to videos, quizzes, slides and other learning material

Dojo: 1 month after bootcamp
Guru: One year

Video recording of training
Video recordings of sessions available within 2 – 3 days of every class.

Dojo: 1 month after bootcamp
Guru: One year

Quizzes
Assess your learning with all training modules

Dojo: 1 month after bootcamp
Guru: One year

Alumni Network
Access to 5000+ alumni network at 1500+ companies globally
Software subscriptions during the training
Subscriptions for cloud infrastructure and other software during the bootcamp
Unlimited access to all content
Online data science learning with video tutorials and R/Python Jupyter notebooks
Verified certificate from UNM
Earn a data science certificate with 7 Continuing Education Units through The University of New Mexico
1:1 Mentoring session with a practitioner
Guidance from an industry practitioner

Guru: 2 hours

Total

One-time payment

USD 2799 (regular price USD 3799, 27% off)

Full tuition

Registration for Online Data Science Certificate

Dojo plan

Or enter your payment details below

Dojo plan

Payment Options

60 easy installments

USD 81 / month

Financing through Skills Fund

Available to US customers only.

Apply now

Income sharing agreement (ISA)

Don’t pay anything until you land a job.

Coming soon

One-time payment

USD 2919 (regular price USD 3999, 27% off)

Full tuition

Registration for Online Data Science Certificate

Guru plan

Or enter your payment details below

Guru plan

Payment Options

60 easy installments

USD 81 / month

Financing through Skills Fund

Available to US customers only.

Apply now

Income sharing agreement (ISA)

Don’t pay anything until you land a job.

Coming soon
Looking for a career in data science? Join Practicum, which includes everything in Guru, plus...

Frequently Asked Questions

Yes, students must attend 80% of the lectures in order to qualify for The University of New Mexico certificate and receive the continuing education units.

Yes, you can receive 7 CEUs after completing the bootcamp.

Once you have completed the training, you will be issued a certificate that you can print or add to your LinkedIn profile, confirming that the course was completed through the University of New Mexico Continuing Education.