Building a Predictive Model in Azure Machine Learning Studio
A Step-By-Step Tutorial
This tutorial walks through building a classification model in Azure Machine Learning, following the same process as a traditional data mining framework. We will use the public Titanic dataset. From it, we can build a predictive model that classifies whether a passenger survived or died, based on that passenger’s demographic features and circumstances.
The Titanic dataset is one of the few datasets that suits both beginners and experts, because its complexity scales up with feature engineering. There are numerous public sources for the Titanic dataset; however, the most complete (and clean) version of the data can be obtained from Kaggle, specifically their “train” data.
The train Titanic data has 891 rows, each one pertaining to a passenger on the RMS Titanic on the night of its disaster. The dataset also has 12 columns, each recording an attribute of the passenger’s circumstances and demographics: passenger ID, passenger class, age, gender, name, number of siblings and spouses aboard, number of parents and children aboard, fare price, ticket number, cabin number, port of embarkation, and whether they survived the ordeal or not.
For additional reading, a repository of biographies pertaining to everyone aboard the RMS Titanic can be found here (complete with pictures).
Most algorithms cannot handle missing values, and those that can treat them inconsistently from one another. To address this, we must make sure our dataset contains no missing, “null”, or “NA” values. There are many ways to address missing values. We will cover three: replacement, exclusion, and deletion.
We already used exclusion when we made a conscious decision not to use the “Cabin” attribute and dropped the column entirely.
Replacement is the most versatile and preferred method because it allows us to keep our data. It also limits collateral damage: one bad cell does not force us to throw away the good values in the rest of its row. In replacement, missing numerical values can easily be replaced with a summary statistic such as the mean, median, or mode. The median is usually preferred for machine learning because it is less affected by outliers than the mean. However, filling many cells with the same value piles extra observations onto that one value. This inflates its frequency, so it will distort a histogram or bar graph far more than it will a box plot.
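As a rough illustration of median replacement outside of Azure ML Studio, here is a minimal Python sketch. The ages in the list are made up for the example and are not taken from the Titanic data, and the helper name fill_with_median is hypothetical. It fills each missing value with the median of the known values, then counts frequencies to show how the filled value becomes over-represented.

```python
from collections import Counter
from statistics import median

# Made-up ages for illustration only; None marks a missing value.
ages = [22, 38, None, 35, None, 54, 2, 27, None, 14]


def fill_with_median(values):
    """Return a copy of values with each None replaced by the median of the known values."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]


filled = fill_with_median(ages)

# The median itself is unchanged by the fill, but its frequency is inflated:
# one real observation at the median plus three filled-in copies.
counts = Counter(filled)
print(filled)
print("median after fill:", median(filled))
print("frequency of the fill value:", counts[median(filled)])
```

The list still has its original length, so no rows were lost, but a bar graph of these ages would now show a spike at the fill value that was not present in the observed data.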
We will cover deletion later in this section.
You can get better performance out of your models by retraining the algorithm with different parameter settings and keeping the settings that score best on your chosen metric. For a more in-depth analysis on how to optimize a model in Azure ML, click here for a video tutorial.
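The idea of retraining with different parameters can be sketched in a few lines of Python as a simple grid search. This is not how Azure ML Studio implements its parameter sweep; it only illustrates the loop. The parameter names num_trees and max_depth, and the toy_score function that stands in for "train a model and return its metric", are assumptions invented for this example.

```python
from itertools import product


def grid_search(train_and_score, param_grid):
    """Try every parameter combination and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = train_and_score(**params)  # higher is assumed to be better
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(num_trees, max_depth):
    """Hypothetical stand-in for training a model and measuring its metric.

    It is a made-up function that peaks at num_trees=100 and max_depth=6.
    """
    return 1.0 - abs(num_trees - 100) / 1000 - abs(max_depth - 6) / 100


grid = {"num_trees": [20, 100, 500], "max_depth": [2, 6, 12]}
best_params, best_score = grid_search(toy_score, grid)
print(best_params, best_score)
```

In practice, train_and_score would train the model on the training split and return the metric measured on held-out data, so that the winning settings are not simply the ones that memorize the training set.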
We optimized the two-class boosted decision forest and raised its ROC AUC from 0.817 to a maximum of 0.861. See if you can beat it!
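For readers unfamiliar with the metric, ROC AUC can be read as the probability that a randomly chosen positive example (a survivor) receives a higher score than a randomly chosen negative example, with ties counted as half. The following Python sketch computes it that way by comparing every positive and negative pair. The labels and scores are made up for illustration and are unrelated to the figures above, and the function name roc_auc is our own.

```python
def roc_auc(labels, scores):
    """Compute ROC AUC by pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 class labels (1 = positive class).
    scores: iterable of model scores, where higher means "more likely positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))


# Made-up example: three of the four positive/negative pairs are ranked correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking. The pairwise loop is quadratic in the number of rows, which is fine for a dataset of this size.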
So now you have a predictive model. What next? You can choose to deploy your model to the web. Click here for a video tutorial.