Building Custom R Models in Azure Machine Learning Studio
A Step-By-Step Tutorial
Azure Machine Learning Studio has a couple dozen built-in machine learning algorithms. But what if you need an algorithm that is not there, or want to customize an existing one? Azure ML can use any R- or Python-based machine learning package and its associated algorithms through the “create model” module. With it, you can leverage the entire open-source R and Python communities.
The Bike Sharing dataset is a great dataset for exploring Azure ML’s new R-script and R-model modules. The R-script module allows for easy feature engineering from date-times, and the R-model module lets us take advantage of R’s randomForest library. The data can be obtained from Kaggle; this tutorial specifically uses their “train” dataset.
The Bike Sharing dataset has 10,886 observations, each one pertaining to a specific hour from the first 19 days of each month from 2011 to 2012. The dataset consists of 11 columns that record information about bike rentals: date-time, season, working day, weather, temp, “feels like” temp, humidity, windspeed, casual rentals, registered rentals, and total rentals.
There is an untapped wealth of prediction power hidden in the “datetime” column. However, it needs to be converted from its current form. Conveniently, Azure ML has a module for running R scripts, which can take advantage of R’s built-in functionality for extracting features from the date-time data.
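The kind of extraction described above can be sketched outside Azure ML as follows. This is a minimal illustration in plain Python rather than the R script the tutorial runs inside the module; the function name `extract_datetime_features`, the chosen features, and the assumed "YYYY-MM-DD HH:MM:SS" timestamp format are assumptions made for the example.

```python
from datetime import datetime


def extract_datetime_features(timestamp: str) -> dict:
    """Split a 'YYYY-MM-DD HH:MM:SS' string into separate model features.

    The timestamp format is an assumption about how the datetime column
    is stored; adjust the format string if the data differs.
    """
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "hour": dt.hour,          # time of day, 0-23
        "weekday": dt.weekday(),  # day of week, Monday=0 ... Sunday=6
        "month": dt.month,        # 1-12
        "year": dt.year,
    }


features = extract_datetime_features("2011-01-01 05:00:00")
print(features)
```

Each extracted value would become its own column, so the model can pick up hourly and weekly rental patterns that are invisible in the raw timestamp.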
This dataset has only one observation where weather = 4. Since weather is a categorical variable, R will throw an error if that observation ends up in the test data split. This is because R expects the number of levels for each categorical variable to equal the number of levels found in the training data split. Therefore, the observation must be removed.
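A small sketch of that filtering step, again in plain Python rather than the tutorial's R script. The rows below are made-up toy values, and `drop_rare_levels` with its `min_count` threshold is a hypothetical helper introduced only for illustration.

```python
from collections import Counter

# Toy rows (invented for illustration); only one has weather == 4.
rows = [
    {"weather": 1, "count": 16},
    {"weather": 2, "count": 40},
    {"weather": 1, "count": 32},
    {"weather": 4, "count": 164},
    {"weather": 3, "count": 13},
    {"weather": 2, "count": 1},
    {"weather": 3, "count": 5},
]


def drop_rare_levels(rows, column, min_count=2):
    """Remove rows whose categorical level appears fewer than min_count times."""
    level_counts = Counter(row[column] for row in rows)
    return [row for row in rows if level_counts[row[column]] >= min_count]


filtered = drop_rare_levels(rows, "weather")
print(sorted({row["weather"] for row in filtered}))
```

After filtering, every remaining weather level appears at least twice, so a level can no longer exist in the test split without also having a chance to appear in the training split.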
Before creating our random forest, we must identify columns that add little-to-no value for predictive modeling. These columns will be dropped.
Since we are predicting total count, the registered and casual bike rental columns must be dropped. Together, these values add up to the total count, so a model that sees them would score well but tell us nothing: it would simply learn to sum the two columns. One could instead train separate models to predict casual and registered bike rentals independently. Azure ML makes it easy to add such models to the experiment after creating one for total count.
The third candidate for removal is the datetime column. Each observation has a unique date-time, so this column will just add noise to our model, especially since we have already extracted all the useful information (day of week, time of day, etc.).
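The leakage argument can be checked directly. The sketch below uses a few illustrative rows (values chosen for the example, not results from the tutorial) and confirms that the target is fully determined by the two columns in question.

```python
# Illustrative rows: casual + registered always equals the total count.
rows = [
    {"casual": 3, "registered": 13, "count": 16},
    {"casual": 8, "registered": 32, "count": 40},
    {"casual": 5, "registered": 27, "count": 32},
]

# A "model" that just sums the two columns reproduces the target exactly.
predictions = [row["casual"] + row["registered"] for row in rows]
errors = [pred - row["count"] for pred, row in zip(predictions, rows)]

target_is_leaked = all(error == 0 for error in errors)
print(target_is_leaked)
```

A perfect score obtained this way says nothing about how weather or time of day affect demand, which is why both columns are excluded before training.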
Now that the columns to drop have been chosen, drag in the “Project Columns” module to drop datetime, casual, and registered. Launch the column selector and select “All columns” from the dropdown next to “Begin With”. Change “Include” to “Exclude” using the dropdown, and then select the columns we are dropping.
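For readers who want to see the effect of the exclusion outside the Studio interface, the following is a rough equivalent in plain Python. It is not how the “Project Columns” module is implemented; `exclude_columns` and the sample row are hypothetical and exist only to show the "begin with all columns, then exclude" logic.

```python
def exclude_columns(rows, excluded):
    """Start from all columns and drop the ones named in `excluded`."""
    excluded = set(excluded)
    return [
        {name: value for name, value in row.items() if name not in excluded}
        for row in rows
    ]


# One made-up row with a subset of the dataset's columns.
rows = [
    {
        "datetime": "2011-01-01 05:00:00",
        "season": 1,
        "weather": 2,
        "casual": 0,
        "registered": 1,
        "count": 1,
    },
]

projected = exclude_columns(rows, ["datetime", "casual", "registered"])
print(list(projected[0].keys()))
```

Only the predictors and the total count remain, which matches what the column selector produces when “Exclude” is applied to the three chosen columns.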