Data Wrangling and Feature Engineering in R

In February I had the privilege of traveling to Amsterdam to teach a great group of folks Data Science. As is so often the case, I felt I learned as much from the students as they learned from me. For example, one of the students asked for some R programming assistance in the area of Data Wrangling and Feature Engineering. The scenario in question really intrigued me. I knew how I could solve the problem using traditional non-functional programming techniques (e.g., using for loops), but I was looking for something more elegant.

More Features Faster with dplyr

In the hotel that evening I fired up RStudio and started noodling on the problem using my current go-to solution for Data Wrangling & Feature Engineering in R – the mighty dplyr package. I had so much fun working through the scenario that I decided to start a new YouTube series specifically devoted to the subject. Here’s some example code from the video series showing dplyr in action.

library(dplyr)    # %>%, mutate(), left_join(), select()
library(stringr)  # str_extract()

#======================================================================
# Add the new feature for the Title of each passenger
#
train <- train %>%
  mutate(Title = str_extract(Name, "[a-zA-Z]+\\."))

table(train$Title)



#======================================================================
# Condense titles down to small subset
#
titles.lookup <- data.frame(Title = c("Mr.", "Capt.", "Col.", "Don.", "Dr.",
                                      "Jonkheer.", "Major.", "Rev.", "Sir.",
                                      "Mrs.", "Dona.", "Lady.", "Mme.", "Countess.", 
                                      "Miss.", "Mlle.", "Ms.",
                                      "Master."),
                            New.Title = c(rep("Mr.", 9),
                                          rep("Mrs.", 5),
                                          rep("Miss.", 3),
                                          "Master."),
                            stringsAsFactors = FALSE)
View(titles.lookup)

# Replace Titles using lookup table
train <- train %>%
  left_join(titles.lookup, by = "Title")
View(train)

train <- train %>%
  mutate(Title = New.Title) %>%
  select(-New.Title)
View(train)
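
One thing to watch with the lookup-table approach: left_join() leaves New.Title as NA for any title that is missing from the lookup, so overwriting Title with it silently loses those values. Here is a minimal sketch of a guard, run on a made-up four-row data frame. The toy and toy.lookup names and the "Cpl." passenger are my own inventions for illustration and are not part of the Titanic data.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data standing in for the Titanic training set
toy <- data.frame(
  Name = c("Braund, Mr. Owen Harris",
           "Heikkinen, Miss. Laina",
           "Uruchurtu, Don. Manuel E",
           "Smith, Cpl. John"),  # invented row; "Cpl." is absent from the lookup
  stringsAsFactors = FALSE
)

# Deliberately incomplete lookup table (assumption for this sketch)
toy.lookup <- data.frame(
  Title     = c("Mr.", "Don.", "Miss."),
  New.Title = c("Mr.", "Mr.", "Miss."),
  stringsAsFactors = FALSE
)

toy <- toy %>%
  mutate(Title = str_extract(Name, "[a-zA-Z]+\\.")) %>%
  left_join(toy.lookup, by = "Title") %>%
  # Fall back to the raw title when the lookup has no match
  mutate(Title = coalesce(New.Title, Title)) %>%
  select(-New.Title)

# Titles the lookup did not cover, worth inspecting before modeling
setdiff(toy$Title, toy.lookup$New.Title)
```

Whether to keep the raw title, map it to a catch-all value, or extend the lookup is a judgment call. The point is simply to find the gaps before they turn into NAs.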

Features the Old School Way

Now compare the above elegant (if I do say so myself ;-)) code with the following code from my Intro to Data Science series:


# Expand upon the relationship between `Survived` and `Pclass` by adding the new `Title` variable to the
# data set and then explore a potential 3-dimensional relationship.

# Create a utility function to help with title extraction
extractTitle <- function(name) {
  name <- as.character(name)

  if (length(grep("Miss.", name)) > 0) {
    return ("Miss.")
  } else if (length(grep("Master.", name)) > 0) {
    return ("Master.")
  } else if (length(grep("Mrs.", name)) > 0) {
    return ("Mrs.")
  } else if (length(grep("Mr.", name)) > 0) {
    return ("Mr.")
  } else {
    return ("Other")
  }
}

titles <- NULL
for (i in 1:nrow(data.combined)) {
  titles <- c(titles, extractTitle(data.combined[i,"name"]))
}
data.combined$title <- as.factor(titles)

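# Re-map titles to be more exact. Note: extractTitle() above only returns
# "Miss.", "Master.", "Mrs.", "Mr." or "Other", so these re-mappings only take
# effect if `titles` holds finer-grained labels (e.g., "Dona.", "Col.").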
titles[titles %in% c("Dona.", "the")] <- "Lady."
titles[titles %in% c("Ms.", "Mlle.")] <- "Miss."
titles[titles == "Mme."] <- "Mrs."
titles[titles %in% c("Jonkheer.", "Don.")] <- "Sir."
titles[titles %in% c("Col.", "Capt.", "Major.")] <- "Officer"
table(titles)


# Make title a factor
data.combined$new.title <- as.factor(titles)


# Collapse titles based on visual analysis
indexes <- which(data.combined$new.title == "Lady.")
data.combined$new.title[indexes] <- "Mrs."

indexes <- which(data.combined$new.title == "Dr." | 
                 data.combined$new.title == "Rev." |
                 data.combined$new.title == "Sir." |
                 data.combined$new.title == "Officer")
data.combined$new.title[indexes] <- "Mr."

Beautiful!
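
To be fair to base R, the row-by-row loop is not the only option there either. As a rough sketch, the same first-match-wins logic as extractTitle() can be vectorized by applying the patterns in reverse priority order, so that higher-priority titles overwrite lower-priority ones. The names.sample vector is hypothetical, and fixed = TRUE is my own addition so that the "." is matched literally rather than as a regex wildcard.

```r
# Hypothetical sample; the real code would use data.combined$name
names.sample <- c("Braund, Mr. Owen Harris",
                  "Cumings, Mrs. John Bradley",
                  "Heikkinen, Miss. Laina",
                  "Palsson, Master. Gosta Leonard",
                  "Uruchurtu, Don. Manuel E")

# Priority order used by extractTitle(): the first match wins
priority <- c("Miss.", "Master.", "Mrs.", "Mr.")

# Start everyone at "Other", then apply the patterns from lowest to
# highest priority so later assignments overwrite earlier ones
titles.sample <- rep("Other", length(names.sample))
for (pattern in rev(priority)) {
  titles.sample[grepl(pattern, names.sample, fixed = TRUE)] <- pattern
}

table(titles.sample)
```

The loop here runs once per pattern rather than once per row. It also avoids growing titles one element at a time with c(), which copies the vector on every iteration.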

Best Features == Best Models

In our Bootcamp we spend a lot of time emphasizing that in most scenarios a Data Scientist is best served by focusing their time on Data Wrangling and (most importantly) Feature Engineering. So often quality Feature Engineering trumps everything else: algorithm selection, hyperparameter tuning, blending, etc. My work on this video series is aligned with our teachings on the importance of Feature Engineering. Hopefully folks get as much out of my new series as I am getting out of making it.

Enjoy and happy data sleuthing!

Want More? Full Tutorial on YouTube!