```r
# load the data with the readr package
spooky_data <- readr::read_csv(file = "./../../../data/train.csv",
                               col_types = "ccc",
                               locale = readr::locale("en"),
                               na = c("", "NA"))

# readr::read_csv does not coerce strings to factors;
# since the author feature is categorical by nature,
# it is converted to a factor explicitly
spooky_data$author <- as.factor(spooky_data$author)
```
The overall data includes 19579 observations with 3 features (id, text, author): 7900 excerpts (40.35 %) by Edgar Allan Poe, 5635 excerpts (28.78 %) by HP Lovecraft and 6044 excerpts (30.87 %) by Mary Wollstonecraft Shelley.
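Per-author counts and percentages like these can be computed with base R's `table` and `prop.table`; the sketch below uses a small toy factor in place of `spooky_data$author` so it runs stand-alone:

```r
# toy author labels standing in for spooky_data$author (illustrative only)
authors <- factor(c("EAP", "EAP", "HPL", "MWS", "EAP", "MWS"))

counts <- table(authors)                     # absolute counts per author
shares <- round(prop.table(counts) * 100, 2) # percentage share per author

print(counts)  # EAP: 3, HPL: 1, MWS: 2
print(shares)  # EAP: 50.00, HPL: 16.67, MWS: 33.33
```

On the real data, replacing `authors` with `spooky_data$author` yields the figures quoted above.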
Avoid the madness!
We must not use all of the provided spooky data to find our way through the unique spookiness of each author: we still want to evaluate how our intuition generalizes to an unseen excerpt/sentence. For this reason the given training data is split into two parts (using stratified random sampling):
- an actual training dataset (70% of the excerpts/sentences), used for
  - exploration and insight creation, and
  - training the classification model
- a test dataset (the remaining 30% of the excerpts/sentences), used for
  - evaluating the accuracy of our classification model.
```r
# set the seed for reproducibility of the partition (any fixed value works)
set.seed(1234)

# stratified random sampling on the author feature: 70% training, 30% test
trainIndex <- caret::createDataPartition(spooky_data$author, p = 0.7, list = FALSE, times = 1)
spooky_training <- spooky_data[trainIndex, ]
spooky_testing <- spooky_data[-trainIndex, ]
```
The training dataset specifically contains 5530 excerpts (40.35 %) by Edgar Allan Poe, 3945 excerpts (28.78 %) by HP Lovecraft and 4231 excerpts (30.87 %) by Mary Wollstonecraft Shelley: the class proportions of the full dataset are preserved.
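To see why stratified sampling preserves these class shares, here is a minimal base-R sketch of the same idea (sampling 70% of the indices within each class separately); the label vector is a toy stand-in for the actual spooky data, and the class sizes are made up for illustration:

```r
set.seed(42)
labels <- factor(rep(c("EAP", "HPL", "MWS"), times = c(40, 28, 32)))

# stratified 70/30 split: sample indices within each class separately
train_idx <- unlist(lapply(split(seq_along(labels), labels),
                           function(idx) sample(idx, size = round(0.7 * length(idx)))))

train_labels <- labels[train_idx]
test_labels  <- labels[-train_idx]

# class shares are preserved (up to rounding) in both partitions
print(prop.table(table(train_labels)))
print(prop.table(table(test_labels)))
```

`caret::createDataPartition` performs this per-class sampling internally, which is why it is preferred over a plain `sample` of row indices.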
Moving our first steps: from darkness into the light
Before starting to build any model, we need to understand the data, build intuitions about the information contained in it, and identify a way to use those intuitions to build a great predictive model.
Is the provided data usable?
Question: does each observation have an id? An excerpt/sentence associated with it? An author?

```r
# count missing values per feature
missingValueSummary <- colSums(is.na(spooky_data))
```
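On a toy data frame with the same three columns, `colSums(is.na(...))` yields one missing-value count per feature (the rows here are invented for illustration):

```r
# toy data frame mimicking the spooky structure (id, text, author)
toy <- data.frame(id = c("id1", "id2", "id3"),
                  text = c("It was a dark night.", NA, "The raven tapped."),
                  author = c("EAP", "HPL", NA),
                  stringsAsFactors = FALSE)

missing_per_column <- colSums(is.na(toy))
print(missing_per_column)  # id: 0, text: 1, author: 1
```

A result of all zeros on the real data would confirm that every observation has an id, a text and an author.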