Spooky Author Identification: EDA

The content of this blog is based on some exploratory data analysis performed on the corpus provided for the “Spooky Author Identification” challenge at Kaggle [1]. The corpus includes excerpts/ sentences from some of the scariest writers of all time.

The Spooky Challenge

A Halloween-based challenge [1] with the following goal: predict who wrote a sentence of a possible spooky story, choosing between Edgar Allan Poe, HP Lovecraft and Mary Wollstonecraft Shelley.

“Deep into that darkness peering, long I stood there, wondering, fearing, doubting, dreaming dreams no mortal ever dared to dream before.” Edgar Allan Poe

“That is not dead which can eternal lie, And with strange aeons even death may die.” HP Lovecraft

“Life and death appeared to me ideal bounds, which I should first break through, and pour a torrent of light into our dark world.” Mary Wollstonecraft Shelley

The Toolset

The only tools available to us during this exploration will be our intuition, curiosity and the selected packages. Specifically:

  • tidytext package, text mining for word processing and sentiment analysis using tidy tools
  • tidyverse package, an opinionated collection of R packages designed for data science
  • wordcloud package, pretty word clouds
  • gridExtra package, supporting functions to work with grid graphics
  • caret package, supporting function for performing stratified random sampling
  • corrplot package, a graphical display of a correlation matrix, confidence interval

# Required libraries
# if packages not installed
# install.packages("packageName")

library(tidytext)
library(tidyverse)
library(gridExtra)
library(wordcloud)
library(dplyr)
library(corrplot)

The Beginning of the Journey: the Spooky Data

We are given a csv file, train.csv, containing some information about the authors. The information consists of a set of sentences written by the different authors (EAP, HPL, MWS). Each entry (line) in the file is an observation providing the following information:

  • an id, a unique id for the excerpt/ sentence (as a string)
  • the text, the excerpt/ sentence (as a string),
  • the author, the author of the excerpt/ sentence (as a string)
    • a categorical feature that can assume three possible values
      • EAP for Edgar Allan Poe,
      • HPL for HP Lovecraft,
      • MWS for Mary Wollstonecraft Shelley

Author: Pier Lorenzo Paracchini

He is a generalist with a passion for people, data and technology. He has a Master of Science in Electronic Engineering from the Politecnico di Milano and works as an enthusiastic developer with a data scientist twist in the software innovation sector at Statoil. His journey in data science and machine learning started in 2014.

# loading the data using readr package
spooky_data <- readr::read_csv(file = "./../../../data/train.csv",
                              col_types = "ccc",
                              locale = locale("en"),
                              na = c("", "NA"))

# readr::read_csv does not convert strings into factors;
# the author feature is categorical by nature,
# so it is converted into a factor explicitly
spooky_data$author <- as.factor(spooky_data$author)

The overall data includes 19579 observations with 3 features (id, text, author). Specifically, 7900 excerpts (40.35 %) of Edgar Allan Poe, 5635 excerpts (28.78 %) of HP Lovecraft and 6044 excerpts (30.87 %) of Mary Wollstonecraft Shelley.
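The per-author counts and percentages can be obtained with a few dplyr verbs. The following is only a minimal sketch: `toy_data` is a hypothetical stand-in for `spooky_data` (ten made-up rows, not the real corpus), and `author_summary` is a name introduced here for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for spooky_data (NOT the real corpus)
toy_data <- tibble::tibble(
  id     = paste0("id", 1:10),
  text   = paste("sentence", 1:10),
  author = factor(c("EAP", "EAP", "EAP", "EAP",
                    "HPL", "HPL", "HPL",
                    "MWS", "MWS", "MWS"))
)

# Count the excerpts per author and express each count as a percentage
author_summary <- toy_data %>%
  count(author, name = "excerpts") %>%
  mutate(percentage = round(100 * excerpts / sum(excerpts), 2))

print(author_summary)
```

Replacing `toy_data` with `spooky_data` should give the same kind of summary for the real data.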

Avoid the madness!

It is forbidden to use all of the provided spooky data for finding our way through the unique spookiness of each author. We still want to evaluate how our intuition generalizes on an unseen excerpt/ sentence, right? For this reason the given training data is split into two parts (using stratified random sampling):

  • an actual training dataset (70% of the excerpts/ sentences), used for
    • exploration and insight creation, and
    • training the classification model
  • a test dataset (the remaining 30% of the excerpts/ sentences), used for
    • evaluation of the accuracy of our classification model.

# setting the seed for reproducibility
set.seed(19711004)
trainIndex <- caret::createDataPartition(spooky_data$author, p = 0.7, list = FALSE, times = 1)
spooky_training <- spooky_data[trainIndex,]
spooky_testing <- spooky_data[-trainIndex,]

Specifically, the training dataset includes 5530 excerpts (40.35 %) of Edgar Allan Poe, 3945 excerpts (28.78 %) of HP Lovecraft and 4231 excerpts (30.87 %) of Mary Wollstonecraft Shelley.
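The percentages match those of the full data because the sampling is stratified by author. A minimal sketch of how this could be checked is given below; it assumes the caret package is installed, and `toy_author`, `toy_index` and `proportions` are hypothetical names built on made-up labels rather than on the real corpus.

```r
library(caret)

# Hypothetical toy labels (NOT the real corpus): 40% EAP, 30% HPL, 30% MWS
toy_author <- factor(rep(c("EAP", "HPL", "MWS"), times = c(400, 300, 300)))

# Same seed and same call as in the split above, applied to the toy labels
set.seed(19711004)
toy_index <- as.vector(
  caret::createDataPartition(toy_author, p = 0.7, list = FALSE, times = 1)
)

# Class proportions (in %) in the full data, the training part and the testing part
proportions <- rbind(
  all      = prop.table(table(toy_author)),
  training = prop.table(table(toy_author[toy_index])),
  testing  = prop.table(table(toy_author[-toy_index]))
) * 100

print(round(proportions, 2))
```

If the three rows are nearly identical, the split has preserved the author distribution.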

Moving our first steps: from darkness into the light

Before we start building any model, we need to understand the data, build intuitions about the information contained in the data and identify a way to use those intuitions to build a great predictive model.

Is the provided data useable?

Question: Does each observation have an id? An excerpt/ sentence associated with it? An author?

missingValueSummary <- colSums(is.na(spooky_