Make Words Usable for Machine Learning
In the past, we have talked about how to build machine learning models on structured datasets. However, life does not always give us data that is clean and structured. Much of the information generated by humans has little or no formal structure: emails, tweets, blogs, reviews, status updates, surveys, legal documents, and much more. There is a wealth of knowledge stored in these kinds of documents, and data scientists and analysts want access to it. "Text analytics" is the process of extracting useful information from text.
Some examples include:
- Predicting the stock market with Tweets
- Detecting fraudulent activities from email messages
- Forecasting box office success from just the screenplay
- Evaluating cultural fit and personality type from resumes
All of these written texts are unstructured, and machine learning algorithms and techniques work best (or often, work only) on structured data. So, in order for our machine learning models to operate on these documents, we must convert the unstructured text into a structured matrix. Usually this is done by transforming each document into a row of a sparse matrix (a really big but mostly empty table). Each word gets its own column in the dataset, which tracks either whether the word appears in the text (binary) or how often it appears (term frequency). For example, consider the two statements below, which have been transformed into a simple term-frequency matrix. Each word gets a distinct column and its frequency of occurrence is tracked. If this were a binary matrix, there would only be ones and zeros instead of counts.
| | twinkle | little | star | all | the | night |
|---|---|---|---|---|---|---|
| Twinkle, twinkle, little star. | 2 | 1 | 1 | 0 | 0 | 0 |
| Twinkle, twinkle, all the night. | 2 | 0 | 0 | 1 | 1 | 1 |
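The transformation above can be sketched in a few lines of Python. This is a minimal illustration, assuming a simple lowercase, letters-only tokenizer; real text pipelines handle punctuation, casing, and vocabulary far more carefully.

```python
from collections import Counter
import re

docs = [
    "Twinkle, twinkle, little star.",
    "Twinkle, twinkle, all the night.",
]

# Tokenize: lowercase each document and keep only runs of letters
tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]

# The vocabulary becomes the columns of the matrix
vocab = sorted(set(w for doc in tokenized for w in doc))

# Term-frequency matrix: one row per document, one count per vocabulary word
tf_matrix = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

# Binary variant: 1 if the word appears at all, 0 otherwise
binary_matrix = [[1 if c > 0 else 0 for c in row] for row in tf_matrix]
```

With the two nursery-rhyme lines, `vocab` has six columns and each row of `tf_matrix` matches the table above (with zeros filled in for absent words).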
Build a Matrix
While our example was simple (6 words), term frequency matrices on larger datasets can be tricky.
Imagine turning every word in the Oxford English Dictionary into a matrix: that's 171,476 columns. Now imagine adding everyone's names, and every corporation, product, or street name that ever existed. Now feed it slang. Feed it every rap song. Feed it fantasy novels like Lord of the Rings or Harry Potter so that our model will know what to do when it encounters "The Shire" or "Hogwarts". Good, now that's just English. Do the same thing again for Russian, Mandarin, and every other language.
After this is accomplished, we are approaching a matrix with several billion columns, and two problems arise. First, it becomes computationally infeasible and memory intensive to perform calculations over this matrix. Second, the curse of dimensionality kicks in: distance measurements become so absurdly large in scale that they all seem the same. Most of the research and time that goes into natural language processing is less about the syntax of language (which is important) than about how to reduce the size of this matrix.
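The "distances all seem the same" effect can be illustrated with a small experiment (pure Python, no external libraries; the point counts and dimensions here are arbitrary choices for demonstration). For random points in a high-dimensional cube, the farthest and nearest points end up at nearly the same distance:

```python
import math
import random

random.seed(0)

def distance_spread(dim, n_points=100):
    """Ratio of the farthest to the nearest distance from the origin
    for random points in a dim-dimensional unit cube."""
    points = [[random.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.sqrt(sum(x * x for x in p)) for p in points]
    return max(dists) / min(dists)

# As dimensionality grows, the spread collapses toward 1:
# every point looks roughly equidistant from every other.
for dim in (2, 10, 100, 10000):
    print(dim, round(distance_spread(dim), 2))
```

In two dimensions the ratio is large (some points land near the origin, some far away); by ten thousand dimensions it is close to 1, which is why nearest-neighbor-style reasoning degrades on enormous term matrices.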
Now we know what we must do and the challenges we face in order to reach our desired result. The next three blogs in this series will address these problems directly, introducing you to three concepts: conforming, stemming, and stop word removal.