There is such a thing as too much data cleaning. The more we clean and remove, the more "lost in translation" the textual message may become. We may inadvertently strip information or meaning from our messages, so that by the time our machine learning algorithm sees the textual data, much or all of the relevant information is gone. For each type of cleaning above, there are situations in which you will want to either skip it altogether or apply it selectively. As in all data science situations, experimentation and good domain knowledge are required to achieve the best results.
When do we want to avoid over-cleaning?
Special Characters: The advent of email, social media, and text messaging has given rise to text-based emoticons composed of ASCII special characters. For example, if you were building a sentiment predictor for text, emoticons like "=)" or ">:(" are highly indicative of sentiment because they directly express a happy or unhappy mood. Stripping our messages of these emoticons by removing special characters also strips meaning from our message.
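One way to have it both ways is to convert known emoticons into plain-word tokens before stripping special characters. Here is a minimal sketch; the `EMOTICONS` mapping and the `EMO_*` token names are illustrative stand-ins for a domain-specific list:

```python
import re

# Hypothetical whitelist of ASCII emoticons to preserve; a real list
# would be built from the domain (e.g., social media data).
EMOTICONS = {">:(": "EMO_ANGRY", "=)": "EMO_HAPPY", ":(": "EMO_SAD"}

def clean(text, keep_emoticons=True):
    if keep_emoticons:
        # Swap each emoticon for a plain-word token before stripping.
        for emo, token in EMOTICONS.items():
            text = text.replace(emo, " " + token + " ")
    # Remove everything except letters, underscores, and whitespace.
    text = re.sub(r"[^A-Za-z_\s]", " ", text)
    return " ".join(text.split())

print(clean("Great service =)", keep_emoticons=False))  # Great service
print(clean("Great service =)"))                        # Great service EMO_HAPPY
```

With the tokens preserved, a downstream model can learn that `EMO_HAPPY` is a strong positive signal rather than never seeing the emoticon at all.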
Numbers: Consider the infinitely gridlocked freeway in Washington state, "I-405". In a sentiment predictor model, anytime someone mentions "I-405", more likely than not the document should be classified as "negative". However, after removing numbers and special characters, the token becomes simply "I". Our models will be unable to use this information, which, based on domain knowledge, we would expect to be a strong predictor.
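A common workaround is to protect known domain terms with letters-only placeholders before cleaning, then restore them afterward. The `PROTECTED` list below is a hypothetical example; in practice it would come from domain knowledge:

```python
import re
import string

# Hypothetical domain terms whose digits/punctuation carry meaning.
PROTECTED = ["I-405"]

def naive_clean(text):
    # Aggressive cleaning: strip digits and special characters.
    return " ".join(re.sub(r"[^A-Za-z\s]", " ", text).split())

def domain_aware_clean(text):
    placeholders = {}
    for i, term in enumerate(PROTECTED):
        # Letters-only placeholder, so it survives the cleaning step.
        ph = "XPROTECTEDX" + string.ascii_uppercase[i]
        placeholders[ph] = term
        text = text.replace(term, ph)
    text = naive_clean(text)
    for ph, term in placeholders.items():
        text = text.replace(ph, term)
    return text

print(naive_clean("Stuck on I-405 again"))         # Stuck on I again
print(domain_aware_clean("Stuck on I-405 again"))  # Stuck on I-405 again
```

The naive version collapses "I-405" into an uninformative "I", while the domain-aware version keeps the predictive token intact.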
Casing: Even case can sometimes carry useful information. For instance, the word "trump" may carry a different sentiment than "Trump" with a capital T, representing someone's last name. One solution for identifying proper nouns that may carry information is named entity recognition, where we use a combination of predefined dictionaries and scanning of the surrounding syntax (sometimes called "lexical analysis"). Using this, we can identify people, organizations, and locations and exclude them from lowercasing.
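As a toy illustration of that idea, the sketch below lowercases every token except those a tiny dictionary lookup or a surrounding-syntax rule flags as proper nouns. `KNOWN_ENTITIES`, `TITLES`, and the follows-a-title rule are illustrative stand-ins; a real system would use a trained NER model:

```python
# Hypothetical entity dictionary and title list for this sketch.
KNOWN_ENTITIES = {"Trump", "Seattle", "Microsoft"}
TITLES = {"Mr.", "Ms.", "Dr.", "President"}

def smart_lowercase(tokens):
    out = []
    for i, tok in enumerate(tokens):
        # Surrounding-syntax rule: a token right after a title is
        # probably a name, so its casing is kept.
        follows_title = i > 0 and tokens[i - 1] in TITLES
        if tok in KNOWN_ENTITIES or follows_title:
            out.append(tok)       # likely a proper noun: keep casing
        else:
            out.append(tok.lower())
    return out

print(smart_lowercase("President Trump visited Seattle To Talk".split()))
# ['president', 'Trump', 'visited', 'Seattle', 'to', 'talk']
```

Here "Trump" and "Seattle" keep their capitals, while incidental capitalization like "To Talk" is normalized away.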
Next, we’ll talk about stemming and lemmatization as ways to help computers understand that different forms of a word can share the same meaning (e.g., run, running, runs).
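To preview the idea, here is a deliberately tiny suffix-stripping stemmer. Real systems use algorithms like the Porter stemmer or a dictionary-backed lemmatizer; the suffix list and minimum-stem-length rule below are simplifying assumptions:

```python
# Toy suffix list, checked longest-first so "running" loses "ning", not "ing".
SUFFIXES = ("ning", "ing", "ed", "s")

def toy_stem(word):
    for suf in SUFFIXES:
        # Only strip if at least a 3-letter stem remains.
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print([toy_stem(w) for w in ["run", "running", "runs"]])
# ['run', 'run', 'run']
```

All three forms collapse to the shared stem "run", so a model sees them as one feature instead of three.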