
Every cook knows how to avoid Type I Error: just remove the batteries. Let’s also learn how to reduce the chances of Type II errors. 

Why type I and type II errors matter

A/B testing is an essential component of large-scale online services today. So essential that every online business worth mentioning has been doing it for the last 10 years.

A/B testing is also used in email marketing by all major online retailers. The Obama for America data science team received a lot of press coverage for leveraging data science, especially A/B testing during the presidential campaign.

Hypothesis testing outcomes – Type I and Type II errors

Here is an interesting article on this topic, along with a data science bootcamp that teaches A/B testing and statistical analysis.

If you have been involved in anything related to A/B testing (online experimentation) on UI, relevance, or email marketing, chances are that you have heard of Type I and Type II errors. The usage of these terms is common, but a good understanding of them is not.

I have seen illustrations as simple as this.

Examples of type I and type II errors

I intend to share two great examples I recently read that will help you remember this especially important concept in hypothesis testing.

Type I error: An alarm without a fire.

Type II error: A fire without an alarm.

Every cook knows how to avoid Type I Error – just remove the batteries. Unfortunately, this increases the incidences of Type II error.

Reducing the chances of Type II error would mean making the alarm hypersensitive, which in turn would increase the chances of Type I error.
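To make the trade-off concrete, here is a tiny, made-up simulation of a smoke alarm. All of the numbers (smoke levels, thresholds) are invented purely for illustration:

```python
import random

random.seed(42)

def error_rates(threshold, trials=10_000):
    """Estimate Type I and Type II error rates for an alarm threshold."""
    type1 = 0  # alarm without a fire (false positive)
    type2 = 0  # fire without an alarm (false negative)
    for _ in range(trials):
        no_fire_reading = random.gauss(2.0, 1.0)  # smoke level on a normal day
        fire_reading = random.gauss(5.0, 1.0)     # smoke level during a fire
        if no_fire_reading > threshold:
            type1 += 1
        if fire_reading <= threshold:
            type2 += 1
    return type1 / trials, type2 / trials

# A lax alarm ("batteries removed") rarely cries wolf but misses fires;
# a hypersensitive alarm catches every fire but cries wolf constantly.
lax_t1, lax_t2 = error_rates(threshold=6.0)
strict_t1, strict_t2 = error_rates(threshold=1.0)
```

Lowering the threshold drives Type I errors up and Type II errors down; raising it does the reverse. You cannot push both to zero at once.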

Another way to remember this is by recalling the story of the Boy Who Cried Wolf.

Boy Who Cried Wolf


Null hypothesis: There is no wolf.

Alternative hypothesis: There is a wolf.

Villagers believing the boy when there was no wolf (rejecting the null hypothesis incorrectly): Type I error.

Villagers not believing the boy when there was a wolf (failing to reject the null hypothesis when it was false): Type II error.


The purpose of this post is not to explain Type I and Type II errors in depth. If this is the first time you are hearing about these terms, here is the Wikipedia entry: Type I and Type II Errors.

June 15, 2022

During political campaigns, political candidates not only need votes but also need financial contributions. Here’s what the campaign finance data set from 2012 looks like.

Understanding individual political contribution by the occupation of top 1% vs bottom 99% in political campaigns

A political candidate not only needs votes; they also need money. In today's multi-media world, millions of dollars are necessary to run a campaign effectively. To win the election battle, citizens will be bombarded with ads that cost millions. Other mounting expenses include wages for staff, consultants, surveyors, grassroots activists, media experts, wonks, and policy analysts. The figures are staggering, with the next presidential election year campaigns likely to cost more than ten billion dollars.

Election Cost
The total cost of US elections from 1998 to 2014

Opensecrets.org has summarized the money spent by presidential candidates, Senate and House candidates, political parties, and independent interest groups that played an influential role in the federal elections by cycle.  There’s no sign of less spending in future elections.

The 2016 presidential election cycle is already underway, and the fund-raising war has already begun. The Koch brothers' political organization released an $889 million budget in January 2015 supporting conservative campaigns in the 2016 presidential contest. As for primary presidential candidates, the Hillary Clinton campaign aims to raise at least $100 million for the primary election. On the other side of the political aisle, analysts speculated that primary candidate Jeb Bush will raise over $100 million when he discloses his financial position in July.

In my mind, I imagine that money coming from millionaires and billionaires or mega-corporations intent on promoting candidates that favor their cause. But who are these people? And how about middle-class citizens like me? Does my paltry $200 amount to anything? Does the spending power of the 99% have any impact on the outcome of an election? Even as a novice, I knew I would never understand American politics by listening to TV talking heads or to the candidates and their say-nothing ads, but only by following the money.

By investigating real data about where the stream of money dominating our elections comes from and the role it plays in the success of an election, I hope to find some insight into all the political noise. Thanks to the Federal Election Campaign Act, which requires candidate committees, party committees, and political action committees (PACs) to disclose reports on the money they raise and spend and to identify individuals who give more than $200 in an election cycle, a wealth of public data exists to explore. I chose to focus on individual contributions to federal committees greater than $200 for the 2011-2012 election cycle.

The data is publicly available at http://www.fec.gov/finance/disclosure/ftpdet.shtml.

Creating the groups

In the 2012 election cycle, which includes congressional and primary elections, the total amount of individual donations collected was USD 784 million. USD 220 million of that came from the top 1% of donors, making up 28% of the total contribution. These wealthy elite donors were 7,119 individuals, each having donated at least USD 10,000 to federal committees. So, who are the top 1%? What do they do for a living that gives them such financial power to support political committees?

The unique occupation titles in the dataset are simply too numerous to analyze directly. Thus, these occupations were classified into twenty-two occupation groups according to the employment definitions from the Bureau of Labor Statistics. Additional categories were created for titles that lacked a definition to place them into appropriate groups; among them are “Retired,” “Unemployed,” “Homemaker,” and “Politicians.”
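As a rough sketch of that preprocessing, the grouping and the top-1% cutoff might look like this in pandas. The column names and the occupation-to-group mapping below are illustrative stand-ins, not the real FEC header layout:

```python
import pandas as pd

# Hypothetical mapping from raw occupation titles to occupation groups.
occupation_to_group = {
    "CEO": "Management",
    "PRESIDENT": "Management",
    "INVESTOR": "Business and Financial Operations",
    "ATTORNEY": "Legal",
    "HOMEMAKER": "Homemaker",
}

# A miniature stand-in for the FEC individual-contributions file.
donations = pd.DataFrame({
    "contributor": ["A", "B", "C", "D", "E"],
    "contributor_occupation": ["CEO", "ATTORNEY", "HOMEMAKER", "INVESTOR", "PRESIDENT"],
    "amount": [15000, 500, 30000, 12000, 250],
})

donations["occupation_group"] = donations["contributor_occupation"].map(occupation_to_group)

# Total raised per donor, then the 99th-percentile cutoff that separates
# the "top 1%" from everyone else.
per_donor = donations.groupby("contributor")["amount"].sum()
cutoff = per_donor.quantile(0.99)
top_donors = per_donor[per_donor >= cutoff]
```

On the real file the same two steps (map titles to groups, split donors at the 99th percentile of total giving) produce the top-1% versus bottom-99% comparison used throughout this post.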

Immediately from Figure 1, we observe that the “Management” occupation group contributed the highest total amount in the 2012 cycle for Democrats, Republicans, and other parties alike. Other top donors by occupation group are “Business and Financial Operations,” “Retired,” “Homemaker,” “Politicians,” and “Legal.” Overall, the Republican Party received more individual contributions from most of the occupation groups, with the noticeable exceptions of “Legal” and “Arts, Design, Entertainment, Sports and Media.” The total contribution given to “Other” non-Democratic/Republican parties was abysmal in comparison.

Figure 1: Total contribution of top 1% by occupation group

Total USD of the top 1% contributors by occupational group

One might conclude that the reason for the “Management” group being the top donor is obvious, given that these people are CEOs, CFOs, presidents, directors, managers, and holders of many other management titles in a company. According to the Bureau of Labor Statistics, the “Management” group earned the highest median wages among all occupation groups. They simply had more to give. The same argument could be applied to the “Business and Financial Operations” group, which comprises people who held jobs as investors, business owners, real estate developers, bankers, etc.

Perhaps we could look at individual contributions by occupation group from another angle. When analyzing the average contribution by occupation group, the “Politicians” group tops the chart. Individuals belonging to this category are either currently holding public office or had declared candidacy for office with no other occupation reported. Since there is no limit on how much candidates may contribute to their own committees, this group represents rich individuals funding their own campaigns.

Figure 2: Average contribution of Top 1% by occupation groups


Suspiciously, the average amount per politician given to Republican committees is dramatically higher than for other parties. Further analysis indicated that the outlier is candidate Jon Huntsman, who donated about USD 5 million to his committee, “Jon Huntsman for President Inc.” This inflated the average contribution dramatically. The same phenomenon was also observed in the “Management” group, where the average contribution to the “Other” party was significantly higher compared to the traditional parties.

Out of the five donors who contributed to an independent party from the “Management” group, William Bloomfield alone donated USD 1.3 million (out of the USD 1.45 million total amount collected) to his “Bloomfield for Congress” committee. According to the data, he was the Chairman of Baron Real Estate.  This is an example of a wealthy elite spending a hefty sum of money to buy his way into the election race.

Donald Trump, a billionaire business mogul, made headlines recently by declaring his intention to run for president in the 2016 election. He certainly has no trouble paying for his campaign. After excluding the “Politicians” and “Management” occupation groups, with the intention of visualizing the comparison among the remaining groups more clearly, the contrast became less dramatic. Even so, the average contribution to Republican committees is consistently higher than to other parties in most of the occupation groups.

Figure 3: Average contribution of Top 1% by occupation group excluding politicians and management group


Could a similar story be told for the bottom 99%? Overall, the top 5 contributors by occupation group are quite similar between the top 1% and bottom 99%. Once again, the “Management” group collectively raised the most donations for the Democratic and Republican Parties. The biggest difference here is that “Politicians” is no longer a top contributor in the bottom 99% demographic.

Figure 4: Total contribution of bottom 99% by Occupation Group


Homemakers consistently rank high in both total and average contributions, in both the top 1% and the bottom 99%. On average, homemakers from the bottom 99% donated about $1,500, while homemakers from the top 1% donated about $30,000 to their chosen political committees. Clearly, across all levels of socioeconomic status, spouses and stay-at-home parents play a key role in the fundraising war. Since the term “Homemaker” is not well-defined, I can only assume their source of money comes from a spouse, inherited wealth, or personal savings.

Figure 5: Average contribution of bottom 99% by occupation group


Another observation we can draw from the bottom 99% average-contribution plot is that “Other” non-Democratic/Republican parties depend heavily on the 99% as a source of funding for their political campaigns. Third-party candidates appear to draw most of their support from the little guy.

Figure 6: Median wages and median contribution by occupation group


Another interesting question warranting further investigation is whether the amount individuals contribute to political committees is proportionally consistent across occupation groups. When we plot median wages per occupation group side by side with median political contributions, the median donation per group is rather constant while the median income varies significantly across groups. This implies that, despite contributing the most overall, as a percentage of their income the wealthiest donors contributed the least.
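A toy calculation makes the point. With invented round numbers (not the real BLS or FEC figures), a roughly constant median donation is a far smaller slice of a high median wage:

```python
# Made-up median wages and median donations for three occupation groups.
median_wage = {"Management": 110_000, "Legal": 95_000, "Office Support": 35_000}
median_donation = {"Management": 500, "Legal": 500, "Office Support": 450}

# Donation as a fraction of income: near-constant donations mean the
# lowest-paid group gives the largest share of what it earns.
share_of_income = {
    group: median_donation[group] / median_wage[group]
    for group in median_wage
}
```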

Campaign finance data: The takeaway

The take-home message from this analysis is that the top 1% of wealthy elites seem to be driving the momentum of fundraising for election campaigns. I suspect most of them fully intend to support candidates who would look out for their interests if indeed they got elected. We middle-class citizens may not be able to compete financially with these millionaires and billionaires, but our single vote is as powerful as theirs. The best thing we can do as citizens is to educate ourselves on the issues that matter to the future of our country.


June 15, 2022

What does the data look like for political contributions when we look at each state? How does generosity appear in each state, and what does state activism look like?

Generosity and activism by the state

A few days ago, I published an article about analyzing financial contributions to political campaigns.

When we look at the total individual contributions to political committees by state, it is apparent that California, New York, and Texas take the lead. Given that these states have the largest populations, can we justify a claim that their residents are more generous when it comes to political contributions?

Generosity and Activism by State
Individual contributions from 2011-2014 by State

Individual political contributions per capita

In contrast, the contribution per capita tells a different story. After adjusting for population by state, Massachusetts and Connecticut lead in political generosity, while Idaho and Mississippi consistently collect fewer total contributions and less per person. Other generous states are New York, Virginia, Wyoming, California, and Colorado.

Individual Political Contributions per Capita
A map of individual political contributions per capita
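The adjustment itself is simple division. With made-up round numbers (not the real FEC or census figures), you can see how the ranking flips once population is taken into account:

```python
# Invented totals and populations for three states, for illustration only.
totals = {"California": 200_000_000, "Massachusetts": 40_000_000, "Idaho": 2_000_000}
population = {"California": 38_000_000, "Massachusetts": 6_700_000, "Idaho": 1_600_000}

# Dollars contributed per resident.
per_capita = {state: totals[state] / population[state] for state in totals}

# Ranking by raw totals puts the biggest state first; ranking per capita
# promotes the smaller, proportionally more generous state.
by_total = sorted(totals, key=totals.get, reverse=True)
by_capita = sorted(per_capita, key=per_capita.get, reverse=True)
```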

Measuring political activism

Can we measure political activism by analyzing the individual contribution data? When we look at the number of donors as a percentage of the total population by state, surprisingly, Montana has a high share of political donors.

Measuring Political Activism
Percentage of state population donated


June 15, 2022

Have you noticed that we have two machine learning demos on our site that allow you to deploy predictive models?

The Titanic Survival Predictor is designed to work with a Microsoft Azure model for machine learning. The AWS Machine Learning Caller is our new demo that connects to an Amazon Machine Learning model.

The idea is that you can use either Microsoft Azure ML or Amazon ML to build a machine learning model, and then use our demo to input values for a prediction. But what's the difference between the two approaches?

Each ML program provides an endpoint that you can use to access the model and run predictions. Our demos interface with that endpoint and provide a graphic user interface for making predictions.

So, what’s the difference between the machine learning demos?

First of all, the backend is different. But we’ll keep this brief.

The graphic below shows what types of models can be run through the demo.

  • The cruise ship represents the Titanic classification model generated from our Azure ML tutorial.
  • The iris represents any classification model, such as a model used to predict species from a set of measurements.
  • The complicated graph represents a regression model. Regression models are used to predict a number given a set of input numbers.
Titanic Survivor Predictor - machine learning
AWS machine learning caller

You can see that the Titanic model can link to both demos, but the classification (iris) model only links to our Amazon demo. The numerical dataset does not work with either of our demos.

The demos are currently limited to classification models only (because linear regression models work differently and require a different backend).

MLaaS: User perspectives

From the user perspective, the Titanic Survival Predictor is built for a specific purpose. It interfaces with the exact Titanic classification model that we created for Azure and is included as part of our bootcamp. Users can change all the tuning parameters and make the model unique.

However, the input variables, or “schema,” need to be labeled the same way as in the original model or it won’t work.

So, if you rename one of the columns, the demo will have an error. However, since we published the Azure model online, it’s pretty easy to copy the model and change some parameters.

To get your predictive model to work with our Titanic Survival Predictor demo, you’ll need the following information:

  • Name (used to generate your own url)
  • Post URL (or endpoint)
  • API key

The AWS Machine Learning Caller is not built for a specific dataset like Titanic. It will work with any logistic regression model built in Amazon Machine Learning. When you input your access keys and model id, our demo automatically pulls the schema from Amazon.

It does not require a specific schema like our Titanic Survival Predictor.

To get your predictive model to work with our AWS Machine Learning Caller demo, you’ll need the following information:

  • Access key
  • Secret access key
  • AWS Account Region
  • AWS ML Model ID
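Under the hood, a call with those four values might look roughly like this sketch using boto3’s `machinelearning` client. The model ID and record fields are hypothetical, and the exact flow our demo uses may differ:

```python
def to_record(values):
    # Amazon ML expects every value in the prediction Record as a string.
    return {key: str(value) for key, value in values.items()}

def predict(access_key, secret_key, region, model_id, values):
    import boto3  # imported here so the helper above works without boto3 installed

    client = boto3.client(
        "machinelearning",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        region_name=region,
    )
    model = client.get_ml_model(MLModelId=model_id)  # pulls the model's metadata
    return client.predict(
        MLModelId=model_id,
        Record=to_record(values),
        PredictEndpoint=model["EndpointInfo"]["EndpointUrl"],
    )
```

This is also why the demo can pull the schema automatically: the model metadata returned by Amazon describes the expected inputs.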

Why do two machine learning demos do similar things?

These are training tools for our 5-day bootcamp. We use Microsoft Azure to teach classification models. The software has tools for data cleaning and manipulation. The way that the tools are laid out is visual and easy to understand. It provides a clear organization of the processes: input data, clean data, build a model, evaluate the model, and deploy the model.

Microsoft Azure has been a great way to teach the model-building process.

We’ve recently added Amazon Machine Learning to our curriculum. The program is simpler: all the processes described above are automated, and Amazon ML walks users through the process.

However, it does provide slightly different evaluation metrics than Microsoft Azure, so we use it to teach regression and classification models as well.

Help us get better!

We are always looking for ways to incorporate new tools into our curriculum. If there is a tool that you think we ought to have, please let us know in the comments.

Or, you can contact us here


June 15, 2022

Data Science Dojo’s non-profit data science fellowship program selected fellows to use data science for social good in their respective careers.

Redmond startup enables non-profit data science training

Redmond, Washington – March 7, 2016 – Data Science Dojo, an educational start-up, has selected its first non-profit data science fellowship applicant to receive tuition-free training in data science and data engineering.

The fellow can use this non-profit data fellowship for work involving data science for social good. Michael Dandrea from the Sustainability Accounting Standards Board (SASB), a 501(c)(3) non-profit, will attend the company’s 5-day Seattle Bootcamp, scheduled for March 28 – April 1, 2016.

As a Data Science Dojo Fellow, he receives 50 hours of tuition-free training, books and materials, and some travel expenses.

Why does Michael want to attend a Seattle Bootcamp on data science?

“(SASB’s) mission is dependent on researching large amounts of structured and unstructured sustainability data for trends that support the greater disclosure of sustainability information to the public.

However, as a 35-person organization without significant technology resources and awareness, it is challenging for us to explore possible means to scale our research efforts with the subject.”


As data science has become increasingly central to research and fundraising efforts, non-profits want to take advantage of these new tools in pursuit of social good.

However, maintaining an in-house data science team is expensive.

This non-profit data science fellowship program offers non-profits the opportunity to train one of their employees in data science and engineering.

By the end of the class, Michael may not be a data scientist, but he will be able to help SASB perform research more efficiently, saving time and money.

About the Data Science Dojo non-profit data science fellowship program:

The Non-Profit Data Science Fellowship program is open to students and non-profit employees. Four fellows are selected for each calendar year. To date, Data Science Dojo has selected six student fellows and one non-profit employee.

Interested students and non-profit employees can submit their applications at http://datasciencedojo.com/bootcamp/fellowship/.

About Data Science Dojo:

Data Science Dojo is an education startup dedicated to enabling professionals to extract actionable insights from data. Our 5-day, intensive boot camps and corporate training consist of hands-on labs, critical thinking sessions, and a Kaggle competition.

Graduates deploy predictive models and evaluate the effectiveness of different machine learning algorithms. Through these data science boot camps, we are building a community of mentors, students, and professionals committed to unleashing the potential of the industry.

Contributor: Michael DAndrea

June 15, 2022

This Azure tutorial will walk you through deploying a predictive model in Azure Machine Learning, using the Titanic dataset.

The classification model covered in this article uses the Titanic dataset to predict whether a passenger would live or die based on demographic information. We’ve already built the model and the front-end UI for you. This tutorial will show you how to customize the Titanic model we built and deploy your own version.

MLaaS overview:

About the data

The Titanic dataset’s complexity scales up with feature engineering, making it one of the few datasets good for both beginners and experts. There are numerous public resources from which to obtain the Titanic dataset; however, the most complete (and clean) version of the data can be obtained from Kaggle, specifically their “train” data.

The “train” Titanic data ships with 891 rows, each one about a passenger on the RMS Titanic, the night of the disaster. The dataset also has 12 columns that record attributes of each passenger’s circumstances and demographics such as passenger id, passenger class, age, gender, name, number of siblings and spouses aboard, number of parents and children aboard, fare, ticket number, cabin number, port of embarkation, and whether or not they survived.

For additional reading, a repository of biographies about everyone aboard the RMS Titanic can be found here (complete with pictures).

Titanic route

Getting the experiment

About the Titanic survival user interface

From the dataset, we will build a predictive model and deploy the model in AzureML as a web service. Data Science Dojo has built a front-end UI to interact with such a web service.

Click on the link below to view a finished version of this deployed web service.

Titanic Survival Predictor

Use the app to see what your chance of survival might have been if you were on the Titanic. Play around with the different variables. What factors does the model deem important in calculating your predicted survival rate?

The following tutorial will walk you through how to deploy a titanic prediction model as a web service.


titanic survival predictor

Get an Azure ML account

This MLaaS tutorial assumes that you already have an AzureML workspace. If you do not, please visit the following link for a tutorial on how to create one.

Creating Azure ML Workspace

Please note that an Azure ML 8-hour free trial does not have the option of deploying a web service.

If you already have an AzureML workspace, then simply sign in and proceed to the next step.


Clone the experiment

For this MLaaS tutorial, we will provide you with the completed experiment by letting you clone ours. If you are curious about how we created the experiment, please view our companion tutorial, where we talk about the process of data mining.




Our experiment is hosted in the Azure ML public gallery. Navigate to the experiment by clicking on the link below or by clicking “Clone to Azure ML” within the Titanic Survival Predictor web page itself. The Azure ML Gallery is a place where people can showcase their experiments within the Azure ML community.

Gallery Titanic Experiment

Click on the “open in studio” button.

The experiment and dataset will be copied to your studio workspace. You should now see a bunch of modules linked together in a workflow. However, since we have not run the experiment, the workflow is only a set of instructions which Azure ML will use to build your models. We will have to run the experiment to produce anything.

Click the “run” button at the bottom middle of the AzureML window.

This will execute the workflow that is present within the experiment. The experiment will take about 2 minutes and 30 seconds to finish running. Wait until every module has a green checkmark next to it. This indicates that each module has finished running.

MLaaS predictive model evaluation and deployment

Select an algorithm

You may have noticed that the cloned experiment shipped with two predictive models–two different decision forests. However, because we can only deploy one predictive model, we should see which performs better. Right click on the output node of the evaluate model module and click “visualize.”


visualize model

Evaluate your model

For the purpose of this tutorial, we will define the “better” performing model as the one which scored a higher ROC AUC. We will gloss over evaluating performance metrics of classification models since that would require a longer, more in-depth discussion.

In the evaluate model module, you will see a “ROC” graph with a blue and a red line graphed on it. The blue line represents the ROC performance of the model on the left, and the red line represents the performance of the model on the right. The higher the curve is on the graph, the better the performance. Since the red curve (the right model) is higher on the graph than the blue curve, we can say that the right model is the better-performing model in this case. We will now deploy the corresponding decision forest model.


evaluate model

Deploy the experiment

Before deployment, all modules must have a green check mark next to them.

To deploy the selected decision forest model, select the “train model module” on the right.

While that is selected, hover over the “setup web service” button on the bottom middle of the screen. A pull-up menu will appear. Select “predictive web service”.

Azure ML will now remove and consolidate unnecessary modules; then it will automatically save the predictive model as a trained model and set up web service inputs and outputs.


train model (1)

deploy model

Drop the response class

Our web service is almost complete. However, we need to tune the logic behind the web service function. The score model module is the module that will execute the algorithm against a given dataset. The score model module can also be called the “prediction module” because that is what happens when you apply a trained algorithm against a dataset.

You will notice that the score model module also takes in a dataset on the right input node. When deploying a predictive model, the score model module will need a copy of the required schema. The dataset used to train the model is fed back into the score model module because that is the schema that our trained algorithm currently knows.

However, that schema also holds our response class “Survived,” the attribute that we are trying to predict. We must now drop the “Survived” column. To do this we will use the “project columns” module. Search for it in the search bar on the left side of the AzureML window, then drag it into the workspace.

Replicate the picture on the left by connecting the last metadata editor’s output node to the input of the new project columns module. Then connect the output of the new project columns module with the right input of the score model module.

Select the project columns module once the connections have been made. A “properties” window will appear on the right side of the AzureML window. Click on “launch column selector.”

To drop the “Survived” column, we will “Begin With: All Columns,” then choose to “Exclude” by “column names,” selecting “Survived.”


drop target

drop target - 1

Reroute web service input

We must now point our web service input in the correct direction. The web service input is currently pointing to the beginning of the workflow where data was cleaned, columns were renamed, and columns were dropped. However, the form on the Titanic Prediction App will do the cleansing for you.

Let’s reroute the web service input to point directly at our score model module. Drag the web service input module down toward the score model module and connect it to the right input node of the score model (the same node that the newly added project columns module is also connected to).

Deploy your model

Once all the rerouting has been done, run your experiment one last time. A “Deploy Web Service” button should now be clickable at the bottom middle of the Azure ML window. Click this and AzureML will automatically create and host your web service API with your own endpoints and post-URL.


deploy model -1

Exposing the deployed webservice


API Diagram


Test your webservice

You should now be on the web deployment screen for your web service. Congratulations! You are now in possession of a web service that is connected to a live predictive model. Let’s test this model to see if it behaves properly.

Click the “test” button in the middle of the web deployment screen. A window with a form should popup. This form should look familiar because it is the same form that the Titanic Predictor App was showing you.

Send the form a few values to see what it returns. The predictions will come back in JSON format. The last number in the JSON response is the prediction itself, which should be a decimal akin to a percentage. This percentage is the predicted likelihood of survival based upon the given parameters, or in this case the passenger’s circumstances while aboard the Titanic.


test model
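If you would rather test from code than from the form, the request can be sketched as below. Azure ML classic request-response services expect this JSON envelope; the endpoint URL, API key, and column names here are placeholders for your own deployment’s values:

```python
import json
import urllib.request

def build_payload(column_names, values):
    """Wrap one row of inputs in the Azure ML request-response envelope."""
    return {
        "Inputs": {
            "input1": {"ColumnNames": column_names, "Values": [values]},
        },
        "GlobalParameters": {},
    }

def call_service(post_url, api_key, payload):
    # POST the payload with the API key as a bearer token; returns parsed JSON.
    request = urllib.request.Request(
        post_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())

# Placeholder column names and values, for illustration only.
payload = build_payload(["Pclass", "Sex", "Age"], ["3", "male", "22"])
```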


Find your API key

The API key is located on the web deployment screen, above the test button that you clicked on earlier. The API key input box comes with a copy-to-clipboard button; click it to copy the key. Paste the key into the “Add Your Own Model” page.


find API

Get your post URL

To grab the POST URL, click on the “REQUEST/RESPONSE” button to the left of the test button. This will take you to the API help page.

Under “Request,” to the right of “POST,” is the URL. Copy and paste this URL into the “Add Your Own Model” form.


get POST url


get POST url - 1

Enjoy and share

You now have your very own web service! Remember to save the URL because it is your own web page that you may share with others.

If you have a free-trial Azure ML account, please note that your web service may be discontinued when your free trial subscription ends.


June 15, 2022

All of these written texts are unstructured; text mining algorithms and techniques work best on structured data.

Text analytics for machine learning: Part 1

Have you ever wondered how Siri can understand English? How can you type a question into Google and get what you want?

Over the next week, we will release a five-part blog series on text analytics that will give you a glimpse into the complexities and importance of text mining and natural language processing.

This first section discusses how text is converted to numerical data.

In the past, we have talked about how to build machine learning models on structured data sets. However, life does not always give us data that is clean and structured. Much of the information generated by humans has little or no formal structure: emails, tweets, blogs, reviews, status updates, surveys, legal documents, and so much more. There is a wealth of knowledge stored in these kinds of documents which data scientists and analysts want access to. “Text analytics” is the process by which you extract useful information from text.


All these written texts are unstructured; machine learning algorithms and techniques work best (or often, work only) on structured data. So, for our machine learning models to operate on these documents, we must convert the unstructured text into a structured matrix. Usually this is done by transforming each document into a sparse matrix (a big but mostly empty table). Each word gets its own column in the dataset, which tracks whether a word appears (binary) in the text OR how often the word appears (term-frequency). For example, consider the two statements below. They have been transformed into a simple term frequency matrix. Each word gets a distinct column, and the frequency of occurrence is tracked. If this were a binary matrix, there would only be ones and zeros instead of a count of the terms.
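As a concrete sketch, the transformation above can be reproduced in a few lines of Python. The documents and whitespace tokenization here are deliberately simple; a real pipeline would handle punctuation and casing first:

```python
from collections import Counter

def term_frequency_matrix(documents):
    """Build a term-frequency matrix: one row per document,
    one column per distinct word across all documents."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    counts = [Counter(doc) for doc in tokenized]
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix

docs = ["the team won the game", "the team lost"]
vocab, tf = term_frequency_matrix(docs)
print(vocab)  # ['game', 'lost', 'team', 'the', 'won']
print(tf)     # [[1, 0, 1, 2, 1], [0, 1, 1, 1, 0]]
# A binary matrix would record 1 wherever a word appears instead of a count:
binary = [[1 if n > 0 else 0 for n in row] for row in tf]
```

Note the sparsity already visible here: most entries are zero, and the zeros multiply as the vocabulary grows.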

Make words usable for machine learning

Text Mining

Why do we want numbers instead of text? Most machine learning algorithms and data analysis techniques assume numerical data (or data that can be ranked or categorized). Similarity between documents is then calculated from the distance between their word frequencies. For example, if the word “team” appears 4 times in one document and 5 times in a second document, the two will be calculated as more similar than a third document where the word “team” appears only once.
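A minimal illustration of that distance calculation, using Euclidean distance over (here, one-element) term-frequency vectors; real documents would have one entry per vocabulary word:

```python
import math

def euclidean_distance(vec_a, vec_b):
    """Distance between two term-frequency vectors; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))

# Frequencies of the word "team" in three documents:
doc1, doc2, doc3 = [4], [5], [1]
print(euclidean_distance(doc1, doc2))  # 1.0 -- doc1 and doc2 are more similar
print(euclidean_distance(doc1, doc3))  # 3.0
```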


Sample clusters

Text mining: Build a matrix

While our example was simple (6 words), term frequency matrices on larger datasets can be tricky.

Imagine turning every word in the Oxford English dictionary into a matrix, that’s 171,476 columns. Now imagine adding everyone’s names, every corporation or product or street name that ever existed. Now feed it slang. Feed it every rap song. Feed it fantasy novels like Lord of the Rings or Harry Potter so that our model will know what to do when it encounters “The Shire” or “Hogwarts.” Good, now that’s just English. Do the same thing again for Russian, Mandarin, and every other language.

After this is accomplished, we are approaching a several-billion-column matrix, and two problems arise. First, it becomes computationally infeasible and memory-intensive to perform calculations over this matrix. Second, the curse of dimensionality kicks in: distance measurements become so absurdly large in scale that they all seem the same. Most of the research and time that goes into natural language processing is less about the syntax of language (which is important) and more about how to reduce the size of this matrix.

Now that we know what we must do and the challenges we face in reaching our desired result, the next three blogs in the series will directly address these problems. We will introduce you to three concepts: conforming, stemming, and stop word removal.

Want to learn more about text mining and text analytics?

Check out our short video on our data science bootcamp curriculum page OR watch our video on tweet sentiment analysis.

June 15, 2022

LIGO is a gravitational observatory. Learn about the fascinating science behind the detectors and what the future holds.

In this presentation hosted by Data Science Dojo, Dr. Muzammil A. Arain discussed:

• The fascinating science behind its detectors

• The basics of gravitational waves

• Its organizational structure, and what the future holds

• The data processing techniques employed by the observatory that enabled gravitational wave detection

LIGO gravitational waves presentation

About the speaker

Dr. Muzammil A. Arain was listed as an author on the announcement of the detection of gravitational waves, in recognition of his research work at the Department of Physics at the University of Florida from 2005-2010. While the detectors were undergoing an upgrade to improve sensitivity and bandwidth, he worked on the layout of the recycling cavities and provided specific values for a possible design. The proposed design was adopted for the Advanced detectors that detected the gravitational waves.



June 15, 2022

Develop an understanding of text analytics, text conforming, and special character cleaning. Learn how to make text machine-readable.

Text analytics for machine learning: Part 2

Last week, in part 1 of our text analytics series, we talked about text processing for machine learning. We wrote about how we must transform text into a numeric table, called a term frequency matrix, so that our machine learning algorithms can apply mathematical computations to the text. However, we found that our textual data requires some data cleaning.

In this blog, we will cover the text conforming and special character cleaning parts of text analytics.

Understand how computers read text

The computer sees text differently from humans; computers cannot see anything other than numbers. Every character (letter) that we see on a computer is actually a numeric representation to the computer, with the mapping between numbers and characters determined by an “encoding table.” The simplest, and most common, is ASCII encoding. A small sample ASCII table is shown to the right.


To the left is a look at six different ways the word “CAFÉ” might be encoded in ASCII. The word on the left is what the human sees and its ASCII representation (what the computer sees) is on the right.

Any human would know that this is just six different spellings for the same word, but to a computer these are six different words. These would spawn six different columns in our term-frequency matrix. This will bloat our already enormous term-frequency matrix, as well as complicate or even prevent useful analysis.
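The point is easy to verify: Python’s built-in `ord` returns the numeric code the computer actually stores for each character, so the variant spellings produce entirely different number sequences (the accented é falls outside basic ASCII, encoded here as Unicode code point 233):

```python
# Three spellings of the same word, as the computer sees them:
for word in ["CAFE", "cafe", "Café"]:
    print(word, [ord(ch) for ch in word])
# CAFE [67, 65, 70, 69]
# cafe [99, 97, 102, 101]
# Café [67, 97, 102, 233]
```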


ASCII Representation

Unify words with the same spelling

To unify the six different “CAFÉ’s”, we can perform two simple global transformations.

Casing: First, we must convert all characters to the same casing, uppercase or lowercase. This is a common operation; most programming languages have a built-in function that converts all characters in a string to either lowercase or uppercase. We can choose either global lowercasing or global uppercasing; it does not matter as long as it is applied globally.

String normalization: Second, we must convert all accented characters to their unaccented variants. This is often called Unicode normalization, since accented and other special characters are usually encoded using the Unicode standard rather than the ASCII standard. Not all programming languages have this feature out of the box, but most have at least one package which will perform this function.

Note that implementations vary, so you should not mix and match Unicode normalization packages. What kind of normalization you do is highly language dependent, as characters which are interchangeable in English may not be in other languages (such as Italian, French, or Vietnamese).
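Putting the two transformations together in Python with the standard-library `unicodedata` module (a sketch suited to English text; as noted above, the right normalization is language-dependent):

```python
import unicodedata

def conform(text):
    """Lowercase, then strip accents, so variant spellings collapse to one form."""
    lowered = text.lower()
    # NFKD decomposes accented characters into a base letter plus a
    # combining mark; we then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", lowered)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

variants = ["CAFÉ", "Café", "café", "CAFE", "Cafe", "cafe"]
print({conform(v) for v in variants})  # {'cafe'} -- one column instead of six
```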

Remove special characters and numbers

The next thing we have to do is remove special characters and numbers. Numbers rarely contain useful meaning. Examples of such irrelevant numbers include footnote numbering and page numbering. Special characters, as discussed in the string normalization section, have a habit of bloating our term-frequency matrix. For instance, representing a quotation mark has been a pain-point since the beginning of computer science.

Unlike a letter, which may only be capital or not capital, quotation marks have many popular representations. A quotation character has three main properties: curly, straight, or angled; left or right; single, double, or triple. Depending on the text analytics encoding used, not all of these may exist.

ASCII Quotations
Properties of quotation characters

The table below shows how quoting the word “café” in both straight quote and left-right quotes would look in a UTF-8 table in Arial font.

UTF 8 Form

Avoid over-cleaning

The problem is further complicated by each individual font, operating system, and programming language since implementation of the various encoding standards is not always consistent. A common solution is to simply remove all special characters and numeric digits from the text. However, removing all special characters and numbers can have negative consequences.
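That blunt approach looks like this in Python; a regular expression keeps only ASCII letters and spaces, and the example already hints at what the cleaning costs:

```python
import re

def strip_special(text):
    """Remove everything except ASCII letters and spaces, then collapse
    the leftover runs of whitespace -- the blunt approach described above."""
    cleaned = re.sub(r"[^A-Za-z ]+", " ", text)
    return re.sub(r" +", " ", cleaned).strip()

print(strip_special('He said, "the café on I-405 is closed!"'))
# He said the caf on I is closed  -- "café" and "I-405" lost their meaning
```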

There is such a thing as too much data cleaning when it comes to text analytics. The more we clean and remove, the more “lost in translation” the textual message may become. We may inadvertently strip information or meaning from our messages, so that by the time our machine learning algorithm sees the textual data, much or all of the relevant information has been stripped away.

For each type of cleaning above, there are situations in which you will want to either skip it altogether or selectively apply it. As in all data science situations, experimentation and good domain knowledge are required to achieve the best results.

When do we want to avoid over-cleaning in our text analytics?

Special characters: The advent of email, social media, and text messaging have given rise to text-based emoticons represented by ASCII special characters.

For example, if you were building a sentiment predictor for text, text-based emoticons like “=)” or “>:(” are very indicative of sentiment because they directly signal happiness or sadness. Stripping our messages of these emoticons by removing special characters will also strip meaning from our message.
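One workaround is to pull known emoticons out before stripping special characters and add them back as tokens afterward. This is a sketch, and the tiny emoticon list is purely illustrative:

```python
import re

EMOTICONS = [r"=\)", r">:\(", r":\)"]  # illustrative list, far from complete

def clean_keep_emoticons(text):
    """Extract emoticons before stripping special characters, then append
    them back as tokens so their sentiment signal survives cleaning."""
    found = []
    for pattern in EMOTICONS:
        found.extend(re.findall(pattern, text))
    stripped = re.sub(r"[^A-Za-z ]+", " ", text)
    stripped = re.sub(r" +", " ", stripped).strip()
    return stripped + " " + " ".join(found) if found else stripped

print(clean_keep_emoticons("great service =)"))  # great service =)
```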

Numbers: Consider the infinitely gridlocked freeway in Washington state, “I-405.” In a sentiment predictor model, anytime someone talks about “I-405,” more likely than not the document should be classified as “negative.” However, by removing numbers and special characters, the word now becomes “I”. Our models will be unable to use this information, which, based on domain knowledge, we would expect to be a strong predictor.

Casing: Even cases can carry useful information sometimes. For instance, the word “trump” may carry a different sentiment than “Trump” with a capital T, representing someone’s last name.

One solution for filtering out proper nouns that may carry information is named entity recognition, where we use a combination of predefined dictionaries and scanning of the surrounding syntax (sometimes called “lexical analysis”). Using this, we can identify people, organizations, and locations.

Next, we’ll talk about stemming and lemmatization as a way to help computers understand that different versions of words can have the same meaning (e.g., run, running, runs).

Learn more

Want to learn more about text analytics? Check out the short video on our curriculum page.

June 15, 2022