Redmond Startup Enables Non-Profit Data Science Training

Redmond, Washington – March 7, 2016 – Data Science Dojo, an educational start-up, has selected its first non-profit applicant to receive tuition-free training in data science and data engineering.

Michael D’Andrea from the Sustainability Accounting Standards Board (SASB), a 501(c)(3) non-profit, will attend the company’s 5-day Seattle bootcamp, scheduled for March 28 – April 1, 2016. As a Data Science Dojo Fellow, he receives 50 hours of tuition-free training, books and materials, and some travel expenses.

Why does Michael want to attend a Seattle bootcamp on data science?

“(SASB’s) mission is dependent on researching large amounts of structured and unstructured sustainability data for trends that support the greater disclosure of sustainability information for the public. However, as a 35-person organization without significant technology resources and awareness, it is challenging for us to explore possible means to scale our research efforts with data science.”

As data science becomes increasingly central to research and fundraising efforts, non-profits want to take advantage of these new tools. However, maintaining an in-house data science team is expensive. The Data Science Dojo Fellowship program offers non-profits the opportunity to train one of their own employees in data science and engineering.

By the end of the class, Michael may not be a data scientist, but he will be able to help SASB perform research more efficiently, saving time and money.

About the Data Science Dojo Fellowship Program:

The Data Science Dojo Fellowship program is open to students and non-profit employees. Four fellows are selected for each calendar year. To date, Data Science Dojo has selected six student fellows and one non-profit employee. Interested students and non-profit employees can submit their application at https://datasciencedojo.com/bootcamp/fellowship/

About Data Science Dojo:

Data Science Dojo is an education startup dedicated to enabling professionals to extract actionable insights from data. Our 5-day, intensive bootcamps and corporate trainings consist of hands-on labs, critical thinking sessions and a data engineering “hack day.” Graduates deploy predictive models and evaluate the effectiveness of different machine learning algorithms. Through these bootcamps, we are building a community of mentors, students and professionals committed to unleashing the potential of data science.

Deploy the Models!

Have you noticed that we have two demos on our site that allow you to deploy predictive models? The Titanic Survival Predictor is designed to work with a Microsoft Azure model. The AWS Machine Learning Caller is our new demo that connects to an Amazon Machine Learning model.

The idea is that you can use Microsoft Azure ML or Amazon ML to build a machine learning model, and then use our demo to input values for the prediction. Each ML program provides an endpoint that you can use to access the model and run predictions. Our demos interface with that endpoint and provide a graphic user interface for making predictions.
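As a rough illustration of what "interfacing with an endpoint" means, the sketch below builds (but does not send) an HTTP POST request to a prediction endpoint. The URL, the API key, the column names, and the JSON layout are all placeholders assumed for this example; the real request format depends on the service and on how the model was published.

```python
import json
import urllib.request


def build_prediction_request(endpoint_url, api_key, features):
    """Build a POST request carrying one row of input values as JSON.

    The payload layout is a hypothetical, generic one; a real service
    defines its own request schema.
    """
    body = json.dumps({"inputs": [features]}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


# Placeholder values, not a real endpoint or key.
request = build_prediction_request(
    "https://example.com/score",
    "YOUR_API_KEY",
    {"Pclass": 3, "Sex": "male", "Age": 22},
)
# Sending it would be: urllib.request.urlopen(request)
```

A demo like ours essentially wraps this kind of call in a form, so the user types values instead of writing JSON.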

So what’s the difference between the demos?

First of all, the backend is pretty different. But we’ll keep this short and sweet.

The graphic below shows what types of models can be run through the demo.

  • The cruise ship represents the Titanic classification model generated from our Azure ML tutorial.
  • The iris represents any classification model, such as a model used to predict species from a set of measurements.
  • The complicated graph represents a regression model. Regression models are used to predict a number given a set of input numbers.

[Image: AWS vs. Azure demo]

You can see that the Titanic model can link to both demos, but the generic classification (iris) model only links to our Amazon demo. The regression model does not work with either of our demos.

The demos are currently limited to classification models only (because linear regression models work differently and require a different backend).

From the user perspective, the Titanic Survival Predictor is built for a specific purpose. It interfaces with the exact Titanic classification model that we created for Azure and is included as part of our bootcamp. Users can change all the tuning parameters and make the model unique. However, the input variables, or “schema,” must be labeled the same way as in the original model or it won’t work. So, if you rename one of the columns, the demo will return an error. That said, since we published the Azure model online, it’s pretty easy to copy the model and change some parameters.
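To see why a renamed column breaks things, here is a minimal sketch of an exact-name schema check. The column names are hypothetical stand-ins for a Titanic-style model, and this is not the demo's actual code, only an illustration of the idea.

```python
def schema_mismatches(expected_columns, actual_columns):
    """Return (missing, unexpected) column names, compared exactly."""
    expected = set(expected_columns)
    actual = set(actual_columns)
    return sorted(expected - actual), sorted(actual - expected)


# Hypothetical column names for a Titanic-style model.
original = ["Pclass", "Sex", "Age", "Fare"]
renamed = ["Pclass", "Gender", "Age", "Fare"]

missing, unexpected = schema_mismatches(original, renamed)
print(missing, unexpected)  # ['Sex'] ['Gender']
```

Renaming "Sex" to "Gender" leaves the model's inputs intact in spirit, but a caller that looks columns up by name no longer finds what it expects.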

To get your predictive model to work with our Titanic Survival Predictor demo, you’ll need the following information:

  • Name (used to generate your own url)
  • Post URL (or endpoint)
  • API key

The AWS Machine Learning Caller is not built for a specific dataset like Titanic. It will work with any logistic regression model built in Amazon Machine Learning. When you input your access keys and model id, our demo automatically pulls the schema from Amazon. It does not require a specific schema like our Titanic Survival Predictor.

To get your predictive model to work with our AWS Machine Learning Caller demo, you’ll need the following information:

  • Access key
  • Secret access key
  • AWS Account Region
  • AWS ML Model ID

Why have two demos that do similar things?

These are training tools for our 5-day bootcamp. We use Microsoft Azure to teach classification models. The software has tools for data cleaning and manipulation. The way that the tools are laid out is visual and easy to understand. It provides a clear organization of the processes: input data, clean data, build a model, evaluate the model, and deploy the model. Microsoft Azure has been a great way to teach the model-building process.

We’ve recently added Amazon Machine Learning to our curriculum. The program is simpler: all the processes described above are automated, and Amazon ML walks users through them. However, it provides slightly different evaluation metrics than Microsoft Azure, so we use it to teach regression and classification models as well.

We are always looking for ways to incorporate new tools into our curriculum. If there is a tool that you think we ought to have, please let us know in the comments.

Understanding Individual Political Contribution by Occupation of Top 1% vs Bottom 99%

A political candidate needs not only votes but also money. In today’s multimedia world, millions of dollars are necessary to run an effective campaign. To win the election battle, campaigns bombard citizens with ads that cost millions. Other mounting expenses include wages for staff, consultants, surveyors, grassroots activists, media experts, wonks, and policy analysts. The figures are staggering, with the next presidential election year campaigns likely to cost more than ten billion dollars.

[Chart: election cost by cycle, 1998 to 2014]

OpenSecrets.org has summarized, by cycle, the money spent by presidential candidates, Senate and House candidates, political parties, and independent interest groups that played an influential role in the federal elections. Clearly, there is no sign of spending slowing down in future elections.

The 2016 presidential election cycle is already underway, and the fundraising war has already begun. The Koch brothers’ political organization announced an $889 million budget in January 2015 to support conservative campaigns in the 2016 presidential contest. As for the primary candidates, the Hillary Clinton campaign aims to raise at least $100 million for the primary election. On the other side of the political aisle, analysts speculate that primary candidate Jeb Bush will likely have raised over $100 million when he discloses his financial position in July.

In my mind, I imagine that money coming from millionaires, billionaires, or mega-corporations intent on promoting candidates that favor their cause. But who are these people really? And how about a middle-class citizen like me? Does my paltry $200 amount to anything? Does the spending power of the 99% have any impact on the outcome of an election? Even as a novice, I knew I would never understand American politics by listening to TV talking heads or the candidates and their say-nothing ads, but rather by following the money. By investigating real data about where the stream of money dominating our elections comes from and the role it plays in the success of an election, I hope to find some insight among all the political noise. Thanks to the Federal Election Campaign Act, which requires candidate committees, party committees, and political action committees (PACs) to disclose reports on the money they raise and spend and to identify individuals who give more than $200 in an election cycle, a wealth of public data exists to explore. I chose to focus on individual contributions to federal committees greater than $200 for the 2011-2012 election cycle.

The data is publicly available at http://www.fec.gov/finance/disclosure/ftpdet.shtml.

In the 2012 election cycle, which includes congressional and primary elections, the total amount of individual donations collected was USD 784 million. USD 220 million came from the top 1% of donors, which made up 28% of the total contribution. These wealthy elite donors were 7,119 individuals, each of whom had donated at least USD 10,000 to federal committees. So, who are the top 1%? What do they do for a living that gives them such financial power to support political committees? The unique occupation titles in the dataset are simply too numerous to analyze directly. Thus, these occupations were classified into 22 occupation groups according to the employment definitions from the Bureau of Labor Statistics. Additional categories were created for occupations that lacked a definition placing them in an appropriate group. Among them are “Retired”, “Unemployed”, “Homemaker”, and “Politicians”.
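For readers who want to check the arithmetic, here is a small sketch of how a top-1% split can be computed from per-donor totals. The function name and the toy donor list are my own illustration, not the exact code used for this analysis; only the USD 220 million and USD 784 million figures come from the data above.

```python
def top_share(donor_totals, fraction=0.01):
    """Return (cutoff, share) for the top `fraction` of donors by amount."""
    ranked = sorted(donor_totals, reverse=True)
    count = max(1, round(len(ranked) * fraction))
    top = ranked[:count]
    # cutoff is the smallest donation that still lands in the top group
    return top[-1], sum(top) / sum(ranked)


# Figures quoted in the text, in USD millions.
quoted_share = round(220 / 784 * 100)  # 28 (percent)

# Toy data: 99 donors giving 500 each and one giving 50,500.
toy = [500] * 99 + [50_500]
cutoff, share = top_share(toy)  # cutoff 50500, share 0.505
```

In the toy data a single donor out of one hundred accounts for just over half of all the money, which is the same kind of concentration the real data shows on a larger scale.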

Immediately, from Figure 1, we observe that the “Management” occupation group contributed the highest total amount in the 2012 cycle for Democrats, Republicans, and Other parties alike. Other top donors by occupation group are “Business and Financial Operations”, “Retired”, “Homemaker”, “Politicians”, and “Legal”. Overall, Republican committees received more individual contributions from most of the occupation groups, with the noticeable exceptions of “Legal” and “Arts, Design, Entertainment, Sports and Media”. The total contribution given to “Other” (non-Democratic/Republican) parties was minuscule in comparison.

Figure 1: Total contribution of Top 1% by Occupation Group

One might conclude that the reason the “Management” group is the top donor is obvious, given that these people are CEOs, CFOs, presidents, directors, managers, and holders of many other management titles in a company. According to the Bureau of Labor Statistics, the “Management” group earned the highest median wages among all occupation groups. Perhaps they simply had more to give. The same argument could be applied to the “Business and Financial Operations” group, which comprises people who held jobs as investors, business owners, real estate developers, bankers, and so on.

Perhaps we could look at the individual contribution by occupation group from another angle. When analyzing the average contribution by occupation group, the “Politicians” group rose to the top of the chart. Individuals belonging to this category either currently hold public office or had declared candidacy for office with no other occupation reported. Since there is no limit on how much candidates may contribute to their own committee, this group represents rich individuals funding their own campaigns.

Figure 2: Average contribution of Top 1% by Occupation Groups

Suspiciously, the average amount per politician given to Republican committees is dramatically higher than for other parties. Further analysis indicated that the outlier is candidate Jon Huntsman, who donated about USD 5 million to his own committee, Jon Huntsman for President Inc. This inflated the average contribution dramatically. The same phenomenon was also observed among the “Management” group, where the average contribution to “Other” parties was significantly higher compared to the traditional parties. Out of the five donors from the “Management” group who contributed to an independent party, William Bloomfield alone donated USD 1.3 million (out of the USD 1.45 million total amount collected) to his Bloomfield For Congress committee. According to the data, he was the Chairman of Baron Real Estate. This is an example of a wealthy elite spending a hefty sum of money to buy his way into the election race. Donald Trump, a billionaire business mogul, made headlines recently by declaring his intention to run for president in the 2016 election. He certainly has no trouble paying for his own campaign.
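The effect of a single large donor on an average is easy to reproduce. In the sketch below, the USD 1.3 million donation and the USD 1.45 million total for five donors come from the data; how the remaining USD 150,000 was split among the other four donors is not stated here, so the equal split is purely an assumption for illustration.

```python
from statistics import mean, median

# One donor gave 1.3M of the 1.45M total. The equal split of the
# remaining 150,000 across four donors is assumed, not from the data.
donations = [1_300_000, 37_500, 37_500, 37_500, 37_500]

average = mean(donations)    # 290,000: pulled up by the single outlier
typical = median(donations)  # 37,500: unaffected by the outlier
```

Under that assumption the average is almost eight times the median, which is why the bars for these groups tower over the rest of the chart.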

After excluding the occupation groups “Politicians” and “Management” to make the comparison among the remaining groups easier to see, the contrast became less dramatic. Still, the average contribution to Republican committees is consistently higher than to other parties in most of the occupation groups.

Figure 3: Average contribution of Top 1% by Occupation Group excluding Politicians and Management group

Could a similar story be told for the bottom 99%? Overall, the top 5 contributors by occupation group are quite similar between the top 1% and the bottom 99%. Once again, the “Management” group collectively gave the largest amount of donations to the Democratic and Republican parties. The biggest difference here is that the “Politicians” group is no longer the top contributor in the bottom 99% demographic.

Figure 4: Total contribution of bottom 99% by Occupation Group

Homemakers consistently rank high in both total contribution and average contribution, in both the top 1% and the bottom 99%. On average, homemakers from the bottom 99% donated about $1,500, while homemakers from the top 1% donated about $30,000 to their chosen political committees. Clearly, across all levels of socio-economic status, spouses and stay-at-home parents play an important role in the fundraising war. Since the term “Homemaker” is not well-defined, I can only assume their money comes from a spouse, inherited wealth, or personal savings.

Figure 5: Average contribution of bottom 99% by Occupation Group

Another observation we could draw from the plot of average contribution for the 99% is that “Other” (non-Democratic/Republican) parties depend heavily on the 99% as a source of funding for their political campaigns. Third-party candidates appear to be drawing most of their support from the little guy.

Figure 6: Median wages and Median Contribution by occupation group

Another interesting question warranting further investigation is whether the amount individuals contribute to political committees is proportionate to their income across occupation groups. When we plotted median wages per occupation group side by side with median political contribution, the median donation per group was rather constant, while the median income varied significantly across groups. This implies that, despite contributing the most overall, the wealthiest donors contributed the least as a percentage of their income.
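The reasoning behind that last point can be sketched in a few lines. The group names and dollar figures below are made up purely to show the calculation; they are not values from Figure 6.

```python
def contribution_rate(median_contribution, median_wage):
    """Median contribution as a percentage of median annual wage."""
    return 100 * median_contribution / median_wage


# Hypothetical groups: same median donation, very different median wage.
groups = {
    "Higher-wage group": (1_000, 100_000),
    "Lower-wage group": (1_000, 40_000),
}
rates = {name: contribution_rate(c, w) for name, (c, w) in groups.items()}
# Higher-wage group: 1.0 percent, Lower-wage group: 2.5 percent
```

When the donation stays flat while the wage grows, the rate can only go down, so the better-paid group ends up giving the smaller share of its income.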

The take-home message from this analysis is that the top 1% wealthy elite seems to be driving the momentum of fundraising for election campaigns. I suspect most of them fully intend to support candidates who would look out for their personal interests if elected. We middle-class citizens may not have the ability to compete financially with these millionaires and billionaires, but our single vote is as powerful as theirs. The best thing we can do as citizens is to educate ourselves on the issues that matter to the future of our country.

We have also published a Political Party Affiliation Prediction Model demo on the Data Science Dojo site. For further information, please visit http://demos.datasciencedojo.com/demo/political-party/

What are the key skills of a Data Scientist?

Truth be told, the industry does not have an agreed-upon definition of a data scientist. Jokes such as “a data scientist is a data analyst living in Silicon Valley” are abundant. Below is one such cartoon, just for fun.

[Cartoon: a data scientist is a business analyst that lives in California]

Finding an ‘effective’ data scientist is hard. Finding people who understand who a data scientist is can be equally difficult. Note the use of ‘effective’ here. I use this word to highlight the fact that there could be people who possess some of these skills yet may not be the best fit in a data science role. The irony is that even the people looking to hire data scientists do not understand data science. Hiring managers post job descriptions for traditional data analyst and business analyst roles while calling them ‘Data Scientist’ positions.

Instead of giving a list of skills with bullet points, I will highlight the difference between some of the data-related roles.

Consider the following scenario: Shop-Mart and Bulk-Mart are two competitors in the retail setting. Someone high up in the management chain asks this question: “How many Shop-Mart customers also go to Bulk-Mart?” Replace Shop-Mart and Bulk-Mart with WalMart, Target, Safeway or any retail outlets that you know of. The question might be of interest to management of one of these stores or even a third party. The third party could possibly be a market research or consumer behavior company, interested in gathering actionable insights about consumer behavior.

Here is how professionals in different data-related roles will approach the problem:

Traditional BI/Reporting Professional: Generate reports from structured data using SQL and some kind of reporting services (SSRS for instance) and send the data back to management. Management asks more questions based on the data that was sent, and the cycle continues. Insights about the data are most likely not included in the reports. A person in this role will be experienced mostly in database-related skills.
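As an illustration, the kind of query behind such a report might look like the sketch below, run here against a throwaway in-memory SQLite database. The table names, the column name, and the sample rows are all invented for the example.

```python
import sqlite3

# In-memory database with two hypothetical visit tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shop_mart_visits (customer_id TEXT);
    CREATE TABLE bulk_mart_visits (customer_id TEXT);
    INSERT INTO shop_mart_visits VALUES ('a'), ('b'), ('b'), ('c');
    INSERT INTO bulk_mart_visits VALUES ('b'), ('c'), ('d');
""")

# INTERSECT keeps each customer once, even with repeat visits.
overlap = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT customer_id FROM shop_mart_visits
        INTERSECT
        SELECT customer_id FROM bulk_mart_visits
    )
""").fetchone()[0]
print(overlap)  # 2
```

The answer is a single number, which is exactly why management comes back with follow-up questions.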

Data Analyst: In addition to doing what the BI guy did, a data analyst will also keep other factors like seasonality, segmentation, and visualization in mind. What if certain trends in shopping behavior are tied to seasonality? What if the trends are different across gender, demographics, geography, or product category? A data analyst will slice and dice the data to understand and annotate the report. Aside from database skills, a data analyst will have an understanding of some of the common visualization tools.
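A minimal sketch of that slicing step, assuming each customer record already carries a segment label and a flag for whether the customer also shops at Bulk-Mart (both fields and all rows are hypothetical):

```python
from collections import defaultdict

# Hypothetical records: (customer_id, region, also_shops_at_bulk_mart)
customers = [
    ("a", "West", True),
    ("b", "West", False),
    ("c", "East", True),
    ("d", "East", True),
]

totals = defaultdict(int)
overlaps = defaultdict(int)
for _, region, also_bulk in customers:
    totals[region] += 1
    overlaps[region] += also_bulk  # True counts as 1, False as 0

overlap_rate = {region: overlaps[region] / totals[region] for region in totals}
# West: 0.5, East: 1.0
```

The same loop works for any other segment, such as season or product category, by swapping the field used as the key.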

Business Analyst: A business analyst possesses the skills of the BI guy and the data analyst, plus they have domain knowledge and an understanding of the business. A business analyst may also have some basic skills in forecasting.

Data Mining or Big Data Engineer: A data miner will do what the data analyst did, possibly from unstructured data if needed. MapReduce and other big data skills may be needed, along with an understanding of common issues in running jobs on large-scale data and in debugging MapReduce jobs.

Statistician (A traditional one): Pull data from a database or obtain it from any of the roles mentioned above and perform statistical analysis. This person ensures the quality of data and correctness of the conclusions by using standard practices like choosing the right sample size, confidence level, level of significance, type of test, and so on.

Traditionally, statisticians did not possess the CS background needed for writing a lot of code. However, the situation has changed recently. Statistics departments at most schools have evolved so that statisticians graduate with strong programming skills and a decent foundation in CS, enabling them to perform tasks that statisticians were not traditionally trained for.
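To make "choosing the right sample size" concrete, here is a sketch using the standard normal-approximation formula for estimating a proportion, n = z² · p(1 − p) / e². The function name is mine, and using p = 0.5 as the most conservative guess is an assumption of the example.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(confidence, margin_of_error, p=0.5):
    """Sample size needed to estimate a proportion within a margin of error."""
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)


# 95% confidence, plus or minus 5 percentage points.
n = sample_size_for_proportion(0.95, 0.05)  # 385
```

In the Shop-Mart scenario, this is roughly how many customers one would need to survey to estimate the share who also visit Bulk-Mart to within five points.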

Program/Project Manager: Look at the data provided by the professionals mentioned so far, align business with the findings, and influence the leadership to take appropriate action. This person possesses communication skills, presentation skills, and can influence without authority.

Ironically a PM is influencing business decisions using the data and insights provided by others. If the person does not have a knack for understanding data, chances are that they will not be able to influence others to make the correct decisions.

Now, putting it all together.

The rise of online services has brought a paradigm shift in the software development life cycle and business iteration over successive features and products. Having a different data puller, analyst, statistician, and project manager is just not possible any more. Now the mantra is: ship, experiment, and learn, adapt, ship, experiment, and learn… This situation has resulted in the birth of a new role, a Data Scientist.

A data scientist should have the skills of all the individuals I have mentioned so far. In addition to the skills mentioned above, a data scientist should have rapid prototyping and programming, machine learning, visualization, and hacking skills.

Domain Knowledge and Soft Skills Are As Important As Technical Skills: The importance of domain knowledge and soft skills, like communication and influencing without authority, is severely underestimated by both hiring managers and aspiring data scientists. Insights without domain knowledge can potentially mislead the consumers of these insights. Correct insights without the ability to influence decision making are just as bad as having no insights.

All of what I have said above is based on my own tenure as a data scientist at a major search engine and later with the advertising platform within the same company. I learned that sometimes people asking the question may not understand what they want to know. This sounds preposterous yet it happens way too often. Very often a bozo will start digging into something that is not related to the issue at hand just to prove that he/she is relevant. A data scientist encounters such HIPPOs (Highly Paid Person’s Opinions) that are somewhat unrelated to the problem and are very often a big distraction from the problem at hand.

A data scientist should possess the right soft skills to manage situations where people ask irrelevant, distracting questions that are outside the scope of the task at hand. This is hard, especially in situations where the person asking the question is several levels up the corporate ladder and is known to have an ego. It is a data scientist’s responsibility to manage up and around while presenting and communicating insights.

Below is a summary of necessary skills a data scientist should possess, in my opinion:

Curiosity About Data and Passion For Domain: If you are not passionate about the domain or business, and if you are not curious about data, then it is unlikely that you will succeed in a data scientist role. If you are working as a data scientist with an online retailer, you should be hungry to crunch and munch from the smorgasbord (of data of course) to know more. If your curiosity does not keep you awake, no skill in the world can help you succeed.

Soft Skills: Communication and influencing without authority are necessary skills. Understand the minimum action that has the maximum impact. Too many findings are as bad as no findings at all.  The ability to scoop information out of partners and customers, even from the unwilling ones, is extremely important. The data you are looking for may not be sitting in one single place. You may have to beg, borrow, steal, and do whatever it takes to get the data.

Being a good story teller is also something that helps. Sometimes the insights obtained from data are counterintuitive. If you are not a good story teller, it will be difficult to convince your audience.

Math/Theory: Machine Learning. Stats and Probability 101. Optimization would be icing on the cake.

CS/Programming: You should know at least one scripting language (I prefer Python). It is necessary to possess decent algorithms and data structures skills in order to write code that can analyze a lot of data efficiently. You may not be a production code developer, but you should be able to write decent code. Database management and SQL skills are helpful. Knowledge of a statistical computing package is crucial; most people, including myself, prefer R. You should understand Excel or other spreadsheet software.

Big Data and Distributed Systems: Understand basic MapReduce concepts, Hadoop and Hadoop file system, and at least one language like Hive/Pig. Some companies have their own proprietary implementations of these languages. Knowledge of tools like Mahout and any of the XaaS, like Azure and AWS, would be helpful. Once again, big companies have their own XaaS, so you may be working on variants of any of these.
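For anyone new to the idea, the basic MapReduce pattern can be sketched in plain Python with the classic word count. This runs on one machine and only mimics the three phases; a real Hadoop job distributes the same steps across a cluster.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (word, 1) pair per word in the record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine the values for one key into a single count."""
    return key, sum(values)


records = ["big data", "Big insights", "data data"]
pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts: big -> 2, data -> 3, insights -> 1
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread over many machines, which is the whole point of the model.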

Visualization: Possess the ability to create simple yet elegant and meaningful visualizations. Personally, R packages like ggplot, lattice, and others have helped me in most cases, but there are other packages that you can use. In some cases, you might want to use D3.

Below is a high-level visualization of the skills needed to become a data scientist:

Data Science Skillset

Where is a data scientist in the big data pipeline? Below is a visualization of the big data pipeline, the associated technologies, and the regions of operation. In general, the depiction of where the data scientist belongs in this pipeline is largely correct, but there is one caveat. A data scientist should be comfortable diving into the ‘Collect’ and ‘Store’ territories if needed. Usually, data scientists would be working on transformed data and beyond. However, in scenarios where the business cannot afford to wait for the transformation process to finish, a data scientist has to turn to raw data to gather insights.

Big Data Technologies Platforms and Products