Truth be told, the industry does not have an agreed-upon definition of a data scientist. Jokes such as “a data scientist is a data analyst living in Silicon Valley” are abundant. Below is one such cartoon, just for fun.
Finding an ‘effective’ data scientist is hard. Finding people who understand what a data scientist is can be equally difficult. Note the use of ‘effective’ here. I use this word to highlight the fact that there could be people who possess some of these skills yet may not be the best fit for a data science role. The irony is that even the people looking to hire data scientists often do not understand data science. Hiring managers post job descriptions for traditional data analyst and business analyst roles while calling them ‘Data Scientist’ positions.
Instead of giving a list of skills with bullet points, I will highlight the difference between some of the data-related roles.
Consider the following scenario: Shop-Mart and Bulk-Mart are two competitors in the retail setting. Someone high up in the management chain asks this question: “How many Shop-Mart customers also go to Bulk-Mart?” Replace Shop-Mart and Bulk-Mart with WalMart, Target, Safeway or any retail outlets that you know of. The question might be of interest to management of one of these stores or even a third party. The third party could possibly be a market research or consumer behavior company, interested in gathering actionable insights about consumer behavior.
Here is how professionals in different data-related roles will approach the problem:
Traditional BI/Reporting Professional: Generate reports from structured data using SQL and some kind of reporting services (SSRS for instance) and send the data back to management. Management asks more questions based on the data that was sent, and the cycle continues. Insights about the data are most likely not included in the reports. A person in this role will be experienced mostly in database-related skills.
Data Analyst: In addition to doing what the BI guy did, a data analyst will also keep other factors like seasonality, segmentation, and visualization in mind. What if certain trends in shopping behavior are tied to seasonality? What if the trends are different across gender, demographics, geography, or product category? A data analyst will slice and dice the data to understand and annotate the report. Aside from database skills, a data analyst will have an understanding of some of the common visualization tools.
Business Analyst: A business analyst possesses the skills of the BI guy and the data analyst, plus they have domain knowledge and an understanding of the business. A business analyst may also have some basic skills in forecasting.
Data Mining or Big Data Engineer: A data miner will do what the data analyst did, possibly from unstructured data if needed. MapReduce and other big data skills may be needed, along with an understanding of common issues in running jobs on large-scale data and in debugging MapReduce jobs.
Statistician (A traditional one): Pull data from a database or obtain it from any of the roles mentioned above and perform statistical analysis. This person ensures the quality of data and correctness of the conclusions by using standard practices like choosing the right sample size, confidence level, level of significance, type of test, and so on.
Traditionally, statisticians did not possess the CS background needed to write a lot of code. However, the situation has changed recently. Statistics departments at most schools have evolved so that statisticians graduate with strong programming skills and a decent foundation in CS, enabling them to perform tasks that statisticians were not traditionally trained for.
Program/Project Manager: Look at the data provided by the professionals mentioned so far, align business with the findings, and influence the leadership to take appropriate action. This person possesses communication skills, presentation skills, and can influence without authority.
Ironically, a PM is influencing business decisions using the data and insights provided by others. If the person does not have a knack for understanding data, chances are that they will not be able to influence others to make the correct decisions.
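To make the differences between these roles a little more concrete, here is a minimal Python sketch of how the Shop-Mart and Bulk-Mart question might be worked through. Everything in it is an assumption made for illustration: the `visits` records, the `region` field, and the helper names are hypothetical, and the Wilson interval is just one reasonable choice for the statistician's step, not a prescribed method.

```python
import math
from collections import Counter

# Hypothetical visit records: (customer_id, store, region).
# In practice these would come from a database or log files.
visits = [
    (1, "Shop-Mart", "west"), (1, "Bulk-Mart", "west"),
    (2, "Shop-Mart", "west"),
    (3, "Shop-Mart", "east"), (3, "Bulk-Mart", "east"),
    (4, "Bulk-Mart", "east"),
    (5, "Shop-Mart", "east"),
]

def customers_of(store):
    """Return the set of customer ids seen at the given store."""
    return {cid for cid, s, _ in visits if s == store}

# BI/reporting view: a single number answering the question as asked.
shop = customers_of("Shop-Mart")
bulk = customers_of("Bulk-Mart")
overlap = shop & bulk
overlap_count = len(overlap)
overlap_share = overlap_count / len(shop)

# Data analyst view: slice the same overlap by a segment (here, region).
region_of = {cid: region for cid, _, region in visits}
overlap_by_region = Counter(region_of[cid] for cid in overlap)

# Statistician view: if these customers are only a sample, report an
# interval rather than a point estimate (Wilson score interval, 95%).
def wilson_interval(successes, n, z=1.96):
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

low, high = wilson_interval(overlap_count, len(shop))

print(f"{overlap_count} of {len(shop)} Shop-Mart customers also visit Bulk-Mart")
print(f"by region: {dict(overlap_by_region)}")
print(f"95% interval for the share: {low:.2f} to {high:.2f}")
```

With so few customers the interval is very wide, which is exactly the kind of caveat a statistician would attach before management acts on the number.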
Now, putting it all together.
The rise of online services has brought a paradigm shift in the software development life cycle and in how businesses iterate over successive features and products. Having a separate data puller, analyst, statistician, and project manager is just not feasible anymore. Now the mantra is: ship, experiment, learn, adapt, ship, experiment, learn… This situation has resulted in the birth of a new role, the Data Scientist.
A data scientist should have the skills of all the individuals I have mentioned so far. In addition to the skills mentioned above, a data scientist should have rapid prototyping and programming, machine learning, visualization, and hacking skills.
Domain Knowledge and Soft Skills Are As Important As Technical Skills: The importance of domain knowledge and soft skills, like communication and influencing without authority, is severely underestimated by both hiring managers and aspiring data scientists. Insights without domain knowledge can potentially mislead the consumers of these insights. Correct insights without the ability to influence decision making are just as bad as having no insights.
All of what I have said above is based on my own tenure as a data scientist at a major search engine and later with the advertising platform within the same company. I learned that sometimes people asking the question may not understand what they want to know. This sounds preposterous yet it happens way too often. Very often a bozo will start digging into something that is not related to the issue at hand just to prove that he/she is relevant. A data scientist encounters such HIPPOs (Highly Paid Person’s Opinions) that are somewhat unrelated to the problem and are very often a big distraction from the problem at hand.
A data scientist should possess the right soft skills to manage situations where people ask irrelevant, distracting questions that are outside the scope of the task at hand. This is hard, especially in situations where the person asking the question is several levels up the corporate ladder and is known to have an ego. It is a data scientist’s responsibility to manage up and around while presenting and communicating insights.
Below is a summary of necessary skills a data scientist should possess, in my opinion:
Curiosity About Data and Passion For Domain: If you are not passionate about the domain or business, and if you are not curious about data, then it is unlikely that you will succeed in a data scientist role. If you are working as a data scientist with an online retailer, you should be hungry to crunch and munch from the smorgasbord (of data of course) to know more. If your curiosity does not keep you awake, no skill in the world can help you succeed.
Soft Skills: Communication and influencing without authority are necessary skills. Understand the minimum action that has the maximum impact. Too many findings are as bad as no findings at all. The ability to scoop information out of partners and customers, even from the unwilling ones, is extremely important. The data you are looking for may not be sitting in one single place. You may have to beg, borrow, steal, and do whatever it takes to get the data.
Being a good storyteller also helps. Sometimes the insights obtained from data are counterintuitive. If you are not a good storyteller, it will be difficult to convince your audience.
Math/Theory: Machine learning, plus introductory statistics and probability. Optimization would be icing on the cake.
CS/Programming: You should know at least one scripting language (I prefer Python). It is necessary to possess decent algorithms and data structures skills in order to write code that can analyze a lot of data efficiently. You may not be a production code developer, but you should be able to write decent code. Database management and SQL skills are helpful. Knowledge of a statistical computing package is crucial; most people, including myself, prefer R. You should understand Excel or another spreadsheet program.
Big Data and Distributed Systems: Understand basic MapReduce concepts, Hadoop and the Hadoop Distributed File System (HDFS), and at least one higher-level language such as Hive or Pig. Some companies have their own proprietary implementations of these languages. Knowledge of tools like Mahout and any of the XaaS offerings, like Azure and AWS, would be helpful. Once again, big companies have their own XaaS, so you may be working on variants of any of these.
Visualization: Possess the ability to create simple yet elegant and meaningful visualizations. Personally, R packages like ggplot2, lattice, and others have helped me in most cases, but there are other packages that you can use. In some cases, you might want to use D3.
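For readers new to the basic MapReduce concepts mentioned under Big Data and Distributed Systems, the following is a small in-memory Python sketch of the map, shuffle, and reduce phases. It is not Hadoop code and does not use any Hadoop API; the `records` data and the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical and chosen only for illustration. A real framework would run these phases in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input records: (customer_id, store).
records = [
    ("c1", "Shop-Mart"), ("c1", "Bulk-Mart"),
    ("c2", "Shop-Mart"),
    ("c3", "Bulk-Mart"), ("c3", "Shop-Mart"), ("c3", "Shop-Mart"),
]

def map_phase(record):
    """Map: emit (key, value) pairs; here the key is the customer id."""
    customer_id, store = record
    yield customer_id, store

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(customer_id, stores):
    """Reduce: collapse each key's values into one result,
    here the number of distinct stores the customer visited."""
    return customer_id, len(set(stores))

mapped = chain.from_iterable(map_phase(r) for r in records)
grouped = shuffle(mapped)
stores_per_customer = dict(reduce_phase(k, v) for k, v in grouped.items())

# Customers seen at more than one store.
multi_store = sorted(c for c, n in stores_per_customer.items() if n > 1)
print(stores_per_customer)
print(multi_store)
```

The point of the structure is that the map and reduce functions only ever see one record or one key at a time, which is what lets a framework distribute the work.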
Below is a high-level visualization of the skills needed to become a data scientist:
Where is a data scientist in the big data pipeline? Below is a visualization of the big data pipeline, the associated technologies, and the regions of operation. In general, the depiction of where the data scientist belongs in this pipeline is largely correct, but there is one caveat. A data scientist should be comfortable diving into the ‘Collect’ and ‘Store’ territories if needed. Usually, data scientists would be working on transformed data and beyond. However, in scenarios where the business cannot afford to wait for the transformation process to finish, a data scientist has to turn to raw data to gather insights.