Data Analyst: In addition to doing what the BI guy did, a data analyst will also keep other factors like seasonality, segmentation, and visualization in mind. What if certain trends in shopping behavior are tied to seasonality? What if the trends are different across gender, demographics, geography, or product category? A data analyst will slice and dice the data to understand and annotate the report. Aside from database skills, a data analyst will have an understanding of some of the common visualization tools.
Business Analyst: A business analyst possesses the skills of the BI guy and the data analyst, plus they have domain knowledge and an understanding of the business. A business analyst may also have some basic skills in forecasting.
Data Mining or Big Data Engineer: A data miner will do what the data analyst did, possibly from unstructured data if needed. MapReduce and other big data skills may be needed, along with an understanding of common issues in running jobs on large-scale data and in debugging MapReduce jobs.
Statistician (A traditional one): Pull data from a database or obtain it from any of the roles mentioned above and perform statistical analysis. This person ensures the quality of data and correctness of the conclusions by using standard practices like choosing the right sample size, confidence level, level of significance, type of test, and so on.
Traditionally, statisticians did not possess the CS background needed for writing a lot of code. However, the situation has changed recently. Statistics departments at most schools have evolved so that statisticians graduate with strong programming skills and a decent foundation in CS, enabling them to perform tasks that statisticians were not traditionally trained for.
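To make the statistician's checklist a little more concrete, here is a minimal sketch of one of those standard practices: choosing a sample size for a given confidence level. It assumes the goal is to estimate a proportion (say, a conversion rate) from a simple random sample using the usual normal approximation. The function name and its parameters are my own illustration, not something taken from a particular package.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(confidence, margin_of_error, expected_proportion=0.5):
    """Smallest n for estimating a proportion within +/- margin_of_error.

    Assumes a simple random sample and the normal approximation.
    expected_proportion=0.5 is the most conservative (largest n) choice.
    """
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    if not 0 < margin_of_error < 1:
        raise ValueError("margin_of_error must be between 0 and 1")
    if not 0 < expected_proportion < 1:
        raise ValueError("expected_proportion must be between 0 and 1")

    # Two-sided critical value, e.g. roughly 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = expected_proportion
    n = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    return math.ceil(n)


if __name__ == "__main__":
    # 95% confidence, +/- 5 percentage points
    print(sample_size_for_proportion(0.95, 0.05))
```

With 95% confidence and a 5-point margin of error this gives the familiar 385 respondents. A real study would also have to account for things like finite populations and non-response, which this sketch ignores.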
Program/Project Manager: Look at the data provided by the professionals mentioned so far, align business with the findings, and influence the leadership to take appropriate action. This person possesses communication skills, presentation skills, and can influence without authority.
Ironically, a PM is influencing business decisions using the data and insights provided by others. If the person does not have a knack for understanding data, chances are that they will not be able to influence others to make the correct decisions.
Now, putting it all together.
The rise of online services has brought a paradigm shift in the software development life cycle and business iteration over successive features and products. Having a different data puller, analyst, statistician, and project manager is just not possible any more. Now the mantra is: ship, experiment, and learn, adapt, ship, experiment, and learn… This situation has resulted in the birth of a new role, a Data Scientist.
A data scientist should have the skills of all the individuals I have mentioned so far. In addition to the skills mentioned above, a data scientist should have rapid prototyping and programming, machine learning, visualization, and hacking skills.
Domain Knowledge and Soft Skills Are As Important As Technical Skills: The importance of domain knowledge and soft skills, like communication and influencing without authority, is severely underestimated by both hiring managers and aspiring data scientists. Insights without domain knowledge can mislead the consumers of those insights. Correct insights without the ability to influence decision making are just as bad as no insights at all.
All of what I have said above is based on my own tenure as a data scientist at a major search engine and later with the advertising platform within the same company. I learned that sometimes people asking the question may not understand what they want to know. This sounds preposterous yet it happens way too often. Very often a bozo will start digging into something that is not related to the issue at hand just to prove that he/she is relevant. A data scientist encounters such HIPPOs (Highly Paid Person’s Opinions) that are somewhat unrelated to the problem and are very often a big distraction from the problem at hand.
A data scientist should possess the right soft skills to manage situations where people ask irrelevant, distracting questions that are outside the scope of the task at hand. This is hard, especially in situations where the person asking the question is several levels up the corporate ladder and is known to have an ego. It is a data scientist’s responsibility to manage up and around while presenting and communicating insights.
Below is a summary of necessary skills a data scientist should possess, in my opinion:
Curiosity About Data and Passion For Domain: If you are not passionate about the domain or business, and if you are not curious about data, then it is unlikely that you will succeed in a data scientist role. If you are working as a data scientist with an online retailer, you should be hungry to crunch and munch from the smorgasbord (of data of course) to know more. If your curiosity does not keep you awake, no skill in the world can help you succeed.
Soft Skills: Communication and influencing without authority are necessary skills. Understand the minimum action that has the maximum impact. Too many findings are as bad as no findings at all. The ability to scoop information out of partners and customers, even from the unwilling ones, is extremely important. The data you are looking for may not be sitting in one single place. You may have to beg, borrow, steal, and do whatever it takes to get the data.
Being a good storyteller also helps. Sometimes the insights obtained from data are counterintuitive. If you are not a good storyteller, it will be difficult to convince your audience.
Math/Theory: Machine Learning. Stats and Probability 101. Optimization would be icing on the cake.
CS/Programming: You should know at least one scripting language (I prefer Python). It is necessary to possess decent algorithms and data structures skills in order to write code that can analyze a lot of data efficiently. You may not be a production code developer, but you should be able to write decent code. Database management and SQL skills are helpful. Knowledge of a statistical computing package is crucial; most people, including myself, prefer R. You should understand Excel or another spreadsheet program.
Big Data and Distributed Systems: Understand basic MapReduce concepts, Hadoop and Hadoop file system, and at least one language like Hive/Pig. Some companies have their own proprietary implementations of these languages. Knowledge of tools like Mahout and any of the XaaS, like Azure and AWS, would be helpful. Once again, big companies have their own XaaS, so you may be working on variants of any of these.
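If the basic MapReduce concepts are new to you, here is a toy sketch of the idea in plain Python, using the classic word count example. It runs in a single process, so it only illustrates the map, shuffle, and reduce steps; it is not how Hadoop itself is programmed, and the function names are my own.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an iterable of text lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["ship experiment learn", "ship experiment adapt"]))
```

In a real cluster the map and reduce calls run on many machines, and the shuffle moves data across the network between them. That is where most of the large-scale issues and debugging pain come from.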
Visualization: Possess the ability to create simple yet elegant and meaningful visualizations. Personally, R packages like ggplot2, lattice, and others have helped me in most cases, but there are other packages that you can use. In some cases, you might want to use D3.
Below is a high-level visualization of the skills needed to become a data scientist: