What machine learning tool should I learn?

What tool should I learn? This is one of the most common questions that aspiring data scientists will ask, and one which doesn’t seem to have easy answers. A Google search for “Data Science Tools” returns sites talking about R, Python, SQL, Clojure, Julia, Apache Spark, and more. It’s a dizzying array of answers, and to make it worse, there is little advice about which to use. I’m here to correct that.

Let’s start with a disclaimer.

Data science is closely related to traditional statistics and business analytics, and many of the techniques of the latter are very useful in the former. However, to do data science well, you have to learn to program, at least a little bit. Primarily graphical tools like Excel and the point-and-click interfaces of traditional analytics platforms such as SAS are great for data exploration and visualization, but they are limited in their ability to build predictive models or analyze terabyte-sized datasets.

As a brand new data science student, you should not invest in learning Clojure or Scala. These tools have steep learning curves. It’s likely that you’ll spend all your time on configuration and syntax, and little or none on actual analysis. Even if you already have a great deal of experience with Clojure or Scala, you will likely be better off with a different tool initially. Steer clear.

SQL is one of the most commonly cited languages on data science job postings, and eventually, you will want to develop practical SQL skills. However, SQL is a language of data selection and segmentation, not a language of analysis. If you want to start learning practical data science analysis, don’t start with SQL.
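To make "selection and segmentation" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table and its rows are invented purely for illustration. The first query selects rows, the second segments them into groups; deciding what those groups mean, or fitting a model to them, still happens in an analysis tool.

```python
import sqlite3

# In-memory database with a small, made-up table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("ann", "east", 120.0),
        ("bob", "west", 80.0),
        ("cal", "east", 200.0),
        ("dee", "west", 40.0),
    ],
)

# Selection: keep only the rows that match a condition.
large = conn.execute(
    "SELECT customer, amount FROM orders WHERE amount >= 100 ORDER BY customer"
).fetchall()

# Segmentation: split the rows into groups and summarize each group.
by_region = conn.execute(
    "SELECT region, COUNT(*), SUM(amount) FROM orders "
    "GROUP BY region ORDER BY region"
).fetchall()

print(large)
print(by_region)
conn.close()
```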

In truth, there are three practical tools for learning data science: R, Python, and Julia. Each of these is a fully featured scripting language with a flexible, interactive command line.
Let’s take them one by one.

Julia is the newest language of the three by more than a decade, and so has the least amount of available support. It is also still undergoing very active development, which makes it the least stable of the three. Unless you have a lot of experience working with new and unstable languages, avoid Julia. It’s a powerful platform with a lot of potential, but not yet optimal for learning.

Python has been one of the most popular programming languages in the world for several years. It is well developed, stable, and powerful. It has several mature scientific analysis packages in the SciPy stack (SciPy, NumPy, Matplotlib, and Pandas) which are well supported and have a lot of active users. Finding online help is fairly easy. The main machine learning package, scikit-learn, has a large number of available algorithms as well as good model evaluation frameworks. Python also has great text processing and file handling capabilities.

However, Python does have some significant drawbacks from an analytics point of view. The SciPy stack and scikit-learn are optimized for working with quantitative scientific data, consisting entirely of numbers. As a result, Python’s ability to handle data which cannot be encoded as numbers is restricted. Scikit-learn has no built-in support for Pandas’ Categorical or NumPy’s Object type, so categorical columns have to be converted to numbers by hand before modeling. This makes working with categorical data a hassle. In addition, the visualization libraries available for Python (Matplotlib, Pandas, and Seaborn are three) are limited compared to those available for R.

If you are already experienced with using Python’s SciPy stack, stick with what you know. If you are coming to data science without Python experience, think hard before choosing to start your learning here.
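The categorical-data hassle is easier to see with an example. The sketch below uses plain Python and made-up color labels to show one common way of turning categories into numbers, known as one-hot encoding. The `one_hot` helper is hypothetical and written only for illustration; in practice you would more likely reach for `pandas.get_dummies` or scikit-learn's `OneHotEncoder`, which do the same kind of conversion.

```python
def one_hot(values):
    """Encode category labels as rows of 0/1 indicators, one column per category."""
    categories = sorted(set(values))  # fixed, alphabetical column order
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column that matches this label
        rows.append(row)
    return categories, rows


colors = ["red", "green", "blue", "green"]  # invented example data
categories, encoded = one_hot(colors)

print(categories)
for label, row in zip(colors, encoded):
    print(label, row)
```

Each label becomes a row with a single 1 in the column for its category. This is the extra step a numeric-only modeling library asks of you before it can use a categorical column.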

That brings us to R.

R has been in use since the early 1990s and is an open-source implementation of the older S language. R is designed to be a statistical analysis tool, making the transition to data science a natural one. R has a massive array of available packages, many of them of high quality. It has a built-in categorical datatype (factor), and most libraries are compatible with it. As a result, handling categorical data in R is much less of a hassle than in Python. In addition, the lattice and ggplot2 graph libraries set R far ahead of the other languages for data visualization. R’s language is also more forgiving than Python’s.

For all its advantages, R has some disadvantages. R runs on a single core by default, and parallel processing requires extra packages and setup. Additionally, R loads all data into active memory, so the size of datasets you can analyze is limited by your computer’s RAM. For the aspiring data scientist, however, most beginner datasets will be less than the 1-2 GB size where R starts to have memory problems. Moreover, the streamlined design of the language allows you to focus on learning how to analyze data, tune predictive models, and improve your data science fundamentals. If you are not already a Python programmer, and you want to get started in data science as quickly as possible, choose R.
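If you want a rough sense of whether a dataset will fit in memory, some back-of-the-envelope arithmetic helps. The sketch below assumes an all-numeric table stored as 8-byte double-precision values, which is how R stores numeric vectors. The `headroom` factor of 3 is purely an assumption, standing in for the working copies that an analysis tends to create; it is not a documented rule, and the function names are hypothetical.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def numeric_table_bytes(n_rows, n_cols, bytes_per_value=8):
    """Approximate in-memory size of an all-numeric table of 8-byte doubles."""
    return n_rows * n_cols * bytes_per_value


def fits_in_memory(n_rows, n_cols, ram_bytes, headroom=3):
    """Rough check: allow `headroom` times the raw size for working copies.

    The headroom factor is an assumed rule of thumb, not a measured value.
    """
    return numeric_table_bytes(n_rows, n_cols) * headroom <= ram_bytes


# Example: 10 million rows by 20 numeric columns.
size = numeric_table_bytes(10_000_000, 20)
print(f"{size / GIB:.2f} GiB")
print(fits_in_memory(10_000_000, 20, ram_bytes=8 * GIB))
```

Text and categorical columns take different amounts of space, so treat the result as an estimate only.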

Since R is the fastest language to learn for most people, it is the primary language used in our bootcamp. By the end of the first day, people are already developing plots in ggplot2 and lattice. All of our hands-on labs have code in both R and Python, but we find that most students learn better with R.

In the end, the choice of which tool you start learning is not as important as it can seem. A good data scientist is not good because they use R, Python, Julia, or any other tool. They’re good because they understand how to attack a dataset, dissect it, and analyze it, and they know how to use that knowledge to choose and tune predictive models. These skills are independent of platform, and with enough dedication (and perhaps a little help), you will be able to learn them regardless of the tool you use.