
Big Data Scaling in R Using Hadoop and Spark


R is currently one of the most popular data science languages in the world. However, it has always had trouble scaling out to big data. What happened when your data grew beyond a couple of gigabytes? You packed it up and moved to something else: Python, Java, or Mahout, to name a few. Now it’s possible to stick with R from analysis all the way to production deployment, regardless of the data size.

Organizations like Apache, Revolution Analytics, Microsoft, and H2O showed us this year that distributed computing in R is possible. We’ll take a look at how to scale R to big data using Hadoop and Spark.

In this tutorial, we will show you Microsoft R Server running on a Hadoop or Spark cluster: R is installed on every node, along with distributed processing libraries that put every machine to work in parallel. We’ll show you how to run your native R code over SSH and how to get RStudio Server up and running on the cluster.

We’ll show you how to wrangle data stored in HDFS and build machine learning models from your large dataset. Then we’ll show you how to package that model and deploy it to an elastically scaled web service so that anyone can call it for predictions and insights.

What you’ll learn

  • Set up a Spark cluster with R installed (R Server)
  • Wrangle data that is inside HDFS using R
  • Build and deploy a machine learning model using R
Data Science Dojo
Phuc H Duong
Phuc holds a bachelor’s degree in Business with a focus on Information Systems and Accounting from the University of Washington.
