R, Hadoop, and How They Work Together


Apache Hadoop is an open-source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware.

R is a programming language and software suite for statistical computing, data analysis and data visualization. It has strong graphical capabilities and is highly extensible, with object-oriented features. At its heart, R is an interpreted language that comes with a command-line interpreter and is available for Mac, Windows and Linux machines.

If you work in predictive modelling or statistics, R offers many benefits. In the number of packages available for applied statistics, R is essentially unrivaled. R can also handle some tasks that you previously needed other programming languages for, which is especially useful if you normally code in a different language and are using R for the first time.

Hadoop and R are a natural match and complement each other well for the analysis and visualization of big data.

5 Ways Hadoop and R Work Together

There are five different ways of using Hadoop and R together:

  1. Hadoop Streaming: A utility that lets users develop and run MapReduce programs in languages other than Java.
  2. HadoopStreaming: Developed by David Rosenberg, HadoopStreaming is an R package of utilities that makes Hadoop Streaming easier to use for R users.
  3. ORCH: The Oracle R Connector for Hadoop. It can be used on the Oracle Big Data Appliance or on non-Oracle Hadoop clusters.
  4. RHIPE: Short for R and Hadoop Integrated Programming Environment, RHIPE provides techniques designed for analyzing large sets of data.
  5. RHadoop: Provided by Revolution Analytics, RHadoop is a great open-source solution for using Hadoop and R together. RHadoop is bundled with four primary R packages for analyzing and managing data in the Hadoop framework.
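
To make the first option more concrete, here is a minimal sketch of the contract that Hadoop Streaming relies on: the mapper and the reducer are ordinary programs that exchange tab-separated key/value text records, and the framework sorts the records by key in between. The sketch is in Python rather than R and runs without Hadoop. The function names (mapper, reducer, run_locally) are made up for illustration, and the in-memory sort is only a stand-in for the cluster's shuffle phase.

```python
from itertools import groupby


def record_key(record):
    """Return the key part of a 'key<TAB>value' record."""
    return record.split("\t", 1)[0]


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts of consecutive records that share a key."""
    for key, group in groupby(sorted_records, key=record_key):
        total = sum(int(record.split("\t", 1)[1]) for record in group)
        yield f"{key}\t{total}"


def run_locally(lines):
    """Stand-in for the cluster: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines), key=record_key)))


for record in run_locally(["R and Hadoop", "Hadoop and HDFS"]):
    print(record)
```

On a real cluster the mapper and reducer would be separate scripts reading standard input and writing standard output. Keeping them as plain functions here just makes the data flow easy to follow.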

For Hadoop newcomers who want to use R, here is how one R Hadoop system was built on Mac OS X in single-node mode.

Hadoop Installation

The RHadoop setup described here uses three packages: rmr, rhbase and rhdfs. The rmr package provides Hadoop's MapReduce functionality in R, rhbase provides management of the HBase database from within R, and rhdfs provides management of HDFS files from within R.

The first step is to install Hadoop: download hadoop-1.1.2.tar.gz and unpack it. Next, set JAVA_HOME by adding this line to conf/hadoop-env.sh:


After this step, enable Remote Login so that you can log in to your own machine. Go to System Preferences and, in the internet and network section, click Sharing. In the services list, check 'Remote Login'. For extra security, you can also select the 'Only these users' button and then choose the hadoop user.

You can also set up self-login and Remote Login by adding this line to conf/hadoop-env.sh:


Run Hadoop

To run Hadoop, first go to the Hadoop directory by typing cd hadoop-1.1.2, then start Hadoop with bin/start-all.sh. Right after starting it, check that Hadoop is running with jps. You can then test it with a few examples, such as word count or estimating pi.

Example 1: Word Count

This code should return a list of words and their frequencies. It begins by copying the 'conf' directory to 'input' and then runs the distributed 'grep' to look for the pattern 'dfs[a-z]+', which matches strings that begin with 'dfs'. To get more results, change the pattern to 'df[a-z]+' or 'd[a-z]+'.
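
Independently of Hadoop, the effect of loosening the pattern can be tried locally. The following is a minimal Python sketch, not the Hadoop example itself. The helper name grep_count and the sample lines are made up for illustration and only stand in for the copied configuration files.

```python
import re
from collections import Counter


def grep_count(lines, pattern):
    """Count every substring that matches `pattern` across all lines."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        counts.update(regex.findall(line))
    return counts


# Hypothetical lines standing in for the copied configuration files.
sample = [
    "<name>dfsadmin</name>",
    "<value>dfsadmin and dfsnode</value>",
    "<name>data</name>",
]

narrow = grep_count(sample, r"dfs[a-z]+")  # only strings starting with 'dfs'
broad = grep_count(sample, r"d[a-z]+")     # any string starting with 'd'

print(narrow.most_common())
print(broad.most_common())
```

The broader pattern returns everything the narrow one does, plus other strings that start with 'd', which is why it produces more results.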

Here is the code:


Example 2: Making Calculations to Get Pi

In this code, the first argument (10) is the number of maps and the second argument is the number of samples per map. Setting a larger value for the second argument gives a more accurate value of pi, but takes more time to run:
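
The trade-off between the two arguments can be illustrated without Hadoop. The sketch below is a simplified stand-in, not the Hadoop example's own code: Hadoop's pi example uses a quasi-Monte Carlo method, while this version uses plain pseudo-random sampling. The function names and the fixed seed are assumptions made for illustration.

```python
import random


def count_inside(samples, rng):
    """One 'map' task: count random points that land inside the quarter circle."""
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside


def estimate_pi(maps, samples_per_map, seed=0):
    """The 'reduce' step: combine the per-map counts into a single estimate."""
    rng = random.Random(seed)
    inside = sum(count_inside(samples_per_map, rng) for _ in range(maps))
    return 4.0 * inside / (maps * samples_per_map)


print(estimate_pi(10, 100))     # few samples per map: rough estimate
print(estimate_pi(10, 10_000))  # more samples per map: closer, but slower
```

Each map works on its own share of the samples, so increasing the number of maps spreads the work out, while increasing the samples per map improves accuracy.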


Stopping Hadoop

Now that the Hadoop system is set up, you can stop it by typing bin/stop-all.sh. Next, install the RHadoop packages so that you can run jobs on the Hadoop system using R.

Installing R

With the method below, you can install multiple R versions on a Mac. This is especially useful if you have a newer R version installed and plan to try RHadoop with R 2.15.2. Using the procedure below, R 2.15.1 and R 2.15.2 run successfully on Hadoop.

Assume that you currently have R 3.0.0 on your Mac. In Applications, first rename R_64bit.app to R3.0.0_64bit.app and rename R.app to R3.0.0.app. Next, install R 2.15.2, then rename the R_64bit.app and R.app that you have just installed.

R Hadoop Installation

To avoid the 'make command not found' error when installing R packages from source, download and install GCC. Next, install Homebrew. Remember that the current user's account needs administrator privileges to install Homebrew, which you can get with 'su'. In the Mac OS X Terminal, run these commands:

  • su <administrator_account>
  • ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"
  • brew update
  • brew doctor

Begin by installing the R packages that RHadoop depends on:

install.packages(c("rJava", "bitops", "digest", "Rcpp", "RJSONIO", "reshape2", "plyr", "stringr", "functional"))

Environment variables for Hadoop also need to be set. You can set them from within R with Sys.setenv(), or in Terminal with the 'export' command.

The rhbase package will fail to install without Thrift:

brew install git
brew install pkg-config
brew install thrift

Install the RHadoop packages:

Download rmr2, rhbase and rhdfs, then install them by running this R code:


To make sure these R packages are installed correctly, check that each of them loads successfully with library().

Running RHadoop Jobs

You can now run an R job on Hadoop. Here is one example of R MapReduce code for a word count:




After running the above code, you should see a list of word counts. You have now set up your own R Hadoop system in single-node mode. Enjoy MapReducing with R.

Using RHadoop for Data Analysis

You can use RHadoop to analyze data. For example, suppose you want to determine how many countries have a GDP greater than Apple Inc.'s 2012 revenue of $156,508 million. The data first needs to be adjusted to suit the MapReduce algorithm. Here is the final format used for analyzing the data:


This is what the gdp.R script looks like:


R then initiates a Hadoop Streaming job to process the data using the MapReduce algorithm:


The output tells you how many countries have a lower and how many have a greater Gross Domestic Product (GDP) than Apple Inc.'s 2012 revenue. The results show that 138 countries have a lower GDP and 55 countries have a greater GDP than Apple:

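The logic of this comparison fits the MapReduce model naturally: the map step labels each country as 'less' or 'greater', and the reduce step counts the labels. Below is a minimal Python sketch of that idea, not the gdp.R script itself. The row layout (code, name, GDP in millions of US dollars), the function names and the sample rows are assumptions made for illustration, and the rows are invented, so the printed counts do not correspond to the real results above.

```python
from collections import defaultdict

APPLE_REVENUE_2012 = 156_508  # millions of US dollars, the figure quoted in the text


def map_country(row):
    """Map step: label one (code, name, gdp) row as 'greater' or 'less'."""
    _code, _name, gdp = row
    key = "greater" if gdp > APPLE_REVENUE_2012 else "less"
    return key, 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Made-up rows for illustration only; these are not real GDP figures.
rows = [
    ("AAA", "Country A", 900_000),
    ("BBB", "Country B", 42_000),
    ("CCC", "Country C", 156_508),
    ("DDD", "Country D", 3_100),
]

print(reduce_counts(map_country(row) for row in rows))
```

Note that a GDP exactly equal to the revenue figure is counted as 'less' in this sketch, because the comparison is strictly greater-than.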

As you can see, when you need strong visualization and data analytics features combined with the big data capabilities of Hadoop, RHadoop is worth a closer look. Its packages integrate R with HBase, HDFS and MapReduce, the key components of the Hadoop ecosystem.