Hadoop has gained popularity ever since eCommerce websites began serving large numbers of users. Hadoop is a software platform that lets developers write and run applications that manipulate large amounts of data. It offers scalability, efficiency, and reliability: it can distribute data across multiple computers and process petabytes of data. A distinctive feature of Hadoop is that data is spread across a cluster of available computers, so it remains available even when one of the machines is down.
Handling large amounts of data calls for computer clusters: machines of similar configuration, tightly coupled over a dedicated network, sharing common resources such as a source directory. Given the volume of data Hadoop can process, running it on a cluster is something every eCommerce website wants to leverage. In this article, we will cover how to install a Hadoop cluster that can range from a few nodes to thousands of nodes.
- Linux is the supported platform for Hadoop development and production. If you want to use Win32 as a development platform, you can install the Hadoop archive directly on your system; to use Win32 as a production platform, however, you will have to use Cygwin for SSH support.
- Hadoop is written in Java, so to use it you need JDK 1.6 or later; Sun's JDK is preferred. Also, ensure that SSH is installed and that sshd is running on every node.
Installing Hadoop on cluster nodes
Since we are setting up Hadoop on a cluster, installation involves unpacking the Hadoop archive or installing it with the RPM Package Manager on every machine. Once Hadoop is installed on all machines in the cluster, one machine is designated the NameNode and another the ResourceManager; these are the master nodes. All the remaining nodes become slaves and act as both DataNode and NodeManager.
Configuring Hadoop for Clusters
You can set up a Hadoop cluster with either an HDFS configuration or a YARN configuration. Let us look at the configuration required in each case:
To use the Hadoop cluster with the HDFS configuration, modify the fs.defaultFS parameter in the $HADOOP_PREFIX/etc/hadoop/core-site.xml file. This parameter must point to the endpoint of the node running the NameNode daemon.
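As a sketch, assuming the NameNode daemon runs on a host named namenode.example.com (a placeholder hostname) and listens on the default port 8020, the relevant part of core-site.xml might look like this:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- hdfs://<NameNode hostname>:<port>; namenode.example.com is a placeholder -->
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
```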
To use the YARN configuration, we need to modify the $HADOOP_PREFIX/etc/hadoop/yarn-site.xml file so that every node can reach the ResourceManager. The relevant part of the file looks as follows:
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>resourcemanager</value>
  <description>The hostname of the RM.</description>
</property>
In the above code, resourcemanager stands for the hostname of the node running the ResourceManager daemon; every node in the cluster must point to the same value.
Starting the Hadoop cluster
Before starting the cluster, we need to make sure that all the nodes share the same endpoint configuration. Once this is confirmed, you can start the daemons on each node.
Execute the following commands on the command line over SSH.
Starting HDFS daemons
# You must format the NameNode directory. This needs to be done only once, and only on the NameNode node.
$HADOOP_PREFIX/bin/hdfs namenode -format
# Starting the NameNode daemon.
#This command needs to be executed on the NameNode node.
$HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode
# Starting the DataNode daemon
# This command needs to be executed on all slaves.
$HADOOP_PREFIX/sbin/hadoop-daemon.sh start datanode
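After starting a daemon, you can sanity-check it with jps, the JVM process-status tool that ships with the JDK. The exact list depends on the node: on the NameNode node you should see a NameNode process, and on a slave a DataNode process (plus a NodeManager once the YARN daemons below are started).

```shell
# Lists the Java processes for the current user; look for NameNode or DataNode
jps
```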
Starting YARN daemons
# Starting the ResourceManager daemon
# This command needs to be executed only on the ResourceManager node.
$HADOOP_PREFIX/sbin/yarn-daemon.sh start resourcemanager
# Starting the NodeManager daemon
# This command needs to be executed on all slaves.
$HADOOP_PREFIX/sbin/yarn-daemon.sh start nodemanager
Starting Hadoop daemons using scripts
Hadoop daemons can be started manually using the commands above. But in a cluster the number of nodes can run into the thousands, and starting each node by hand does not scale. Hadoop therefore provides scripts that start the daemons on all nodes for you. The only assumption these scripts make is that the ResourceManager node has SSH access to every node in the cluster, including itself. They also require that all the slave hostnames are listed in the $HADOOP_PREFIX/etc/hadoop/slaves file, one hostname per line.
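With two hypothetical slave hostnames, the slaves file is simply:

```
slave1.example.com
slave2.example.com
```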
Once these prerequisites are met, SSH to the ResourceManager node and run the following commands to start the daemons.
# Starting all HDFS services for the cluster
$HADOOP_PREFIX/sbin/start-dfs.sh
# Starting all YARN services for the cluster
$HADOOP_PREFIX/sbin/start-yarn.sh
Sometimes, when you execute these commands, you may get an incorrect JAVA_HOME variable error. To correct it, set the JAVA_HOME environment variable to point to the location where the JDK is installed.
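One common way to make the fix stick on every node, assuming the JDK lives under /usr/lib/jvm/java-6-sun (a placeholder path; adjust it to your installation), is to set JAVA_HOME in $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh, which the daemon scripts source at startup:

```shell
# hadoop-env.sh -- adjust the path to wherever your JDK is installed
export JAVA_HOME=/usr/lib/jvm/java-6-sun
```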
Bind exception when using clusters
The most common exception when using multiple machines with multiple ports is the BindException; a "port is in use" message is the typical symptom. Here are a few checks that can help you resolve it.
- Make sure no other application is using the port in question. Over SSH, run the fuser -v -n tcp <port number> command to find out which process, if any, owns the port.
- Make sure that the hostname of your node resolves to its internal IP address. This mapping can be found in the /etc/hosts file.
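For example, if a node's hostname is slave1 (a hypothetical name) and its internal address is 192.168.1.11, the /etc/hosts entry should map the hostname to that address rather than to the loopback address 127.0.0.1:

```
192.168.1.11   slave1.example.com   slave1
```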
Testing the cluster set up
You can test whether your cluster is working by running one of the sample applications packaged with the Hadoop installation, named DistributedShell. If you want to exercise all of your slaves, make sure tasks run on every slave by increasing the number of containers the sample application requests. The DistributedShell client can be invoked from any of the cluster nodes.
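As a sketch, the client can be asked to run the date command in ten containers; the jar path and <version> below are placeholders, so check your installation for the exact file name:

```shell
# Run `date` in 10 containers across the cluster; adjust the jar path and version
$HADOOP_PREFIX/bin/hadoop jar \
  $HADOOP_PREFIX/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-<version>.jar \
  org.apache.hadoop.yarn.applications.distributedshell.Client \
  -jar $HADOOP_PREFIX/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-<version>.jar \
  -shell_command date \
  -num_containers 10
```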
The result of this test is a set of time strings showing the current time in each container, corresponding to each slave.