Getting Started with Hadoop Installation


There is a lot of buzz in the industry about Hadoop and big data. Hadoop is a software framework used for distributed computing over large sets of data. When you need to analyze large volumes of data, which can be a mix of complex and structured data, Hadoop can be used for targeted analysis. A key advantage of Hadoop’s architecture is that it can easily run on multiple machines that share no common hard disk or storage.

New to Big Data and Hadoop? Take a fundamental tutorial at Udemy.com.

Hadoop provides two core pieces: the Hadoop Distributed File System (HDFS) and API support for running MapReduce jobs. HDFS is similar to a UNIX file system, but data is stored across multiple machines. HDFS also has built-in techniques for handling machine outages and is optimized for throughput rather than latency. A MapReduce job splits the input into independent units, which are processed by map tasks in parallel. The framework then sorts the map outputs and passes them to the reduce tasks, which combine them into a smaller set of results.
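As a toy illustration of this flow, consider a word-count job over two made-up input lines (this is a conceptual sketch, not tied to any specific API):

Input lines:        "the cat sat"  "the dog sat"
Map output:         (the,1) (cat,1) (sat,1) (the,1) (dog,1) (sat,1)
After sort/shuffle: (cat,[1]) (dog,[1]) (sat,[1,1]) (the,[1,1])
Reduce output:      (cat,1) (dog,1) (sat,2) (the,2)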

Types of Installations and Prerequisites

A Hadoop installation generally involves the following components:

  • Hadoop Core
  • HDFS
  • MapReduce

Here are the installation prerequisites:

  1. Supported platforms: GNU/Linux is supported as both a development and a production platform. You can also use Win32 as a development platform, but it is not supported for production use.
  2. Software required: Since Hadoop is written in Java, Sun JDK v1.6 or a more recent version must be installed. You also need SSH installed, with sshd running (see the quick check after this list).
  3. Cygwin: This is a mandatory requirement if you are using Win32 as a development platform.
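A quick way to confirm that the Java and SSH prerequisites are in place is shown below; the exact version strings in the output will vary with your environment.

% java -version
% ssh -V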

Installing Hadoop

Before we can go ahead with configuring properties for the Core, HDFS, or MapReduce components, we need to install Hadoop in Standalone mode, Pseudo-Distributed mode, or Fully Distributed mode (cluster mode). In this article, we will cover the installation of these three components in Standalone and Pseudo-Distributed mode.

Download a stable release of Hadoop and unpack it on your machine using the following command (a.b.c stands for the version number).

% tar xzf hadoop-a.b.c.tar.gz

For Hadoop to work, it needs to know where the Java home directory is. The JAVA_HOME environment variable needs to be set using the following command.

% export JAVA_HOME=/usr/lib/jvm/java-6-sun
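Alternatively, JAVA_HOME is often set in Hadoop’s environment script rather than in your shell profile; assuming a typical 1.x layout, the line below goes into conf/hadoop-env.sh (the JVM path is illustrative).

export JAVA_HOME=/usr/lib/jvm/java-6-sun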

You also need to set the Hadoop install directory path and add it to your PATH using the following commands.

% export HADOOP_INSTALL=/home/user/hadoop-a.b.c
% export PATH=$PATH:$HADOOP_INSTALL/bin

Now, check whether Hadoop is installed correctly by typing % hadoop version at the command line. The Hadoop version information should be displayed.

Let us see the three different modes in which Hadoop can be run.

Learn more about the MapReduce component and take an online course at Udemy.com.

Standalone Mode

This mode is also known as local mode. Everything runs in a single JVM and no daemons run. (In UNIX, daemons are background processes that are always available to service requests.) Standalone mode is mostly used for developing and debugging MapReduce jobs, because they are easy to test and debug in this mode.

Pseudo-Distributed Mode

This mode replicates a cluster environment, with daemons running, on a local machine.

Fully Distributed Mode

This is the cluster mode, where the daemons run on a cluster of machines.

For your components to run in any of these modes, you need to set the appropriate properties in the configuration files and start the daemons. Though there are multiple settings, the key properties are fs.default.name, dfs.replication, and mapred.job.tracker, for the Core, HDFS, and MapReduce components respectively.
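As a rough guide (based on common Hadoop 1.x defaults; your release may differ), these properties typically take the following values in each mode:

  • fs.default.name – file:/// (Standalone), hdfs://localhost/ (Pseudo-Distributed), hdfs://namenode/ (Fully Distributed)
  • dfs.replication – not applicable (Standalone), 1 (Pseudo-Distributed), 3, the default (Fully Distributed)
  • mapred.job.tracker – local (Standalone), localhost:8021 (Pseudo-Distributed), jobtracker:8021 (Fully Distributed)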

Let’s understand what specific changes need to be made to start using the components in each of the modes.

Standalone Mode

Since standalone mode works with the default options, we do not need to provide any additional configuration. Hadoop daemons are not started in this mode.
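As a quick sanity check in standalone mode, you can run one of the example jobs that ships with Hadoop against the local filesystem. The jar name below (hadoop-examples-a.b.c.jar) and the choice of input files are illustrative and depend on your release.

% mkdir input
% cp $HADOOP_INSTALL/*.txt input
% hadoop jar $HADOOP_INSTALL/hadoop-examples-a.b.c.jar wordcount input output
% cat output/*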

Pseudo-Distributed Mode

To start working in this mode, you need to make some changes to the configuration files and place them in the config directory. This directory is read when we start the daemons with the --config option.
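For example, if your edited files live in a custom directory, you would point the start scripts at it as shown below (the path is illustrative; in most 1.x releases the start scripts accept --config as their first option):

% start-dfs.sh --config /path/to/custom-config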

Core component config file – The config file name for this component is core-site.xml and can be found in the config directory.

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

HDFS component config file – The config file name for this component is hdfs-site.xml and can be found in the config directory.

<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

MapReduce component config file – The config file name for this component is mapred-site.xml and can be found in the config directory.

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

Configuration for Hadoop Daemons

After you edit or create the config files, you need to start the Hadoop daemons, and starting the daemons requires SSH. Hadoop does not differentiate between pseudo-distributed and fully distributed mode; the only difference between the two is that pseudo-distributed mode uses localhost while fully distributed mode uses multiple servers. So, when you start the daemons in pseudo-distributed mode, you must ensure that you can SSH to localhost.

Perform the following steps at command line to configure SSH:

# Install SSH if it is not already present

% sudo apt-get install ssh

# Enable password-less login to localhost

% ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Now, try the following command to check that you can log in to localhost without a password.

% ssh localhost

If you plan on using the HDFS component, you need to format a new HDFS installation first. Type the following command at the command line:

% hadoop namenode -format

Starting and Stopping Daemons

For both the HDFS and MapReduce components, use the following commands to start the daemons.

% start-dfs.sh
% start-mapred.sh

To check whether the daemons started successfully, open the log files in the logs directory. You can also use Java’s jps command to see whether the daemons are running.
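For example, in pseudo-distributed mode the jps output would typically include the following daemons (process ids omitted; the exact list depends on your release):

% jps
NameNode
DataNode
SecondaryNameNode
JobTracker
TaskTracker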

If you want to stop the daemons, use the following commands:

% stop-dfs.sh
% stop-mapred.sh

The fully distributed mode requires multiple configurations that cannot be covered in a single article. Hadoop and big data are widely used for data analysis in a variety of sectors and have helped provide better services to customers through this analysis.

Take a course at Udemy.com to become an expert at Hadoop.