Apache Hadoop is a free, open-source, Java-based software framework used to store, maintain, and process large-scale data sets across clusters of commodity hardware. If you’ve applied for a job as an analytics engineer, a Hadoop developer, or even a Java engineer who might end up using Hadoop on the job, it’s worth reviewing some Hadoop interview questions in preparation.

Even if you’re still training to be a Hadoop developer, you can use this guide to review important points while you study. Don’t forget to check out a course like this Hadoop training tutorial for additional study help.

Why is Hadoop useful?

Hadoop is fault tolerant: when a node is lost, the system simply redirects the work to another node that holds a replica of the data and resumes processing. Hadoop is also schema-less and can absorb data of all types, sources, and structures, which allows for deeper analysis.

If you want to learn more reasons why Hadoop is useful, check out this course on working with big data.

Which directory does Hadoop install to?

Hadoop is installed in /usr/lib/hadoop-0.20/ (you can move into it with cd /usr/lib/hadoop-0.20/).

What are the four modules that make up the Apache Hadoop framework?

The four modules are Hadoop Common (the shared utilities and libraries), the Hadoop Distributed File System (HDFS), Hadoop YARN (resource management and job scheduling), and Hadoop MapReduce (the engine for parallel processing of large data sets).

Which modes can Hadoop be run in? List a few features for each mode.

Hadoop can run in three modes. Standalone (local) mode is the default: everything runs in a single Java process, the local file system is used instead of HDFS, and no daemons are started, which makes it convenient for development and debugging. Pseudo-distributed mode runs all of the daemons on one machine, each in its own JVM, so it simulates a cluster on a single host. Fully distributed mode spreads the daemons across a real cluster of machines and is how production deployments run.

Where are Hadoop’s configuration files located?

Hadoop’s configuration files can be found inside the conf sub-directory of the installation directory (for example, /usr/lib/hadoop-0.20/conf).

List Hadoop’s three configuration files.

Hadoop’s three main configuration files are core-site.xml, hdfs-site.xml, and mapred-site.xml.

What are “slaves” and “masters” in Hadoop?

In Hadoop, the slaves file lists the hosts that run the datanode and task tracker daemons. The masters file lists the hosts that run the secondary Namenode.

What is /etc/init.d?

/etc/init.d is a Linux directory that holds the scripts used to start, stop, and query system services (daemons). With a packaged Hadoop install, you use the scripts there to check the status of the Hadoop daemons and to see where they’re located.

What is a Namenode?

The Namenode sits at the center of the Hadoop distributed file system (HDFS) cluster. It manages the file system’s metadata and the datanodes, but does not store the data itself.

How many Namenodes can run on a single Hadoop cluster?

Only one Namenode process can run on a single Hadoop cluster. The file system will go offline if this Namenode goes down.

What is a datanode?

Unlike Namenode, a datanode actually stores data within the Hadoop distributed file system. Datanodes run on their own Java virtual machine process.
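To make the division of labor concrete, here is a small, hypothetical sketch (not from the article; the path you pass in is just an example) that asks the Namenode for a file’s block locations. The answer is pure metadata served by the Namenode; the blocks themselves live on the datanodes it names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]); // e.g. a file in HDFS; example argument only

        // The Namenode answers this metadata query; no file data is read.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Each block is physically stored on one or more datanodes.
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```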

How many datanodes can run on a single Hadoop cluster?

A Hadoop cluster can contain many datanodes, but each slave node runs only one datanode process.

What is job tracker in Hadoop?

Job tracker is used to submit and track jobs in MapReduce.

How many job tracker processes can run on a single Hadoop cluster?

Like the Namenode, there can only be one job tracker process running on a single Hadoop cluster. The job tracker runs in its own Java virtual machine process. If the job tracker goes down, all currently active jobs stop.

What sorts of actions does the job tracker process perform?

The job tracker accepts jobs submitted by client applications, asks the Namenode where the input data is stored, locates task tracker nodes with available slots near that data, and assigns the work to them. It then monitors the task trackers through heartbeats, resubmits tasks if a task tracker fails, and reports job progress and status back to the client.

Learn more about Hadoop with this course on the fundamentals of Apache Hadoop.

How does job tracker schedule a job for the task tracker?

When a client application submits a job to the job tracker, the job tracker asks the Namenode where the input data’s blocks live and then looks for task tracker nodes with free slots, preferring nodes on the same servers (or at least the same rack) as that data.

What does the mapred.job.tracker command do?

mapred.job.tracker is a configuration property, set in mapred-site.xml, rather than a command: it holds the host and port of the node running the job tracker process, and clients and task trackers read it to locate the job tracker.
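As a quick, hypothetical illustration (the class name is made up, and it assumes mapred-site.xml is on the classpath), you can read the property with the old-API JobConf class, which loads the mapred-*.xml resources:

```java
import org.apache.hadoop.mapred.JobConf;

public class WhereIsTheJobTracker {
    public static void main(String[] args) {
        // JobConf loads mapred-default.xml and mapred-site.xml from the classpath.
        JobConf conf = new JobConf();
        // "local" is the built-in default and means jobs run in-process,
        // with no job tracker at all.
        System.out.println(conf.get("mapred.job.tracker", "local"));
    }
}
```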

What is “PID”?

PID stands for Process ID.

What is “jps”?

jps is a JDK command that lists running Java processes; you use it to check whether your task tracker, job tracker, datanode, and Namenode daemons are running.

Is there another way to check whether Namenode is working?

Besides the jps command, you can also use: /etc/init.d/hadoop-0.20-namenode status.

How would you restart Namenode?

To restart the Namenode, you can run /etc/init.d/hadoop-0.20-namenode stop followed by /etc/init.d/hadoop-0.20-namenode start, or you can simply run stop-all.sh and then start-all.sh.

What is “fsck”?

fsck stands for File System Check. Running hadoop fsck / reports on the health of the files in HDFS, flagging problems such as missing, corrupt, or under-replicated blocks.

What are the port numbers for job tracker, task tracker, and Namenode?

The default web UI port number for the job tracker is 50030, for the task tracker it is 50060, and for the Namenode it is 50070.

What is a “map” in Hadoop?

In Hadoop, a map is the first phase of a MapReduce job. A map task reads data from an input location (its input split) and outputs a key/value pair according to the input type.

What is a “reducer” in Hadoop?

In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own.

What are the parameters of mappers and reducers?

The four parameters for a mapper are its input key, input value, output key, and output value types; in the classic word-count example these are LongWritable, Text, Text, and IntWritable.

The four parameters for a reducer are likewise its input key, input value, output key, and output value types. The reducer’s input types must match the mapper’s output types, so in word count they are Text, IntWritable, Text, and IntWritable.
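These four parameters appear directly as the generic type arguments on the Mapper and Reducer classes. Here is a minimal word-count sketch (illustrative only, not taken from the article) using the org.apache.hadoop.mapreduce API:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<input key, input value, output key, output value>
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit one (word, 1) pair per token in the input line.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer<input key, input value, output key, output value>;
// the reducer's input types must match the mapper's output types.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```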

Is it possible to rename the output file, and if so, how?

Yes, it is possible to rename the output file, typically by using the MultipleOutputs class (or the older MultipleOutputFormat), which lets a job write to named output files instead of the default part-* files.
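As a rough sketch of how that looks with MultipleOutputs (the class, named output, and file prefix below are made up for illustration):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class RenamedOutputReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> outputs;

    // In the job driver, register the named output once:
    // MultipleOutputs.addNamedOutput(job, "counts",
    //         TextOutputFormat.class, Text.class, IntWritable.class);

    @Override
    protected void setup(Context context) {
        outputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text word, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Writes to files named counts-r-xxxxx instead of part-r-xxxxx.
        outputs.write("counts", word, new IntWritable(sum), "counts");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        outputs.close();
    }
}
```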

List the network requirements for using Hadoop.

Hadoop needs an SSH server (sshd) running on every node and password-less, key-based SSH from the master to all slave and secondary machines, so that the cluster scripts can start and stop the daemons remotely.

Which port does SSH work on?

SSH works on the default port number, 22.

What is streaming in Hadoop?

As part of the Hadoop framework, streaming is a feature that lets engineers write MapReduce jobs in any language, as long as that language can read from standard input and write to standard output. Even though Hadoop is Java-based, the chosen language doesn’t have to be Java. It can be Perl, Ruby, etc. If you want to customize MapReduce itself, however (for example, with your own input formats or partitioners), Java must be used.

Hadoop is Java-based, remember, so it’s pretty useful to know Java if you want to work with the framework. Check out this guide on learning Java for more information.

What is the difference between Input Split and an HDFS Block?

InputSplit and HDFS block both refer to dividing up the data, but an InputSplit is the logical division used to hand work to map tasks, while an HDFS block is the physical division of the data as it is stored on the datanodes.
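The sketch below (illustrative only; the class name and sizes are arbitrary) shows why the two are independent: the split size is a per-job setting read by FileInputFormat, while the block size is a property of how HDFS stored the file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitVersusBlock {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "split-vs-block");

        // Logical division: the split size only controls how FileInputFormat
        // carves the input into map tasks; no bytes on disk are touched.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

        // Physical division: the HDFS block size (dfs.block.size in 0.20-era
        // Hadoop) is fixed when a file is written and determines how the
        // datanodes actually store its bytes.
        System.out.println("HDFS block size: "
                + conf.getLong("dfs.block.size", 64L * 1024 * 1024));
    }
}
```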

What does the file hadoop-metrics.properties do?

The hadoop-metrics.properties file controls metrics reporting in Hadoop, i.e., where each daemon sends its performance metrics (for example, to a file or to a monitoring system such as Ganglia).


You can prepare for your analytics engineering interview with the above Hadoop interview questions, or check out this Apache Hadoop essentials training course for additional reference material and guidance. If you’re looking for a more expert-level approach, check out this advanced Hadoop course.

For more on data analytics, check out this beginner’s data analytics course, or this data analytics training course for all skill levels.

