Hadoop Python: Extending Hadoop High Performance Framework with Python API

hadoop pythonHadoop is an Apache software development framework for a clustering storage and large-scale processing of data-sets in multiple hardwares. Hadoop was created in 2005 for Nutch search engine in Apache to enhance its search capability across multiple servers. Hadoop is aimed as framework that enables High Performance Computing through distributed computing.

In order to master Hadoop you will need to learn Hadoop and get the basic idea of its framework and capabilities.

The idea of High Perfomance Computing across the network was first implemented in Beowulf project in 1994. The Beowulf project is a cluster computer providing a multi-computer architecture that able to perform parallel computations across the network cluster in the LAN.

Hadoop Framework

The beginning of Hadoop was initiated as Doug Cutting and Mike Cafarella created a clustering functionality for the Nutch search engine project in 2005. Nutch is an open source search engine project initiated by Apache.  Hadoop was written entirely in Java. With Hadoop, Nutch was expected to be able to provide a faster search engine through distributed computing in a cluster.

Hadoop was named after a toy elephant belong to Cafarella’s son. Hadoop was written entirely in Java and comprises of 4 core modules, which are:

  • Hadoop Common – libraries and utilities that contains necessary Java Archive (JAR) file and scripts required to run Hadoop
  • Hadoop Distributed File System (HDFS)  –  a distributed file-system that stores data on multiple machines in the cluster. The distributed file system provides the highest aggregate bandwidth across the cluster.
  • Hadoop YARN – a resource-management platform that manage the computing resources on the cluster and assigning tasks and scheduling application to the clusters
  • Hadoop MapReduce – a programming model for large scale data processing.

One key feature in Hadoop is its MapReduce that enables Hadoop to work differently than other parallel distributed computing and provides Hadoop the capability to handle data-intensive computing tasks which in the traditional parallel computing will cause a bottleneck in distribution. However, Hadoop treat its nodes as big computing resources. This MapReduce is strengthened with the Hadoop Distributed File System with its triple data replication.

big data

In HDFS, when we copy a huge file in the HDFS file system, HDFS will break the files into chunk of data at the size of 64 MB, and the chunk is triplicated (replicated three times) to provide reliability. Then each of the chunks are distributed to various nodes in the Hadoop cluster. Since the data is replicated three times, therefore the 64 MB chunk is available on three independent nodes. Even though the chunk is physically distribute, all of our interaction with the file on HDFS looks like it is still the same single file we copied initially.  Hadoop Distributed File uses TCP/IP communication layer to communicate with its nodes.

MapReduce on top of the HDFS is a technology that gives us easy to use data distribution, replication, and automatic recovery. The process consists of two steps: Map and Reduce.

Map step is a step that transforms our input data into a series of key-value pairs to be analyzed in the Reduce step. During the Map step, data processing will search across multiple nodes and collecting the requested data and send the result to the Reduce step.

Reduce step is initiated after the data obtained in the Map step, then reducer sort the data  based on their key-value pairs and be given the reducer. The reducer then calculate the data that shares common key, where duplicated data is then sent back to HDFS and data is presented to the application requesting the data.

One Hadoop clusters consist of one master and multiple worker nodes. The master node has  JobTracker, TaskTracker, NameNode and DataNode. A worker node serves as DataNode and TaskTracker.

For a more comprehensive tutorial on HDFS and MapReduce, learning the advanced course of Hadoop will give us a hands on training of writing the MapReduce program and working with the Hadoop tools.


Python is a high-level programming language which focused on code readability as its philosophy. Python has a straightforward syntax which enable programmers to write less line of code compared to another programming language.

Udemy has an acclaimed Beginner Python course that has brought thousands of students success who want to learn about writing a Python program quickly and also creating custom modules and libraries.

Python was first initiated in 1989 by Guido van Rossum. Python has the capability to interface with Amoeba operating system and mange Amoeba exception handling as well. The philosophy of Python is a multi-paradigm programming language. It can be used for object-oriented, functional programming or procedural. Python has a dynamic type system and automatic memory management with comprehensive standard library.

If you want to know Python deeper, take the Ultimate Python Programming course where you can learn more about Python and understand all of its features. The courses also provide you with the verifiable certificate of completion for those who undertake this course.As an interpreter, the design of Python is aimed to be highly extensible, we can embed Python code in an existing application as programmable interface. Equipped with its PyPy, an interpreter and just-in-time compiler, Python is able to fulfill its extensibility design philosophy.

Learn more about extensibility of Python in Intermediate Python course at Udemy.

Hadoop Python

Hadoop is working well with Java, for every High Performance Computing needs in Java, Hadoop provides its solution. Hadoop also works well for C and C++. Hadoop provides every API needed to have distributed computing, and the API is distribnuted along with Hadoop. Unfortunately, Python needs a little adjustment to work on Hadoop.

In order to achieve HPC in Python, a sourceforge project called PyDoop has already initiated to provide Hadoop pipe application which written in Python. Using PyDoop, we can utilize the Hadoop standard library and utilities along wth Hadoop Distributed File System and Hadoop MapReduce. As a pipe application, the MapReduce API of Python is working as a wrapper for the Hadoop API, orginally written in C++.

With PyDoop, we can utilize the MapReduce in Hadoop with its map() and reduce() function and the Reducer class in Hadoop. When Python application initiated the mapper, Hadoop will instantiated the mapper based on its Python input and send the result to reducer that will of calculate key-value pairs given by mapper.

The following snippet is an example of enabling mapper and reducer class for a wordcount program:

#!/usr/bin/env python
import pydoop.pipes as pa
class Mapper(pa.Mapper):
  def map(self, context):
    words = context.getInputValue().split()
    for w in words:
      context.emit(w, "1")
class Reducer(pa.Reducer):
  def reduce(self, context):
    s = 0
    while context.nextValue():
      s += int(context.getInputValue())
    context.emit(context.getInputKey(), str(s))
if __name__ == "__main__":
  pa.runTask(pa.Factory(Mapper, Reducer))

Notice the line number 4 and 11, that we call the Mapper and Reducer class using the pydoop pipe application (pa).

In order to run a Hadoop-enabled Python program with Hadoop, execute the following command (replace the ProgramName with your program file name) :

hadoop fs -put ProgramName{,}

hadoop fs -put input{,}

hadoop pipes \

-D hadoop.pipes.java.recordreader=true \

-D hadoop.pipes.java.recordwriter=true \

-program ProgramName -input input -output output

The program will run using the MapReduce. Therefore, we can have our Python API for Hadoop for every time we need to run Python application with Hadoop.

Click here to learn more about Python for web programming which provides a hands-on guide to start your web project using Python.

Hadoop also provides a certification for Hadoop Developer, and you can enroll the course to become Hadoop Certified Developer with this Hadoop tutorial.