Hadoop Ecosystem and Big Data

Before we can understand the Hadoop ecosystem, we need to explore Big Data and how it fits within the Hadoop framework. In 1955, the economist Cyril Northcote Parkinson wrote an essay in The Economist stating that "work expands so as to fill the time available for its completion." This became known as Parkinson's Law, and it has been applied in many areas to mean that everything will fill up the space given to it. In the computer world, Parkinson's Law is interpreted as "data expands to fill the space available for storage." Data on the scale of terabytes (TB) and petabytes (PB) has only become common in this century. As recently as five to ten years ago, we rarely saw storage reaching a terabyte. Thanks to advances in storage media, however, data sets now reach petabyte and even exabyte scale, while the terabyte has become the size of data we deal with every day.

Get a comprehensive introduction to Hadoop here.

In 2001, Gartner, an American information technology research and advisory firm, coined the term Big Data. This is the kind of data that, 14 years ago, the researchers at Gartner predicted would have to be dealt with in everyday business processes. Big Data is defined as a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. The time is near for new databases to arise that replace the tabular model of data storage in traditional database management systems and change SQL syntax into new forms of data manipulation. Due to its form and size, Gartner noted that "Big Data required new forms of processing to enable enhanced decision making, insight discovery and process optimization."

In that 2001 report, Gartner noted that the challenges and opportunities created by massive data growth are three-dimensional: increasing volume (the amount of data), velocity (the speed of data transfer), and variety (the range of data types and sources). These are known as the 3 Vs: Volume, Velocity, and Variety. Volume refers to the growing size of data, from megabytes to gigabytes, terabytes, and petabytes, and on to exabytes. Velocity refers to the speed of data transfer, a bit rate measured in bits per second, which in today's world is commonly expressed in megabits (millions of bits) per second. Variety refers to the range of data types that need to be stored: traditional structured data alongside the unstructured data, such as images, audio, video, and documents, that makes up as much as 80% of business data.

Learn all about Big Data with courses from Udemy.

If you have encountered Big Data challenges in your operation, it is a good idea to enroll in the Developing Big Data Strategy course. You can learn what Big Data is, which big data trends most affect a Big Data strategy, how to deal with privacy, ethics, and security, and some of the most important big data technologies available. In the end you will have learned how to develop a big data strategy and to start changing your organization into an information-centric company.

Big Data has also changed the computing process. Traditionally, computing was bound to the processor as its central processing unit. For decades, computers were built with more RAM, faster processors, and higher clock speeds. With the arrival of Big Data, computing is now about more than the processor. A processor acting as the central processing unit becomes a bottleneck when dealing with Big Data: at a transfer rate of 75 MB/second, moving a 100 GB file takes about 22 minutes. Computing is therefore now all about managing data.
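The 22-minute figure is easy to verify with a quick back-of-the-envelope check, sketched here in Java and assuming decimal units (100 GB = 100,000 MB):

public class TransferTime {
    public static void main(String[] args) {
        double fileSizeMb = 100_000.0;   // 100 GB expressed in MB (decimal units)
        double rateMbPerSec = 75.0;      // sustained transfer rate in MB/s
        double seconds = fileSizeMb / rateMbPerSec;
        System.out.printf("%.0f seconds, about %.0f minutes%n", seconds, seconds / 60);
        // prints: 1333 seconds, about 22 minutes
    }
}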

This is where Hadoop comes in.

Hadoop and Its Ecosystem
Hadoop is an open source framework for processing large amounts of data in batches. Hadoop was created to pipeline massive amounts of data through data processing to achieve excellent end results. The idea of Hadoop is to provide cost-efficient high-performance computing on cloud infrastructure. Hadoop is an Apache top-level project, licensed under Apache 2.0.

To learn the basics of Hadoop, check out this Hadoop Essential Training and learn the fundamental principles behind Hadoop and how you can use its power to make sense of your Big Data.

As a system that allows Big Data to be managed on commodity hardware working simultaneously in a parallel computing environment, Hadoop must be fault tolerant, meaning that it has to continue operating properly in the event of the failure of some of its components. Built with a modular approach, the core of Hadoop consists of:

  • Hadoop Common – libraries and utilities that provide common functionality for Hadoop
  • Hadoop Distributed File System (HDFS) – a distributed file system that stores data on multiple machines in the cluster. HDFS is designed to be fault tolerant and to run on commodity hardware
  • Hadoop MapReduce – a programming model for large-scale data processing in a parallel manner (a minimal word-count sketch follows this list)
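
To make the MapReduce model concrete, here is a minimal word-count sketch using the standard org.apache.hadoop.mapreduce API. It is illustrative only; class names such as TokenizerMapper and SumReducer, and the input/output paths taken from the command line, are placeholders rather than anything prescribed by Hadoop itself.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in each input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);   // combiner reuses the reducer logic
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The map tasks run in parallel wherever the input blocks live in the cluster, and the reduce tasks aggregate the intermediate (word, count) pairs, which is the essence of the parallel processing model described above.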

These three core components form the foundation of the four layers of the Hadoop ecosystem. Today a large number of ecosystem projects have been built around Hadoop, layered as follows:

Hadoop Ecosystem

  • Data Storage Layer
    This is where the data is stored in a distributed file system; it consists of HDFS and HBase column-oriented storage. HBase is a scalable, distributed database that supports structured data storage for large tables (see the HDFS sketch after this list).
  • Data Processing Layer
    This is where scheduling, resource management, and cluster management are handled. YARN job scheduling and cluster resource management, together with MapReduce, are located in this layer.
  • Data Access Layer
    This is the layer where requests from the Management Layer are sent to the Data Processing Layer. Several projects have been set up for this layer, among them: Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying; Pig, a high-level data-flow language and execution framework for parallel computation; Mahout, a scalable machine learning and data mining library; and Avro, a data serialization system.
  • Management Layer
    This is the layer that meets the user. Users access the system through this layer, which has components such as Chukwa, a data collection system for managing large distributed systems, and ZooKeeper, a high-performance coordination service for distributed applications.
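
As a small illustration of the storage layer, the sketch below writes a file to HDFS and reads it back through the HDFS Java API. It assumes a reachable cluster; the NameNode address hdfs://namenode:8020 and the path /user/demo/hello.txt are hypothetical placeholders, not values prescribed by Hadoop.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Normally read from core-site.xml; set inline here for clarity (placeholder address).
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/user/demo/hello.txt");

    // Write a small file; HDFS replicates its blocks across the cluster.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("Hello from the Hadoop storage layer".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back.
    try (FSDataInputStream in = fs.open(path)) {
      byte[] buffer = new byte[256];
      int read = in.read(buffer);
      System.out.println(new String(buffer, 0, read, StandardCharsets.UTF_8));
    }
  }
}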

To learn more about the existing and upcoming Hadoop ecosystem, we suggest you take the Hadoop Machine Learning and Hadoop Eco System course, in which you can learn about integrating Hadoop into the enterprise workflow and about some of the components in the ecosystem's layers, such as machine learning with Mahout, Hadoop ecosystem projects, Hive, Pig, and Oozie.

Hadoop also has its own certification, via the certification exam from Cloudera: Cloudera Certified Hadoop Developer (CCD-410). If you're looking to become a certified Hadoop developer, this Hadoop Certification course may be just what you need. In this course, you will get high-quality video with animations and pictorial representations to ease understanding, have the chance to write your own code in the MapReduce environment, and gain hands-on experience working on a Hadoop cluster.