
What is Hadoop? Introduction to the Hadoop Ecosystem and Architecture



Hey there folks ✋! In this article, I will be covering the essential basics of Big Data and Hadoop that I believe everyone should know. Today we are discussing "What is Hadoop?". Many people have the wrong idea about Hadoop, and I will address that in this article. Hadoop is regarded as the breakthrough in the Big Data field, and you will know why after reading this article.

So today we are going to discuss the following topics: what Hadoop is, a brief history of Hadoop, an overview of the Hadoop Ecosystem, and the Hadoop Ecosystem architecture.


WHAT IS HADOOP?


Whenever we discuss Big Data, we always associate "Hadoop" with it. That's because Hadoop is what made Big Data processing accessible to all. Hadoop was the breakthrough in the Data Analytics sphere, and it democratized the field!

Hadoop is a package of different frameworks and solutions that work together to help crunch Big Data. It is more accurate to consider it an ecosystem rather than a single tool or framework.

The Hadoop Ecosystem consists of various solutions, frameworks, and layers, each specialized in a different domain of data processing. To mention a few: the MapReduce framework, which provides a methodology to process big data efficiently; HDFS, the storage layer/file system of Hadoop, which stores huge chunks of data in a distributed manner; and YARN, which handles resource negotiation across the cluster.

So the important takeaways from what we have discussed so far are:
  • Hadoop is an ecosystem of tools and frameworks for Big Data processing
  • Hadoop was the first breakthrough that made Big Data processing accessible to all
  • Hadoop is not MapReduce; the latter is just one part of the Hadoop Ecosystem
  • Hadoop consists of various components like MapReduce, HDFS, YARN, etc.
Now, I hope everything is clear so far. Let's move forward with a little history, shall we?

Now, I can understand if some of you skip the history part. But let me tell you, these were moments in history that changed the way we live today! The developers who worked relentlessly day and night, and whose efforts democratized Big Data Analytics for us, deserve to be known and read about!

HISTORY OF HADOOP




Hadoop was developed by Doug Cutting, the creator of Apache Lucene (a widely used text search library). Doug Cutting, along with Mike Cafarella, went on to start a new sub-project under Lucene called Nutch.
Nutch is an open-source web search engine. Basically, it's a web crawler, of the kind we are familiar with today, which crawls billions of web pages and tries to index them for search.

Soon, they realized that this task was not going to be easy, as:

  • A web crawler algorithm is really complex to code
  • Infrastructure and staff expenses would be huge
In fact, Cutting and Cafarella estimated that a system supporting a one-billion-page index would cost approximately $500,000 in hardware, with a monthly running expense of $30,000.
But with the worthy goal of democratizing search engine algorithms, the Nutch project was started in 2002.
Hiccups are part of any innovation, and Nutch was no different. Soon, they realized that their architecture wouldn't scale to billions of web pages! But then, in 2003, Google published a whitepaper on GFS (Google File System), which was already running in production at Google. This whitepaper explained the architecture of Google's distributed file system.

The Nutch developers saw that this could solve Nutch's storage problem for indexing billions of web pages, and they soon developed an open-source implementation of GFS: the Nutch Distributed File System (NDFS).

In 2004, Google published another whitepaper, which introduced MapReduce to the world! By the end of 2005, the Nutch developers had a working MapReduce implementation and had moved most of Nutch's algorithms to MapReduce and NDFS.

In February 2006, Doug Cutting joined Yahoo!, and the storage and processing parts of Nutch were moved out to form an independent project under Lucene called Hadoop. Yahoo! provided the resources and a dedicated team to turn Hadoop into a system that ran at web scale!
In February 2008, Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster!
Hadoop was made a top-level project at Apache in January 2008. Since then, many organizations have been using Hadoop to process huge volumes of data, thanks to its simple and exceptional parallel processing capabilities.

AN OVERVIEW OF HADOOP ECOSYSTEM


The Hadoop Ecosystem, or Hadoop as we know it today, is a collection of different frameworks and tools that work together to store, process, and analyze big data.

(Image: The Hadoop Ecosystem)


This is the description from the Official Apache Hadoop page:
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures
The important points to take away from the above explanation of Hadoop are:

  • Hadoop uses distributed processing across clusters* of computers (commodity hardware) for processing data
  • Hadoop uses simple programming models like MapReduce
  • Hadoop can scale up to thousands of machines
    • These machines are commodity hardware (normal computers, just like the one you are using!), which may fail during processing
  • Hadoop can handle such failures at the application layer itself

* A cluster is nothing but a group or collection of computers connected to each other via cables and switches!

The main modules of the Apache Hadoop project are:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system for Hadoop. It is just like NTFS, the file system for Windows machines. The difference is that file systems like NTFS can be installed or used only on a single machine, while HDFS is installed on clusters of computers, i.e., it is distributed in nature!
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
  • Hadoop Ozone: An object store for Hadoop.
  • Hadoop Submarine: A machine learning engine for Hadoop.

I know it can be quite overwhelming to see all the above modules. But don't worry, you don't need to know about all of them. Our main focus will be on HDFS, YARN, and MapReduce, which you should definitely know about if you are on the data analytics path.

Apart from the above modules, the Hadoop Ecosystem is also made up of other related projects. Some of these projects are shown in the image above, and new projects keep getting added to the ecosystem.

In the next section, we will continue to explore some of the projects in the Hadoop Ecosystem by going through the architectural image, which will give you a holistic understanding of each project based on its role in the architecture.

THE HADOOP ECOSYSTEM ARCHITECTURE



(Image: Hadoop Ecosystem Architecture)

To understand the architectural representation of the Hadoop Ecosystem, we should probably start from storage. HDFS is the storage solution for distributed data in the Hadoop Ecosystem. No, it is not a database! Rather, it is a file system. Let us try to understand what HDFS is.

What is HDFS in Hadoop?


Hadoop Distributed File System or HDFS is the file system for clusters. It is the solution to store large datasets on a cluster.

You might have heard of NTFS on Windows PCs. File systems like NTFS are designed to be configured for a single machine only. But in the case of distributed computing, we needed a file system that can store data across thousands of nodes in a cluster.

HDFS spans multiple nodes in a cluster, but to a developer it presents itself as if you were accessing a folder on a single computer. Behind the scenes, data is stored across multiple machines in the cluster, but to you it is virtually like accessing a single system!

Now, that's a powerful and elegant component. It solved the problem of storing huge amounts of data that would otherwise be impossible to fit on a single hard disk. With the help of HDFS, we can store huge datasets in a cluster by spreading the data across multiple nodes!

Again, we will have an in-depth article on HDFS where you will get a better understanding of it.
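
To make this a little more concrete, below is a minimal, hedged sketch of interacting with HDFS from Python by shelling out to the standard `hdfs dfs` command-line client. It assumes a Hadoop client is installed and configured on your machine; the directory and file names are purely illustrative.

```python
# A sketch of basic HDFS operations driven from Python via the `hdfs dfs` CLI.
# Assumes the Hadoop client is on the PATH and configured to reach the cluster.
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and return its stdout as text."""
    result = subprocess.run(
        ["hdfs", "dfs", *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

hdfs("-mkdir", "-p", "/user/demo/input")                  # create a directory on the cluster
hdfs("-put", "-f", "localfile.txt", "/user/demo/input/")  # upload a local file into HDFS
print(hdfs("-ls", "/user/demo/input"))                    # list the directory, as if it were local
print(hdfs("-cat", "/user/demo/input/localfile.txt"))     # read the file back
```

Notice that, apart from the `hdfs dfs` prefix, this feels exactly like working with an ordinary file system, even though behind the scenes the file is split into blocks and replicated across the nodes of the cluster.
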
Moving up the architectural diagram, you will find two components sitting above HDFS: Hadoop YARN and Apache HBase. As we already discussed, HDFS is the file system for Hadoop and not a database. A database has tables, views, etc., while a file system has directories/folders, files, etc. Apache HBase is the database in Hadoop. So let's try to understand HBase in Hadoop.

Introduction to HBase in Hadoop


  • HBase is the NoSQL column-oriented database solution in the Hadoop Ecosystem.
  • HDFS lacks the ability to provide random access to data. To access a small chunk of information in HDFS, one has to traverse the data sequentially.
  • HBase addresses this issue with a column-oriented approach that allows us to randomly access data stored on HDFS with near-real-time latency.
  • Thus HBase provides random, near-real-time access to data, which is a major gap in HDFS (a minimal sketch follows this list).
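
Here is a small, hedged sketch of what that random access looks like in practice, using the third-party `happybase` Python client. It assumes an HBase Thrift server is reachable on localhost and that a table named `users` with a column family `info` already exists; all names are illustrative.

```python
# A sketch of random, key-based reads and writes against HBase via happybase.
# Assumes the HBase Thrift server is running and a 'users' table with the
# column family 'info' has already been created.
import happybase

connection = happybase.Connection(host="localhost", port=9090)
table = connection.table("users")

# Random write: insert/update a single row addressed by its row key.
table.put(b"user-1001", {b"info:name": b"Alice", b"info:city": b"Kochi"})

# Random read: fetch exactly that row by key, with no sequential scan.
row = table.row(b"user-1001")
print(row[b"info:name"])  # b'Alice'

connection.close()
```

Contrast this with raw HDFS, where answering "what is user-1001's name?" would mean reading through the stored files sequentially.
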

What is YARN in Hadoop?


  • YARN, or Yet Another Resource Negotiator, is by definition a framework for job scheduling and cluster resource management.

    Now, MapReduce is just a programming methodology, a set of steps to process big data. But processing huge amounts of data in a distributed environment requires resources: CPU, RAM, and whatnot. MapReduce doesn't handle these. It is the duty of YARN, as the name suggests, to negotiate or arrange the required resources for efficient parallel processing of Big Data.
  • YARN works out how much RAM, how many CPU cores, and how many nodes need to be assigned to a MapReduce job for the data to be processed most efficiently. It then tries to assign the required resources, or else assigns the best it can!
YARN and MapReduce have to work together to process and analyze big data.
  • Earlier, in MapReduce version 1 (MR1), MapReduce itself was responsible for resource negotiation. With MR2, MapReduce is only a programming model for data processing, while YARN handles the resource negotiation.
  • YARN is a generic framework that can also work with non-MapReduce applications like Apache Spark (see the sketch after this list).
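
As a small illustration of what "resource negotiation" means to an application, here is a hedged PySpark sketch of a job asking YARN for containers. It assumes a working Hadoop/YARN cluster with Spark configured against it (e.g., HADOOP_CONF_DIR set); the memory and core values are illustrative assumptions, not recommendations.

```python
# A sketch of how a processing framework requests resources from YARN.
# PySpark is used as the example client; values below are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("yarn-resource-demo")
    .master("yarn")                           # let YARN schedule the job on the cluster
    .config("spark.executor.instances", "4")  # ask YARN for 4 executor containers
    .config("spark.executor.memory", "4g")    # 4 GB of RAM per container
    .config("spark.executor.cores", "2")      # 2 CPU cores per container
    .getOrCreate()
)

# YARN decides which nodes have free capacity and launches the containers there;
# if the full request cannot be satisfied, the job runs with what is available.
print(spark.sparkContext.uiWebUrl)
spark.stop()
```
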
Now, since we are familiar with the Storage and Resource Management solutions of the Hadoop Ecosystem, let's move up the diagram to the Data Processing solutions.

This is a very active layer in the Hadoop Ecosystem and one of the most important ones too! New frameworks are introduced in this layer every year, each solving different issues and improving on the older frameworks.

Hadoop MapReduce was the first to be added to this layer, followed by Apache Tez and now Apache Spark. Apache Spark currently leads this layer in the industry with its innovative, state-of-the-art programming model.

A Gentle Overview of Hadoop MapReduce


"Hadoop Mapreduce", this is quite a common term associated with Big Data. I'm sure, even if you have done some research into Big Data, Hadoop Mapreduce will be one of the first important terms that you come across.

Many people confuse the term Hadoop with MapReduce. They are often regarded as the same by beginners and novices in Big Data. I want you to drop that idea entirely!

MapReduce is a part of the Hadoop Ecosystem and not Hadoop itself!

Hadoop MapReduce is a programming methodology or framework for implementing Parallel processing of huge datasets or Big Data. In simple terms, Hadoop MapReduce is the blueprint for processing huge datasets in a parallel/distributed fashion!

Now, the main point is that Hadoop MapReduce helps process data in a distributed environment. To understand this better, let me give you a foolish but useful example!

Let us think that you are a baby ant. You are just starting with your first baby steps. Now, Ants always work as a team or a group and not as an individual. But someone needs to teach you that first! Come on! You are just a little baby!

Now, your mother Ant teaches you how to work as a team to achieve more, than if you would have otherwise done individually. 

In the above scenario, your mother can be considered as Hadoop, and the advice she gave you to learn teamwork is MapReduce! Here, TEAMWORK represents distributed computing, and the advice or methodology your mother provided represents MAP-REDUCE.

MapReduce can be programmed in various languages, but Java is most commonly used for this purpose.  I'm not going too deep into MapReduce details since this is just an overview. We will discuss Mapreduce in-depth in the coming articles!

Basically, MapReduce consists of two phases :
  • Map Phase

    In this phase, the data is prepared for the reduce phase. MapReduce works on the concept of key-value pairs, as the name Map suggests (if you are aware of a Map in Java or a dictionary in Python).
    In the map phase, the data is arranged as key-value pairs through some internal processes, which would be overkill to discuss in this article. These key-value pairs become the input to the reduce phase.

  • Reduce Phase

    In the reduce phase, the final aggregation logic is applied to the map phase output to calculate the final desired result (a minimal sketch follows below).
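
Here is a tiny, purely local Python simulation of the two phases for the classic word-count problem. This is not real Hadoop code; on a cluster, the framework would run the map and reduce tasks in parallel across many nodes, but the flow of key-value pairs is the same.

```python
# A local simulation of the map -> shuffle -> reduce flow for word count.
from collections import defaultdict

documents = [
    "big data needs big storage",
    "hadoop stores big data",
]

# Map phase: each input record is turned into (key, value) pairs.
def map_phase(record):
    for word in record.split():
        yield (word, 1)

# Shuffle: the framework groups all values belonging to the same key together.
grouped = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        grouped[key].append(value)

# Reduce phase: the aggregation logic (here, a sum) is applied per key.
def reduce_phase(key, values):
    return key, sum(values)

word_counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(word_counts)  # {'big': 3, 'data': 2, 'needs': 1, 'storage': 1, 'hadoop': 1, 'stores': 1}
```
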
If all the above information just went over your head, don't worry! We have a dedicated article for MapReduce queued up already in our pipeline!

If you even understood a gist of what I have tried to explain, then that's great! 

Apache Spark in Big Data


  • Apache Spark is the current industry leader among the big data processing frameworks. 
  • It is based on the concept of in-memory computation: instead of writing intermediate processing results to disk, Spark makes use of RAM to store intermediate results.
    • Since the I/O speed of RAM is far higher than that of disk, Spark can achieve a significant speed boost, up to 100x for some workloads, compared to MapReduce.

      (Image: Apache Spark vs Hadoop MapReduce)
    • As shown above, Spark saves a lot of slow disk I/O by making use of the cluster's distributed RAM.
  • Apache Spark is based on the concepts of RDDs and DataFrames; a Spark DataFrame is similar to a dataframe in Pandas.
  • Apache Spark provides numerous functions, which are separated into two categories: Transformations and Actions.
    • Transformations are those functions that transform the data into another form like the filter() transformation. As the name suggests,  it filters the data based on some conditions.
    • Actions are those functions that return some results by aggregating the data when applied. Some examples are count(), sum(), etc.
  • Apache Spark also uses the concept of Lineage Graphs to attain fault tolerance.
    • It remembers the transformations that were applied to the data (or dataframe) by forming a lineage graph. Whenever an error occurs, it just has to trace back the lineage graph and recompute the lost data.
  • Another concept that Apache Spark uses is called Lazy Evaluation. In this concept, Spark actually does not perform any transformations on the data until an Action is called.
    • So, it basically keeps on building the lineage graph for all the transformations you are performing.
    • It is only when you invoke an Action that Apache Spark starts to execute the transformations in the lineage graph, after performing some internal optimizations through Spark's optimization engine.
  • Apache Spark has APIs in most of the popular languages like Python, Scala, R, etc.
  • Apache Spark is a unified analytics engine: it provides libraries for SQL functionality through Spark SQL and DataFrames, machine learning capabilities through Spark MLlib, graph processing capabilities through Spark GraphX, and stream processing capabilities through Spark Streaming.
    • The striking point is that you can combine these libraries seamlessly in the same application (a minimal PySpark sketch follows this list).
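
Here is a minimal PySpark sketch of those ideas: transformations only build up the lineage, and nothing runs until an action is called. It runs in local mode, so no cluster is needed; the sample data is made up for illustration.

```python
# A sketch of transformations vs. actions and lazy evaluation in PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("spark-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 19), ("carol", 42)],
    ["name", "age"],
)

# Transformations: nothing is executed yet, Spark only records the lineage.
adults = df.filter(F.col("age") >= 21)
upper = adults.withColumn("name", F.upper(F.col("name")))

# Actions: these force Spark to optimize the lineage and actually run it.
print(upper.count())   # 2
upper.show()           # prints the transformed rows

spark.stop()
```

If you remove the two actions at the end, Spark never touches the data at all, which is lazy evaluation in a nutshell.
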
Now, I know it's a lot of information to digest if you are a beginner, but it gives you a good head start for the in-depth articles we will be posting here!

Let's move one level up again, to the analytics layer. This layer consists mainly of two components, Apache Hive and Pig. Both provide a high-level way to query and process data without writing raw MapReduce code.

Overview of Apache Hive


  • Apache Hive is a data warehousing solution in Hadoop
  • It was developed at Facebook to enable data analytics for data scientists who were not expert Java programmers and could not easily write MapReduce code.
  • Users can query, analyze, and process data using SQL-like syntax, avoiding any sort of complex programming.
  • A structure (schema) can be defined over files already existing in HDFS.
  • It is not a database. Rather Apache Hive is a data warehousing component.
  • Hive stores all its metadata inside a relational database like MySQL; this is called the Metastore.
  • Hive also has a main Warehouse directory residing in HDFS where all its data is stored.
  • The warehouse directory contains data in a regular directory/file layout.
  • Hive has database-like concepts such as databases, tables, and views, but all of these are just folders or files arranged in a specific layout inside the warehouse directory in HDFS.
    • For instance, whenever you create a database in Hive, a folder is created in the Warehouse directory with the name <<database_name>>.db
    • A table inside a database would be a folder inside this .db directory
    • The data of a table will be a file of a specified format inside the table directory.
      • While creating the table in Hive, we can specify the file on which the table is being created. The file can be of any format like CSV, txt, dat, parquet, etc
  • Hive uses a language called HiveQL which is very similar to regular SQL.
  • Hive can run on three execution engines: Apache Spark, Apache Tez, and Hadoop MapReduce
  • Behind the scenes, the Hive compiler converts the SQL query into a job for one of these three execution engines.
  • So, essentially, the user just has to provide the SQL; the SQL parsing, job creation, and scheduling on Apache Spark, MapReduce, or Tez are all handled by Hive, and the user is presented with the results in a tabular format (a minimal sketch follows this list).
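
To give a feel for this, below is a hedged sketch of running HiveQL-style statements through Spark SQL from Python. It assumes Spark is configured against a Hive Metastore (hence `enableHiveSupport()`) and that you can write to the warehouse directory; the database, table, and column names are illustrative.

```python
# A sketch of creating and querying a Hive table through Spark SQL.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-demo")
    .enableHiveSupport()   # talk to the Hive Metastore instead of a local catalog
    .getOrCreate()
)

# Behind the scenes, these statements become folders and files under the
# Hive warehouse directory in HDFS.
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")            # -> demo_db.db folder
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_db.sales (
        order_id INT,
        amount   DOUBLE
    )
    STORED AS PARQUET
""")                                                          # -> table folder + data files
spark.sql("INSERT INTO demo_db.sales VALUES (1, 250.0), (2, 99.5)")
spark.sql("SELECT COUNT(*) AS orders, SUM(amount) AS total FROM demo_db.sales").show()

spark.stop()
```

The same statements could equally be run from the Hive CLI or Beeline; the point is that you write SQL and the engine turns it into jobs and files for you.
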

Introduction to Pig


  • Similar to Apache Hive, Pig provides an abstraction over Hadoop MapReduce that helps programmers write far less code to perform MapReduce jobs.
  • Pig provides a Procedural data flow language called Pig Latin
Roughly speaking, around 10 lines of Pig Latin can replace about 200 lines of MapReduce code.
  • Behind the scenes, the Pig compiler converts Pig Latin scripts into MapReduce jobs.
  • Pig is more suitable for programmatic data pipelines, while Apache Hive is geared more towards reporting and ad-hoc query processing.
  • Pig helps programmers save time by writing less code through Pig Latin scripting, while Hive helps analysts and data scientists easily create reports and summarize huge datasets.
  • Apache Pig can work with any form of data: structured, semi-structured, or unstructured.
  • Though Apache Pig and Apache Hive fall into the same layer of the Hadoop Ecosystem, they have different use cases (a small Pig Latin sketch follows this list).
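
For a flavour of Pig Latin, here is a hedged sketch that writes the classic word-count script to disk and runs it with Pig in local mode from Python. It assumes the `pig` command-line client is installed and on the PATH; file names and paths are illustrative.

```python
# A sketch of running a Pig Latin word-count script in local mode from Python.
import pathlib
import subprocess

pig_latin = """
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'wordcount_out';
"""

pathlib.Path("wordcount.pig").write_text(pig_latin)

# '-x local' runs Pig against the local file system instead of a Hadoop cluster.
subprocess.run(["pig", "-x", "local", "wordcount.pig"], check=True)
```

Five short statements; on a real cluster, Pig would compile them into MapReduce jobs for you.
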

Overview of Zookeeper


Apache Zookeeper is a coordination service for distributed systems. For an application to run in a cluster in a distributed manner, synchronization and proper communication between different nodes of the system are quintessential.

Apache Zookeeper basically acts as a service inside a distributed environment that handles all the communication and information needed for the different nodes of a cluster to stay coordinated and synchronized while executing an application.

A few of the services provided by Zookeeper are:

  • Naming service: This service identifies the different nodes in a cluster by name. Similar to DNS.
  • Locking and synchronization service: Zookeeper locks data for modification while it is used by a node.
  • Configuration management: The latest and up-to-date configuration information of the system for a joining node.
  • Cluster management: Real-time updates of cluster information, such as a node joining or leaving, etc.
  • Highly reliable data registry: Availability of data even when one or a few nodes are down
  • Leader election: Zookeeper performs a leader node election for synchronization purposes. It also handles the re-election of the Leader node when a leader node goes down.

Basically, Apache Zookeeper is like a choir conductor who controls and coordinates a choir.
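
To make a few of these services concrete, here is a hedged sketch using the third-party `kazoo` Python client. It assumes a Zookeeper server is reachable at localhost:2181; the znode paths and values are illustrative.

```python
# A sketch of configuration, membership and locking primitives via kazoo.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Configuration management: store a piece of shared config that any node can read.
zk.ensure_path("/app/config")
zk.set("/app/config", b"batch_size=500")
value, stat = zk.get("/app/config")
print(value)  # b'batch_size=500'

# Cluster membership: an ephemeral znode disappears if this process dies,
# so other nodes can watch the parent path to track who is alive.
zk.create("/app/workers/worker-1", b"", ephemeral=True, makepath=True)
print(zk.get_children("/app/workers"))

# Locking and synchronization: only one lock holder runs the critical section.
lock = zk.Lock("/app/locks/reindex", "worker-1")
with lock:
    print("doing work that must not run concurrently")

zk.stop()
```
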

Oozie - An Introduction


Apache Oozie is a workflow scheduler for running jobs in a Hadoop distributed environment. Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions.

Oozie helps to create a pipeline of different jobs to achieve a bigger goal. For example, using Oozie, you can create a workflow where you tell Oozie to run a Hive script first, followed by a MapReduce job, followed by a Python script. The MapReduce job depends on a table created by the Hive script, and the final Python script requires both the Hive script and the MapReduce job to have succeeded.

So, as you can see, Oozie helped you create a pipeline for achieving your goal, instead of you manually running the Hive script and then executing the MapReduce job and the Python script.

Oozie can also help you schedule a job to run at a later point in time, and it can rerun a failed workflow from the point where it failed.

Basically, Oozie handles workflow orchestration, and hence it is tightly coupled to the Hadoop Ecosystem.


Now, we have covered most of the essential components of the Hadoop Ecosystem except the bottom layer: Sqoop and Apache Flume.

Both these components help Hadoop connect to data sources outside of the Hadoop ecosystem. Sqoop helps import and export data between Hadoop and traditional RDBMSs, while Apache Flume provides a mechanism to transfer huge amounts of streaming data, such as logs and events, from various sources into Hadoop.

So, both Flume and Sqoop help Hadoop in data transfer from external sources.
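
As a small illustration of the Sqoop side, here is a hedged sketch of importing a relational table into HDFS, driven from Python. It assumes the `sqoop` client is installed and a MySQL database is reachable; the connection string, credentials, table name, and paths are illustrative placeholders.

```python
# A sketch of a Sqoop import from MySQL into HDFS, launched via subprocess.
import subprocess

subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost:3306/shop",  # source RDBMS
        "--username", "etl_user",
        "--password-file", "/user/etl/.db_password",   # keep secrets off the command line
        "--table", "orders",                           # table to copy
        "--target-dir", "/data/raw/orders",            # destination directory in HDFS
        "--num-mappers", "4",                          # parallel map tasks doing the copy
    ],
    check=True,
)
```
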

Conclusion


We have discussed quite a lot of information regarding Hadoop and its ecosystem in this article. I have tried my best to keep it simple, trying not to overwhelm you!

By now, you should be able to define what Hadoop is, identify a Big Data scenario, identify the different components of the Hadoop Ecosystem, and choose the right Hadoop component for a use case.

It's quite a long article with a lot of foundational information about Hadoop. You might not be able to digest all of it in one go. Please read it again and try to visualize the scenarios with the help of the images provided in the article.

We discussed various important components of the Hadoop Ecosystem, like Apache Spark, Apache Hive, Sqoop, Flume, Zookeeper, Oozie, etc. A basic understanding of these components is essential for a good journey ahead.

So, what are your thoughts about this article? Let me know in the comments below! I would be glad to improve this article based on your suggestions, as this is a very important topic.

Till next time, I wish you the best in your journey to becoming a proficient data guru! 
