Data Processing In Hadoop: Hadoop Components Explained [2024]

Updated on 22 November, 2022

With the exponential growth of the World Wide Web over the years, the volume of data being generated has also grown exponentially. Traditional relational database systems found it difficult to store and process data at this scale.

Moreover, the data was not only structured but also unstructured, such as videos and images, which relational databases are poorly suited to process. Hadoop came into existence to address these issues.

Before we dive into data processing in Hadoop, let us get an overview of Hadoop and its components. Apache Hadoop is a framework for storing and processing huge quantities of structured and unstructured data efficiently across a cluster of machines.

The pivotal building blocks of Hadoop are as follows:

Building Blocks of Hadoop

1. HDFS (The storage layer)

As the name suggests, Hadoop Distributed File System is the storage layer of Hadoop and is responsible for storing the data in a distributed environment (master and slave configuration). It splits the data into several blocks of data and stores them across different data nodes. These data blocks are also replicated across different data nodes to prevent loss of data when one of the nodes goes down.
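The splitting and replication described above can be sketched in a few lines of Python. This is a toy model, not HDFS code: the function names, the tiny block size, and the round-robin placement are assumptions made for illustration (real HDFS uses much larger blocks and a rack-aware placement policy).

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is a simplification chosen for this sketch.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
print(blocks)
print(placement)
print(readable_after_failure(placement, "dn1"))
```

Because every block lives on several nodes, losing any single node in this sketch still leaves each block readable.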

HDFS runs two main processes:

a. NameNode

It runs on the master machine. It stores the metadata of the file system: the names of all the files and where their blocks reside across the cluster. When a client application wants to perform an operation on the data, it contacts the NameNode, which responds with a list of DataNode servers where the required data resides.

b. DataNode

This process runs on every slave machine. It stores each HDFS data block as a separate file in its local file system; in other words, it holds the actual data in the form of blocks. It periodically sends heartbeat signals to the NameNode and serves requests to access the blocks it holds.
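The division of labour between the two processes can be illustrated with a small Python sketch. The classes `ToyNameNode` and `ToyDataNode`, the helper `read_file`, and the path `/logs/a.txt` are all hypothetical names invented for this example; they only mimic the idea that the NameNode keeps metadata while the DataNodes keep the bytes.

```python
class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where replicas live."""

    def __init__(self):
        self.file_blocks = {}      # path -> list of block ids, in order
        self.block_locations = {}  # block id -> list of DataNode names

    def register(self, path, block_id, datanodes):
        self.file_blocks.setdefault(path, []).append(block_id)
        self.block_locations[block_id] = list(datanodes)

    def locate(self, path):
        """Return [(block_id, [datanode names])] for a file, in block order."""
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


class ToyDataNode:
    """Holds the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


def read_file(namenode, datanodes, path):
    """Client read: ask the NameNode where blocks are, then fetch from DataNodes."""
    chunks = []
    for block_id, holders in namenode.locate(path):
        chunks.append(datanodes[holders[0]].read(block_id))
    return b"".join(chunks)


nn = ToyNameNode()
dns = {name: ToyDataNode(name) for name in ("dn1", "dn2")}
layout = [(b"hello ", ["dn1", "dn2"]), (b"world", ["dn2", "dn1"])]
for block_id, (data, holders) in enumerate(layout):
    for holder in holders:
        dns[holder].store(block_id, data)
    nn.register("/logs/a.txt", block_id, holders)

content = read_file(nn, dns, "/logs/a.txt")
print(content)
```

Note that the NameNode object never touches the file contents; it only answers the question of where each block can be found.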

2. MapReduce (The processing layer)

It is a programming model, implemented in Java in Hadoop, that runs on top of the Hadoop framework for faster processing of huge quantities of data. It processes the data in a distributed environment across many DataNodes, which enables parallel processing and faster execution of operations in a fault-tolerant way.

A MapReduce job splits the data set into multiple chunks, which are then converted into key-value pairs to be processed by the mappers. The raw format of the data may not be suitable for processing, so input compatible with the map phase is generated using InputSplit and RecordReader.

InputSplit is the logical representation of the data which is to be processed by an individual mapper. RecordReader converts these splits into records which take the form of key-value pairs. It basically converts the byte-oriented representation of the input into a record-oriented representation.
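As a rough illustration of the byte-oriented to record-oriented conversion, the sketch below turns one split into (byte offset, line) pairs, which is the shape of record Hadoop's default text input produces. The function name `line_record_reader` is made up for this example, and the sketch ignores lines that straddle split boundaries.

```python
def line_record_reader(split):
    """Yield (byte_offset, line) records from the bytes of one input split."""
    offset = 0
    for raw in split.splitlines(keepends=True):
        # The key is the position of the line in the split; the value is its text.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


records = list(line_record_reader(b"deer bear\nriver car\n"))
print(records)
```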

These records are then fed to the mappers for further processing. A MapReduce job primarily consists of three phases: the Map phase, the Shuffle phase, and the Reduce phase.

a. Map Phase

It is the first phase in the processing of the data. The main task in the map phase is to process each input record from the RecordReader and convert it into intermediate tuples (key-value pairs). The mappers store this intermediate output on the local disk.

The keys and values of these pairs can differ from the ones received as input from the RecordReader. The map phase can also contain combiners, which are also called local reducers. They perform aggregations on the data, but only within the scope of one mapper.
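A word count is the usual way to picture a mapper and a combiner. The sketch below is plain Python rather than Hadoop API code, and the names `map_words` and `combine` are assumptions for illustration: the mapper emits a (word, 1) pair per word, and the combiner sums those pairs within a single mapper's output.

```python
from collections import Counter


def map_words(offset, line):
    """Mapper: ignore the input key (the offset) and emit (word, 1) per word."""
    for word in line.split():
        yield word.lower(), 1


def combine(pairs):
    """Combiner: sum the counts per key within one mapper's output only."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


mapped = list(map_words(0, "Car car river"))
combined = combine(mapped)
print(mapped)
print(combined)
```

The combined output carries the same information as the mapped output in fewer pairs, which is what reduces the data sent on to the reducers.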

Because the computations are performed across different DataNodes, all the values associated with the same key must be sent to the same reducer. This task is performed by the partitioner, which applies a hash function to each key to decide which reducer receives the pair.

Hashing also tends to spread the keys evenly across the reducers, although heavily skewed keys can still overload one reducer. Partitioners generally come into the picture when we are working with more than one reducer.
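A hash partitioner boils down to "hash the key, take it modulo the number of reducers". The sketch below uses `zlib.crc32` as the hash so that results are stable between runs (Python's built-in `hash` for strings is randomized per process); that choice, and the name `partition`, are assumptions of this example rather than what Hadoop uses internally.

```python
import zlib


def partition(key, num_reducers):
    """Map a key to a reducer index in the range [0, num_reducers)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


keys = ["car", "river", "bear", "car", "deer"]
assignments = [(key, partition(key, 3)) for key in keys]
print(assignments)
```

Both occurrences of "car" receive the same index, so every value for that key ends up at the same reducer regardless of which mapper produced it.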

b. Shuffle and Sort Phase

This phase transfers the intermediate output obtained from the mappers to the reducers. This process is called shuffling. The output from the mappers is also sorted by key before it reaches the reducers. Sorting lets each reducer work through its input one key group at a time, which helps reduce the time required for the computations.

Because the keys are sorted, a reducer knows it has received every value for a key as soon as a different key arrives, and it can then run the reduce task on the previously received data.
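The effect of shuffling and sorting can be imitated in Python by merging the outputs of several mappers, sorting by key, and grouping adjacent pairs. The function name `shuffle_and_sort` and the sample data are hypothetical; a real cluster does this across the network and on disk.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(mapper_outputs):
    """Merge all mapper outputs, sort by key, and group the values per key."""
    merged = [pair for output in mapper_outputs for pair in output]
    merged.sort(key=itemgetter(0))
    # groupby only groups adjacent items, which is why the sort must come first.
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]


grouped = shuffle_and_sort([
    [("car", 2), ("river", 1)],   # output of one mapper
    [("bear", 1), ("car", 1)],    # output of another mapper
])
print(grouped)
```

The reliance of `groupby` on sorted input mirrors the point made above: once the key changes, all values for the previous key have been seen.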

c. Reduce Phase

The output of the map phase serves as the input to the reduce phase. Each key, together with the values associated with it, is passed to the reduce function, which performs the required operations to produce the desired result.

We can filter the data or combine it to obtain the aggregated output. For each key, the reduce function can emit zero or more key-value pairs. This result is written back to the Hadoop Distributed File System.
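The two uses mentioned above, aggregating and filtering, can be sketched as two small reduce functions in Python. Their names and the `min_count` threshold are assumptions for this example; the second function shows how a reducer may emit nothing at all for some keys.

```python
def reduce_sum(key, values):
    """Aggregate: emit exactly one pair per key with the summed values."""
    yield key, sum(values)


def reduce_frequent(key, values, min_count=2):
    """Filter: emit a pair only when the total reaches min_count, else nothing."""
    total = sum(values)
    if total >= min_count:
        yield key, total


grouped_input = [("bear", [1]), ("car", [2, 1]), ("river", [1])]
summed = [pair for key, values in grouped_input for pair in reduce_sum(key, values)]
frequent = [pair for key, values in grouped_input for pair in reduce_frequent(key, values)]
print(summed)
print(frequent)
```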

3. YARN (The management layer)

Yet Another Resource Negotiator is the resource managing component of Hadoop. There are background processes running at each node (Node Manager on the slave machines and Resource Manager on the master node) that communicate with each other for the allocation of resources. The Resource Manager is the centrepiece of the YARN layer: it manages resources among all the applications and passes on the requests to the Node Managers.

The Node Manager monitors the resource utilization of its machine, such as memory, CPU, and disk, and reports it to the Resource Manager. It runs on every DataNode and is responsible for executing the tasks there.
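The reporting and allocation loop can be pictured with a deliberately simple Python model. `ToyResourceManager`, its methods, and the "pick the node with the most free memory" rule are assumptions invented for this sketch; YARN's actual schedulers are considerably more elaborate and track more than memory.

```python
class ToyResourceManager:
    """Tracks free memory reported by each node and hands out allocations."""

    def __init__(self):
        self.free_mb = {}

    def report(self, node, free_mb):
        """Called on behalf of a Node Manager to report its free memory."""
        self.free_mb[node] = free_mb

    def allocate(self, needed_mb):
        """Pick the node with the most free memory that can fit the request."""
        candidates = [n for n, free in self.free_mb.items() if free >= needed_mb]
        if not candidates:
            return None  # no node can satisfy the request right now
        chosen = max(candidates, key=lambda n: self.free_mb[n])
        self.free_mb[chosen] -= needed_mb
        return chosen


rm = ToyResourceManager()
rm.report("node1", 4096)
rm.report("node2", 8192)

first = rm.allocate(6000)
second = rm.allocate(3000)
third = rm.allocate(3000)
print(first, second, third)
```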

The entire workflow for data processing on Hadoop can be summarised as follows:

  • InputSplit logically splits the data that resides on HDFS into several chunks. The InputFormat decides how the data is split.
  • The data is converted into key-value pairs by RecordReader. RecordReader converts the byte-oriented data to record-oriented data. This data serves as the input to the mapper.
  • The mapper, which is a user-defined function, processes these key-value pairs and generates intermediate key-value pairs for further processing.
  • These pairs are locally reduced (within the scope of one mapper) by the combiners to reduce the amount of data to be transferred from the mapper to the reducer.
  • The partitioner ensures that all the values with the same key are sent to the same reducer and that the work is spread across the reducers.
  • These intermediate key-value pairs are then shuffled to the reducers and sorted on the basis of keys. This outcome is fed to the reducers as input.
  • The reduce function aggregates the values for each key, and the result is stored back into HDFS using RecordWriter. Before the data is written back to HDFS, the OutputFormat decides the format in which it should be written.
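The workflow above can be traced end to end with a single-process word count in Python. This is only a simulation under stated assumptions: the function `run_word_count` is hypothetical, each input string stands in for one split handled by one mapper, and `zlib.crc32` stands in for the partitioner's hash function.

```python
import zlib
from collections import defaultdict


def run_word_count(splits, num_reducers=2):
    """Simulate read -> map -> combine -> partition -> shuffle/sort -> reduce."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]

    for split in splits:                        # one mapper per split
        local_counts = defaultdict(int)
        for line in split.splitlines():         # RecordReader: one record per line
            for word in line.split():           # Mapper: emit (word, 1)
                local_counts[word.lower()] += 1  # Combiner: sum within this mapper
        for word, count in local_counts.items():
            # Partitioner: the same word always goes to the same reducer.
            index = zlib.crc32(word.encode("utf-8")) % num_reducers
            partitions[index][word].append(count)

    # Shuffle and sort, then reduce: each reducer sums the values per sorted key.
    return [
        [(word, sum(part[word])) for word in sorted(part)]
        for part in partitions
    ]


result = run_word_count(["deer bear river\ncar car river", "deer car bear"])
totals = dict(pair for part in result for pair in part)
print(result)
print(totals)
```

Each inner list of `result` corresponds to the output of one reducer, and no word appears in more than one of them.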

If you are interested in learning more about Big Data, check out our Advanced Certificate Programme in Big Data from IIIT Bangalore.

Frequently Asked Questions (FAQs)

1. Between Hadoop and MapReduce, which one is a better choice?

Hadoop and MapReduce are not direct alternatives, because MapReduce is one of Hadoop's components. Apache Hadoop is an open-source framework that stores data in a distributed file system, using name nodes and data nodes, and makes distributing and processing that data across a scalable cluster a hassle-free task. MapReduce, on the other hand, is the programming model that sorts and processes big data sets as key-value pairs using a distributed algorithm, and it offers high availability and fault tolerance. MapReduce in Hadoop is built on the Java programming language, whereas Hadoop as a whole uses multiple programming languages depending on the module.

2. What is the hardware configuration of NameNode and DataNode?

The hardware configuration of a node depends on a number of factors, including how heavily the cluster is used, and varies from one node to another. As an example, a NameNode might use 2 quad-core CPUs running at 2 GHz with 128 GB of RAM, 10 Gigabit Ethernet, and 6 TB of Serial ATA disk space. A DataNode might also use 2 quad-core CPUs running at 2 GHz, with 64 GB of RAM, 10 Gigabit Ethernet, and 24 TB of Serial ATA disk space.

3. What are some of the techniques for MapReduce job optimization?

First and foremost, proper cluster configuration is necessary to improve input-output performance. It is also important to regularly check the graphs, network usage reports, and performance metrics, and to constantly monitor the health of the hard drives. Using LZO compression is another technique for MapReduce job optimization: compressing the map output that Hadoop jobs create reduces the amount of data written to disk and transferred to the reducers. LZO adds some load on the CPU, but the savings in input-output usually make up for it.
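The trade-off behind compressing map output can be seen with a small Python experiment. LZO is not part of Python's standard library, so this sketch uses `zlib` at its fastest level purely as a stand-in; the sample data is invented, and the achievable ratio on real intermediate data will differ.

```python
import zlib

# Hypothetical intermediate map output: many repeated "word<TAB>1" lines.
words = ["car", "river", "bear"] * 1000
map_output = "".join(f"{word}\t1\n" for word in words).encode("utf-8")

compressed = zlib.compress(map_output, 1)   # level 1: favour speed over ratio
restored = zlib.decompress(compressed)

print(len(map_output), len(compressed))
print(restored == map_output)
```

Repetitive key-value data like this shrinks considerably, which is why spending some CPU time on compression can pay off in reduced disk and network traffic.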