Apache Hive Architecture & Commands: Modes, Characteristics & Applications

Updated on 31 October, 2022

What is Hive?

Apache Hive is an open-source data warehousing tool, originally developed at Facebook, for distributed processing and data analytics. It is built on top of Hadoop and works with data stored in the Hadoop Distributed File System (HDFS). Hive provides a mechanism for projecting structure onto the data in Hadoop, and a SQL-like language called HiveQL (HQL) for querying that data. Tables in Hive resemble tables in a relational database, so anyone familiar with SQL can write Hive queries easily.

A few features of Hive are:

  • It stores schema information in a database and the processed data in HDFS.
  • Designed for OLAP.
  • The querying language is HiveQL or HQL, which is similar to SQL.
  • It is fast, familiar, scalable, and extensible.

Uses of Hive

  • It provides data warehousing on top of Hadoop's distributed storage.
  • It provides tools that let users easily extract, transform, and load data.
  • It offers a way to impose structure on a variety of data formats.
  • It can access files stored in the Hadoop Distributed File System (HDFS).

Commands of Hive

The Hive commands are:

1. Data Definition Language (DDL): These commands build and modify the tables and other objects in the database.

  • CREATE: Creates a table or database.
  • SHOW: Lists databases, tables, properties, etc.
  • ALTER: Changes an existing table.
  • DESCRIBE: Describes the columns of a table.
  • TRUNCATE: Removes all rows from a table or partition while keeping the table definition.
  • DELETE: Removes rows from a table. Strictly speaking it is a DML statement, and Hive supports it only on transactional (ACID) tables.
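
The DDL commands listed above can be sketched in HiveQL as follows. This is a minimal illustration, not a recommended schema: the database name sales_db, the table name txt_records, and its columns are hypothetical.

```sql
-- Hypothetical database, table, and column names, used only for illustration.
CREATE DATABASE IF NOT EXISTS sales_db;
USE sales_db;

-- CREATE: define a managed table backed by comma-delimited text files.
CREATE TABLE IF NOT EXISTS txt_records (
  txn_id   INT,
  category STRING,
  amount   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- SHOW: list the objects Hive knows about.
SHOW DATABASES;
SHOW TABLES;

-- DESCRIBE: print the column names and types of the table.
DESCRIBE txt_records;

-- ALTER: change the existing table, here by adding a column.
ALTER TABLE txt_records ADD COLUMNS (city STRING);

-- TRUNCATE: remove all rows but keep the table definition.
TRUNCATE TABLE txt_records;

-- DROP: remove the table definition as well.
DROP TABLE IF EXISTS txt_records;
```

The statements can be run one by one in the Hive CLI or Beeline. The IF NOT EXISTS and IF EXISTS guards make the script safe to re-run.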

2. Data Manipulation Language (DML): These commands retrieve, store, modify, delete, insert, and update data in the database.

  • Syntax for the LOAD statement:
LOAD DATA [LOCAL] INPATH '<file path>' [OVERWRITE] INTO TABLE <tablename>

  • After the data is loaded, data manipulation commands are used to retrieve it.
  • The COUNT aggregate function counts the total number of records in a table.
  • The “create external” keyword creates a table and specifies the location where its data is stored. An EXTERNAL table can point to any HDFS location for its storage.
  • Insert commands load data into a Hive table. “insert overwrite” replaces the existing data, and “insert into” appends to the existing data.
  • A table is divided into partitions by the “partitioned by” clause and into buckets by the “clustered by” clause.
  • Inserting data into a partitioned table throws errors if dynamic partitioning is not enabled. Therefore, the following parameters are to be set in the Hive shell.

set hive.exec.dynamic.partition=true;

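This enables dynamic partitions, which are disabled by default in older Hive releases. The next setting switches the partition mode from strict, which requires at least one static partition, to nonstrict: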

set hive.exec.dynamic.partition.mode=nonstrict;
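
Putting the DML pieces above together, a load into a partitioned and bucketed table might look like the sketch below. All table names, column names, and file paths are hypothetical, and the bucket count of 4 is an arbitrary choice. Older Hive releases may additionally need hive.enforce.bucketing set to true for bucketed inserts.

```sql
-- All table names, columns, and paths below are hypothetical.

-- External table: Hive records only the metadata; the files stay at LOCATION.
CREATE EXTERNAL TABLE IF NOT EXISTS txt_records_raw (
  txn_id   INT,
  category STRING,
  amount   DOUBLE,
  country  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/raw/txt_records';

-- LOAD: copy a file from the local file system into the table's directory.
LOAD DATA LOCAL INPATH '/tmp/txt_records.csv' INTO TABLE txt_records_raw;

-- Count the records that were loaded.
SELECT COUNT(*) FROM txt_records_raw;

-- Managed table, partitioned by country and bucketed by txn_id.
CREATE TABLE IF NOT EXISTS txt_records_part (
  txn_id   INT,
  category STRING,
  amount   DOUBLE
)
PARTITIONED BY (country STRING)
CLUSTERED BY (txn_id) INTO 4 BUCKETS
STORED AS ORC;

-- Allow Hive to derive partition values from the data being inserted.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- INSERT OVERWRITE replaces existing data; the partition column comes last.
INSERT OVERWRITE TABLE txt_records_part PARTITION (country)
SELECT txn_id, category, amount, country
FROM txt_records_raw;

-- INSERT INTO appends to whatever the table already holds.
INSERT INTO TABLE txt_records_part PARTITION (country)
SELECT txn_id, category, amount, country
FROM txt_records_raw;
```

With dynamic partitioning, Hive creates one partition per distinct country value found in the SELECT output instead of requiring each partition to be named in the statement.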

  • The ‘Drop Table’ command deletes the data and metadata of a table. For an external table, only the metadata is removed.
  • Aggregation: Syntax:
SELECT COUNT(DISTINCT category) FROM tablename;

This command counts the distinct values in the category column of the table.

  • Grouping: Syntax:
SELECT category, SUM(amount) FROM txt_records GROUP BY category;

The result set is grouped by one or more columns.

  • Join operation: Combines fields from two tables by using values common to both.
  • Left outer join: For tables A and B, the result contains all records of the “left” table (A), even if the join condition finds no matching record in the “right” table (B).
  • Right outer join: Every row from the “right” table (B) appears in the joined table at least once.
  • Full join: The joined table contains all records from both tables.
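
The aggregation, grouping, and join operations above can be tried with a sketch like the following. The two tables, txt_records and customers, and their columns are hypothetical and exist only to make the queries concrete.

```sql
-- Hypothetical tables used only to illustrate the queries.
CREATE TABLE IF NOT EXISTS customers (
  cust_id INT,
  name    STRING
);

CREATE TABLE IF NOT EXISTS txt_records (
  txn_id   INT,
  cust_id  INT,
  category STRING,
  amount   DOUBLE
);

-- Aggregation: number of distinct categories.
SELECT COUNT(DISTINCT category) FROM txt_records;

-- Grouping: total amount per category.
SELECT category, SUM(amount) AS total_amount
FROM txt_records
GROUP BY category;

-- Left outer join: every customer, with NULLs where no transaction matches.
SELECT c.cust_id, c.name, t.amount
FROM customers c
LEFT OUTER JOIN txt_records t ON c.cust_id = t.cust_id;

-- Right outer join: every transaction, with NULLs where no customer matches.
SELECT c.name, t.txn_id, t.amount
FROM customers c
RIGHT OUTER JOIN txt_records t ON c.cust_id = t.cust_id;

-- Full outer join: all rows from both sides, matched where possible.
SELECT c.name, t.txn_id, t.amount
FROM customers c
FULL OUTER JOIN txt_records t ON c.cust_id = t.cust_id;
```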

Hive Architecture

The Apache Hive architecture is shown in Figure 1.

List of Major Components

The major components of the Hive architecture are:

1. Hive client

Hive provides drivers that allow applications written in languages such as Java, Python, and C++ to communicate with it, so a client can be written in any language of choice. These clients in turn communicate with the Hive server in the Hive services.

Mostly they are categorized into three types:

  • Thrift Client: The Hive server is based on Apache Thrift, so it can serve requests from any Thrift client. Thrift-based applications use the Thrift client to communicate with Hive.
  • JDBC Client: Java applications connect to Hive through the JDBC driver, which in turn uses Thrift to communicate with the Hive server.
  • ODBC Client: Applications based on the ODBC protocol connect to Hive through the ODBC driver. Like JDBC, it uses Thrift to communicate with the Hive server.

2. Hive Services

Hive services mediate the interaction between Hive and its clients. Any query-related operation that a client wants to perform has to go through the Hive services. For Data Definition Language (DDL) operations, the CLI acts as the Hive service.

All drivers communicate with the Hive server and, through it, with the main driver in the Hive services. This main driver handles requests from client-specific applications of every type (JDBC, ODBC, etc.) and passes them on to the metastore and the file system for further processing.

Services offered by Hive are:

  • Beeline: Beeline is a command shell through which users submit queries to the system. It is a JDBC client based on the SQLLine CLI and is supported by HiveServer2.
  • HiveServer2: It lets clients execute queries against Hive. As the successor of HiveServer1, it supports multiple queries from multiple clients and provides better support for open API clients such as JDBC and ODBC.
  • Hive Driver: The user submits HiveQL statements to the Hive driver through the command shell. The driver creates session handles for the query and sends the query to the compiler.
  • Hive compiler: The Hive compiler parses the query. Using the metadata stored in the metastore, it performs semantic analysis and type checking on the different query blocks and expressions. It then generates an execution plan in the form of a DAG (Directed Acyclic Graph). Each stage of the DAG is a metadata operation, an operation on HDFS, or a map/reduce job.
  • Optimizer: The optimizer applies transformations to the execution plan. It improves efficiency and scalability by splitting the tasks.
  • Execution Engine: After compilation and optimization, the execution engine executes the plan created by the compiler. The stages of the plan are run on Hadoop in the order of their dependencies.
  • Metastore: The metastore is generally a relational database that stores metadata about the structure of tables and partitions. It is a central repository that also stores column names and column types. It also holds the serializer and deserializer information needed for read/write operations, along with the locations of the HDFS files that store the data. The metastore provides a Thrift interface for querying and manipulating Hive metadata.

Metastore can be configured in two modes:

  • Remote: In this mode the metastore is a Thrift service, which is useful for non-Java applications.
  • Embedded: In this mode, the client interacts with the metastore directly through JDBC.

Two further services are built around the metastore:

  • HCatalog: HCatalog is the table and storage management layer for Hadoop. It is built on top of the Hive metastore and exposes the metastore's tabular data to other data processing tools, such as Pig and MapReduce, so that they can read and write data on the grid.
  • WebHCat: WebHCat is an HTTP interface and REST API for HCatalog. It performs Hive metadata operations and offers a service for running Hadoop MapReduce (or YARN), Pig, and Hive jobs.
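
The work of the compiler, optimizer, and metastore described above can be inspected from HiveQL itself. The sketch below assumes a hypothetical table named txt_records. EXPLAIN prints the plan that the compiler produces, including the stages and their dependencies, and DESCRIBE FORMATTED prints what the metastore holds for a table.

```sql
-- Hypothetical table used only to have something to plan against.
CREATE TABLE IF NOT EXISTS txt_records (
  txn_id   INT,
  category STRING,
  amount   DOUBLE
);

-- Show the execution plan without running the query. The output lists
-- the stages of the plan and which stages depend on which.
EXPLAIN
SELECT category, SUM(amount) AS total_amount
FROM txt_records
GROUP BY category;

-- Show the metadata kept in the metastore for this table, such as
-- column types, storage location, and the SerDe in use.
DESCRIBE FORMATTED txt_records;
```

The exact layout of the EXPLAIN output varies between Hive versions and execution engines, so it is best read as a diagnostic aid rather than a stable format.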

3. Processing and Resource Management

Queries are executed by the MapReduce framework.

The MapReduce framework is a software framework for processing large amounts of data on large clusters of commodity hardware. The data is split into chunks and then processed by map-reduce tasks.
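
Which framework executes a query is controlled by a session setting. The sketch below uses the hive.execution.engine property. Whether engines other than MapReduce, such as Tez, are available depends on how the cluster is installed, so treat those values as assumptions about your environment.

```sql
-- Print the engine currently in use for this session.
SET hive.execution.engine;

-- Run subsequent queries as MapReduce jobs.
SET hive.execution.engine=mr;

-- Optionally fix the number of reduce tasks for the next job
-- (4 is an arbitrary value chosen for illustration).
SET mapreduce.job.reduces=4;
```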

4. Distributed Storage

The Hive services communicate with the Hive storage to perform the following actions:

  • The Hive “Meta storage database” holds the metadata of the tables created in Hive.
  • HDFS on the Hadoop cluster stores the query results and the data loaded into the tables.

Different Modes of Hive

Depending on the size of the data, Hive can operate in two modes.

1. Local Mode

The local mode of Hive is used when:

  • Hadoop is installed in pseudo-distributed mode with a single data node.
  • The data is small enough to fit on a single local machine.
  • Fast processing is expected on the local machine because the data sets are small.

2. MapReduce mode

The MapReduce mode of Hive is used when:

  • Hadoop has multiple data nodes, with the data distributed across them.
  • The data size is large and the query needs to be executed in parallel.
  • Large data sets need to be processed with better performance.
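
Hive can also choose between the two modes automatically, based on the size of the input. The following settings are a sketch of that behaviour. The threshold values shown are illustrative and should be tuned for the machine in question.

```sql
-- Let Hive run small jobs locally instead of submitting them to the cluster.
SET hive.exec.mode.local.auto=true;

-- Only consider local mode when the total input is below this many bytes
-- (134217728 bytes = 128 MB).
SET hive.exec.mode.local.auto.inputbytes.max=134217728;

-- Only consider local mode when the job reads at most this many files.
SET hive.exec.mode.local.auto.input.files.max=4;
```

Queries whose input exceeds either threshold continue to run in MapReduce mode on the cluster.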

Characteristics of Hive

  • Databases and tables are created first, and data is then loaded into the tables.
  • Hive can manage and query only structured data stored in tables.
  • When dealing with structured data, the Hive framework offers optimization and usability features that MapReduce lacks.
  • Hive's SQL-inspired language is simpler to use than programming MapReduce directly. It uses the familiar concepts of tables, rows, columns, etc.
  • To improve query performance, Hive can partition the data using a directory structure.
  • Hive contains an important component called the “Metastore”, which resides in a relational database and stores schema information. Two interfaces can be used to interact with Hive: a Web GUI and a Java Database Connectivity (JDBC) interface.
  • Most interactions take place through a command-line interface (CLI), which is used to write Hive queries in the Hive Query Language (HQL).
  • The HQL syntax is similar to SQL syntax.
  • Hive supports several file formats, including TEXTFILE, SEQUENCEFILE, ORC, and RCFILE (Record Columnar File).
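
The file formats and directory-based partitioning mentioned above are both chosen when a table is created. The sketch below uses hypothetical table names and columns. The STORED AS clause selects the file format, and each partition value becomes its own subdirectory, such as country=IN, under the table's directory.

```sql
-- Hypothetical tables used only for illustration.

-- Plain text storage, one comma-delimited record per line.
CREATE TABLE IF NOT EXISTS txt_records_text (
  txn_id   INT,
  category STRING,
  amount   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- ORC storage, partitioned by country.
CREATE TABLE IF NOT EXISTS txt_records_by_country (
  txn_id   INT,
  category STRING,
  amount   DOUBLE
)
PARTITIONED BY (country STRING)
STORED AS ORC;

-- Register a partition explicitly; its data lives in a country=IN directory.
ALTER TABLE txt_records_by_country ADD IF NOT EXISTS PARTITION (country = 'IN');

-- List the partitions known to the metastore.
SHOW PARTITIONS txt_records_by_country;

-- A filter on the partition column lets Hive read only the matching directory.
SELECT SUM(amount)
FROM txt_records_by_country
WHERE country = 'IN';
```

SEQUENCEFILE and RCFILE are selected the same way, by naming them in the STORED AS clause.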

Conclusion

Apache Hive is an open-source data warehousing tool whose major components are the Hive clients, the Hive services, the processing framework and resource management, and distributed storage.

It is built on top of the Hadoop ecosystem for processing structured and semi-structured data. The user interface provided by Hive lets users submit queries in Hive Query Language (HQL). The query is passed to the compiler, which generates an execution plan, and the plan is finally executed by the execution engine.

Frequently Asked Questions (FAQs)

1. What are the primary differences between Apache Hive and Apache HBase?

Apache Hive is a distributed data warehouse system that is queried with SQL-like functions. Apache HBase, on the other hand, does not need SQL to manage its distributed data and offers real-time, consistent access to petabytes of data. Secondly, Apache Hive has a defined schema for all the tables that it uses, whereas Apache HBase is schema-free. In terms of data, Apache Hive supports structured and semi-structured data, while Apache HBase is geared toward unstructured data. Apache Hive uses Apache Tez or MapReduce for batch processing, while Apache HBase follows real-time processing.

2. What is the real-time use case of Hive?

Airbnb connects people with places to stay. It has over 2.9 million hosts listed and supports more than 800k nightly stays. Airbnb runs Apache Hive on Amazon EMR over an S3 data lake. With Hive running on EMR clusters, Airbnb analysts can run SQL queries on the data stored in the S3 data lake. As a result, Airbnb has reduced its expenses and can now do cost attribution. Furthermore, its Apache Spark jobs run at three times their original speed.

3. What are some key benefits of working with Apache Hive?

Apache Hive is beneficial if you work with transactions, reports, queries, and data. It is easy to use, since anyone who understands SQL queries needs very little effort to get started. It is also scalable, cost-effective, and flexible, which makes storing large volumes of data convenient. Moreover, all of Hive's data is stored in HDFS, which gives it an upper hand over traditional databases. Another benefit is its capacity to execute more than 100,000 queries per hour. Also, insert-only tables in Apache Hive have a very low overhead, since no renaming is needed.