Apache Hive Architecture & Commands: Modes, Characteristics & Applications
By Rohit Sharma
Updated on Oct 31, 2022 | 10 min read | 6.76K+ views
Apache Hive is an open-source data warehousing tool, originally developed at Facebook, for distributed processing and data analytics. It is built on top of Hadoop and the Hadoop Distributed File System (HDFS). Hive provides a mechanism for projecting structure onto the data in Hadoop, and that data is queried with a SQL-like language called HiveQL (HQL). Hive tables resemble tables in a relational database, so anyone familiar with SQL can write Hive queries easily.
A few features of Hive are:
The Hive commands are:
1. Data Definition Language (DDL): These commands build and modify the tables and other objects in the database.
2. Data Manipulation Language (DML): These commands retrieve, store, modify, delete, insert, and update data in the database.
LOAD DATA [LOCAL] INPATH '<file path>' INTO TABLE <tablename>;
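set hive.exec.dynamic.partition=true;
This enables dynamic partitioning. The default was false in early Hive releases and is true from Hive 0.9.0 onward.
set hive.exec.dynamic.partition.mode=nonstrict;
This allows every partition column to be dynamic. The default mode, strict, requires at least one static partition column.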
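With dynamic partitioning, Hive takes each partition value from the rows being inserted instead of from the statement, and stores every partition under a directory named column=value. The Python sketch below only mimics that routing in memory to make the idea concrete. It is not Hive code, and the function name, the sample rows, and the `warehouse/sales` path are hypothetical.

```python
from collections import defaultdict


def partition_rows(rows, partition_col, table_dir="warehouse/sales"):
    """Group rows under a 'column=value' path, one path per distinct value.

    table_dir is a made-up location used only for illustration.
    """
    partitions = defaultdict(list)
    for row in rows:
        row = dict(row)                      # copy so the caller's row is untouched
        value = row.pop(partition_col)       # the partition value lives in the path
        partitions[f"{table_dir}/{partition_col}={value}"].append(row)
    return dict(partitions)


# Hypothetical sample rows.
rows = [
    {"id": 1, "amount": 20.0, "category": "books"},
    {"id": 2, "amount": 35.5, "category": "toys"},
    {"id": 3, "amount": 12.0, "category": "books"},
]

layout = partition_rows(rows, "category")
for path in sorted(layout):
    print(path, layout[path])
```

No partition value is named up front: the two paths come entirely from the distinct category values found in the data, which is the behaviour the nonstrict mode permits for every partition column.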
SELECT COUNT(DISTINCT category) FROM tablename;
This command counts the distinct values of the category column in the table.
SELECT category, SUM(amount) FROM txt_records GROUP BY category;
The result set is grouped by one or more columns. Here, the amounts are summed for each category.
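To see what these two aggregate queries return, the sketch below runs the same shape of query against an in-memory SQLite database through Python's standard sqlite3 module. SQLite is only a stand-in for Hive here, chosen because COUNT(DISTINCT ...) and GROUP BY behave the same way for queries this simple. The `records` table and its rows are hypothetical, and the ORDER BY clause is added only to make the output order predictable.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [("books", 20.0), ("toys", 35.5), ("books", 12.0)],
)

# Number of distinct categories in the table.
distinct_count = conn.execute(
    "SELECT COUNT(DISTINCT category) FROM records"
).fetchone()[0]

# Total amount per category.
totals = conn.execute(
    "SELECT category, SUM(amount) FROM records "
    "GROUP BY category ORDER BY category"
).fetchall()

print(distinct_count)
print(totals)
conn.close()
```

The first query collapses the three rows into a single count of distinct categories, while the second returns one row per category with its summed amount.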
The Apache Hive architecture is shown in Figure 1.
List of Major Components
The major components of the Hive architecture are:
Hive provides drivers through which applications written in languages such as Java, Python, and C++ can communicate with it, so a client can be written in whichever language is preferred. These clients and drivers in turn communicate with the Hive server in the Hive services.
Hive clients are mostly categorized into three types:
Hive services are the means by which Hive interacts with clients. Any query-related operation that a client needs to perform must go through the Hive services. For Data Definition Language (DDL) operations, the CLI acts as the Hive service.
All client drivers communicate with the Hive server, which passes their requests to the main driver in the Hive services. This main driver handles requests from every kind of client application, whether they arrive over JDBC, ODBC, or another interface, and forwards them to the metastore and the file system for further processing.
Services offered by Hive are:
Metastore can be configured in two modes:
The execution of the queries is carried out by an internal MapReduce framework.
The MapReduce framework is a software framework for processing large amounts of data on large clusters of commodity hardware. The data is split into chunks and then processed by map-reduce tasks.
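The split, map, and reduce steps described above can be sketched in a few lines of Python. This is a single-process illustration of the pattern under simplifying assumptions, not how Hadoop implements it: real map and reduce tasks run in parallel across a cluster. The function names and the sample records are hypothetical.

```python
from collections import defaultdict


def split_into_chunks(records, chunk_size):
    """Split the input into fixed-size chunks, one per map task."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_phase(chunk):
    """Emit a (key, value) pair for every record in the chunk."""
    return [(category, amount) for category, amount in chunk]


def shuffle(mapped_chunks):
    """Collect the values emitted by all map tasks under their key."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical (category, amount) records.
records = [("books", 20), ("toys", 35), ("books", 12), ("games", 8), ("toys", 5)]

chunks = split_into_chunks(records, 2)
mapped = [map_phase(chunk) for chunk in chunks]
totals = reduce_phase(shuffle(mapped))
print(totals)
```

The example computes a per-category sum, the same result a GROUP BY query with SUM would produce, which is why an aggregate query of that kind maps naturally onto map and reduce tasks.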
The Hive services communicate with the Hive storage for performing the following actions:
Depending on the size of the data, Hive can operate in two modes.
1. Local Mode
The local mode of Hive is used when
2. MapReduce Mode
The MapReduce mode of Hive is used when
Apache Hive is an open-source data warehousing tool consisting of major components like Hive clients, Hive services, Processing framework and Resource Management, and Distributed Storage.
It is built on top of the Hadoop ecosystem for processing structured and semi-structured data. The user interface provided by Hive lets users submit their queries in Hive Query Language (HQL). Each query is passed to the compiler, which generates an execution plan, and the plan is then run by the execution engine.
Apache Hive is a distributed data warehouse system that is queried with SQL-like statements. Apache HBase, on the other hand, does not use SQL to manage its distributed data and offers real-time, consistent access to petabytes of data. Hive defines a schema for every table it uses, whereas HBase is schema-free. In terms of data, Hive supports both structured and unstructured data, while HBase is geared toward unstructured data. Hive uses Apache Tez or MapReduce for batch processing, while HBase is designed for real-time processing.
Airbnb, which connects people with places to stay, has over 2.9 million hosts listed and supports more than 800k nightly stays. Airbnb runs Apache Hive on Amazon EMR over an S3 data lake. Running Hive on EMR clusters lets Airbnb analysts run SQL queries on the data stored in the S3 data lake. As a result, Airbnb has reduced its expenses and can now perform cost attribution. Its Apache Spark jobs also run at three times their original speed.
Apache Hive is well suited to work involving transactions, reports, queries, and data. It is easy to use, since anyone who knows SQL needs very little effort to write Hive queries. It is also scalable, cost-effective, and flexible, which makes storing very large volumes of data convenient. Moreover, all of Hive's data is stored in HDFS, which gives it an edge over traditional databases. Another cited benefit is its capacity to execute more than 100,000 queries per hour. Finally, insert-only tables in Apache Hive have very low overhead because no renaming is needed.
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...