Apache Hive Architecture & Commands: Modes, Characteristics & Applications
Updated on Oct 31, 2022 | 10 min read | 6.2k views
Apache Hive is an open-source data warehousing tool, originally developed at Facebook, for distributed processing and data analytics. It is built on top of the Hadoop Distributed File System (HDFS). Hive provides a mechanism for projecting structure onto the data stored in Hadoop, and that data is queried with a SQL-like language called HiveQL (HQL). Tables in Hive resemble tables in a relational database, so anyone familiar with SQL can write Hive queries easily.
A few features of Hive are:
The Hive commands are:
1. Data Definition Language (DDL): These commands create and modify tables and other objects in the database.
2. Data Manipulation Language (DML): These commands are used to retrieve, store, modify, delete, insert, and update data in the database.
LOAD DATA [LOCAL] INPATH '<file path>' INTO TABLE <tablename>;
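set hive.exec.dynamic.partition=true;
This enables dynamic partitions. The setting is false by default in older Hive releases, so check the default for your version.
set hive.exec.dynamic.partition.mode=nonstrict;
Nonstrict mode allows all partition columns to be determined dynamically; in strict mode, at least one partition column must be static.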
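To make the idea of dynamic partitioning more concrete, the following is a minimal Python sketch, not Hive code. It assumes Hive's `column=value` partition directory naming and shows how the partition for each row is derived from the row's own data rather than being named in the statement. The function name `dynamic_partition` and the sample rows are hypothetical.

```python
from collections import defaultdict


def dynamic_partition(rows, partition_column):
    """Group rows into Hive-style partition directories (column=value).

    Illustrative only: a hypothetical helper, not part of Hive.
    """
    partitions = defaultdict(list)
    for row in rows:
        # The partition value comes from the row itself, not from the statement.
        value = row[partition_column]
        directory = f"{partition_column}={value}"
        # The partition column is encoded in the directory name,
        # so it is dropped from the stored record.
        data = {k: v for k, v in row.items() if k != partition_column}
        partitions[directory].append(data)
    return dict(partitions)


# Hypothetical sample data
rows = [
    {"id": 1, "amount": 40.0, "category": "books"},
    {"id": 2, "amount": 15.5, "category": "games"},
    {"id": 3, "amount": 9.5, "category": "books"},
]

partitions = dynamic_partition(rows, "category")
for directory, data in sorted(partitions.items()):
    print(directory, data)
```

With static partitioning, by contrast, the partition value would be written out explicitly in the load or insert statement.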
SELECT COUNT(DISTINCT category) FROM tablename;
This command counts the number of distinct categories in the table (for example, a table named 'cate').
SELECT category, SUM(amount) FROM txt_records GROUP BY category;
The result set is grouped by one or more columns. Here, the amounts are summed separately for each category.
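The two aggregate queries above use standard SQL syntax that HiveQL shares with other engines, so they can be tried without a Hadoop cluster. The sketch below runs them with Python's built-in `sqlite3` module as a stand-in for Hive. The table name `sales` and its rows are hypothetical, and SQLite is not Hive, so this only illustrates what the queries return.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("books", 40.0), ("games", 15.5), ("books", 9.5), ("music", 20.0)],
)

# Count the distinct values in the category column.
distinct_count = conn.execute(
    "SELECT COUNT(DISTINCT category) FROM sales"
).fetchone()[0]

# Sum the amounts per category; ORDER BY makes the output order deterministic.
totals = conn.execute(
    "SELECT category, SUM(amount) FROM sales GROUP BY category ORDER BY category"
).fetchall()

print(distinct_count)
print(totals)
conn.close()
```

The first query returns a single number, while the second returns one row per category.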
The Apache Hive architecture is shown in Figure 1.
List of Major Components
The major components of the Hive architecture are Hive clients, Hive services, the processing framework and resource management, and distributed storage.
Hive provides different drivers so that applications written in languages such as Java, Python, and C++ can communicate with it. An application can be written in whichever language is preferred. The clients and drivers in turn communicate with the Hive server in the Hive services.
Mostly they are categorized into three types:
Hive services provide the means for Hive to interact with the clients. Any query-related operation that a client needs to perform has to go through the Hive services. For Data Definition Language (DDL) operations, the CLI acts as the Hive service.
All the drivers communicate with the Hive server and then with the main driver in the Hive services. This main driver handles the client-specific applications and all connection types such as JDBC and ODBC. It passes the requests from the different applications on to the metastore and the file systems for further processing.
Services offered by Hive are:
Metastore can be configured in two modes:
Internally, Hive executes queries as jobs on the MapReduce framework.
The MapReduce framework is a software framework for processing large amounts of data on large clusters of commodity hardware. The data is split into chunks and then processed by map-reduce tasks.
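The split, map, and reduce steps described above can be sketched in plain Python. This is a single-process illustration of the idea, not how Hadoop or Hive actually schedules work across a cluster. All function names and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(records, chunk_size):
    """Split the input into fixed-size chunks, one per map task."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Map step: emit a (key, value) pair for every record in the chunk."""
    return [(category, amount) for category, amount in chunk]


def shuffle(pairs):
    """Group all emitted values by key before the reduce step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_group(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


# Hypothetical (category, amount) records
records = [
    ("books", 40.0), ("games", 15.5), ("books", 9.5),
    ("music", 20.0), ("games", 4.5),
]

chunks = split_into_chunks(records, 2)
mapped = chain.from_iterable(map_chunk(chunk) for chunk in chunks)
result = dict(reduce_group(key, values) for key, values in shuffle(mapped).items())
print(result)
```

On a real cluster, each chunk would be handled by a separate map task, possibly on a different machine, and the grouped values would be sent over the network to the reduce tasks.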
The Hive services communicate with the Hive storage for performing the following actions:
Depending on the size of the data, Hive can operate in two modes.
1. Local Mode
The local mode of Hive is used when
2. MapReduce mode
The MapReduce mode of Hive is used when
Apache Hive is an open-source data warehousing tool whose major components are Hive clients, Hive services, the processing framework and resource management, and distributed storage.
It is built on top of the Hadoop ecosystem for processing structured and semi-structured data. The user interface provided by Hive lets users submit their queries in Hive Query Language (HQL). Each query is passed to the compiler, which generates an execution plan, and the plan is then run by the execution engine.