Explore Courses
Liverpool Business SchoolLiverpool Business SchoolMBA by Liverpool Business School
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA (Master of Business Administration)
  • 15 Months
Popular
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Business Administration (MBA)
  • 12 Months
New
Birla Institute of Management Technology Birla Institute of Management Technology Post Graduate Diploma in Management (BIMTECH)
  • 24 Months
Liverpool John Moores UniversityLiverpool John Moores UniversityMS in Data Science
  • 18 Months
Popular
IIIT BangaloreIIIT BangalorePost Graduate Programme in Data Science & AI (Executive)
  • 12 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with concentration in Generative AI
  • 3 Years
upGradupGradData Science Bootcamp with AI
  • 6 Months
New
University of MarylandIIIT BangalorePost Graduate Certificate in Data Science & AI (Executive)
  • 8-8.5 Months
upGradupGradData Science Bootcamp with AI
  • 6 months
Popular
upGrad KnowledgeHutupGrad KnowledgeHutData Engineer Bootcamp
  • Self-Paced
upGradupGradCertificate Course in Business Analytics & Consulting in association with PwC India
  • 06 Months
OP Jindal Global UniversityOP Jindal Global UniversityMaster of Design in User Experience Design
  • 12 Months
Popular
WoolfWoolfMaster of Science in Computer Science
  • 18 Months
New
Jindal Global UniversityJindal Global UniversityMaster of Design in User Experience
  • 12 Months
New
Rushford, GenevaRushford Business SchoolDBA Doctorate in Technology (Computer Science)
  • 36 Months
IIIT BangaloreIIIT BangaloreCloud Computing and DevOps Program (Executive)
  • 8 Months
New
upGrad KnowledgeHutupGrad KnowledgeHutAWS Solutions Architect Certification
  • 32 Hours
upGradupGradFull Stack Software Development Bootcamp
  • 6 Months
Popular
upGradupGradUI/UX Bootcamp
  • 3 Months
upGradupGradCloud Computing Bootcamp
  • 7.5 Months
Golden Gate University Golden Gate University Doctor of Business Administration in Digital Leadership
  • 36 Months
New
Jindal Global UniversityJindal Global UniversityMaster of Design in User Experience
  • 12 Months
New
Golden Gate University Golden Gate University Doctor of Business Administration (DBA)
  • 36 Months
Bestseller
Ecole Supérieure de Gestion et Commerce International ParisEcole Supérieure de Gestion et Commerce International ParisDoctorate of Business Administration (DBA)
  • 36 Months
Rushford, GenevaRushford Business SchoolDoctorate of Business Administration (DBA)
  • 36 Months
KnowledgeHut upGradKnowledgeHut upGradSAFe® 6.0 Certified ScrumMaster (SSM) Training
  • Self-Paced
KnowledgeHut upGradKnowledgeHut upGradPMP® certification
  • Self-Paced
IIM KozhikodeIIM KozhikodeProfessional Certification in HR Management and Analytics
  • 6 Months
Bestseller
Duke CEDuke CEPost Graduate Certificate in Product Management
  • 4-8 Months
Bestseller
upGrad KnowledgeHutupGrad KnowledgeHutLeading SAFe® 6.0 Certification
  • 16 Hours
Popular
upGrad KnowledgeHutupGrad KnowledgeHutCertified ScrumMaster®(CSM) Training
  • 16 Hours
Bestseller
PwCupGrad CampusCertification Program in Financial Modelling & Analysis in association with PwC India
  • 4 Months
upGrad KnowledgeHutupGrad KnowledgeHutSAFe® 6.0 POPM Certification
  • 16 Hours
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Science in Artificial Intelligence and Data Science
  • 12 Months
Bestseller
Liverpool John Moores University Liverpool John Moores University MS in Machine Learning & AI
  • 18 Months
Popular
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with concentration in Generative AI
  • 3 Years
IIIT BangaloreIIIT BangaloreExecutive Post Graduate Programme in Machine Learning & AI
  • 13 Months
Bestseller
IIITBIIITBExecutive Program in Generative AI for Leaders
  • 4 Months
upGradupGradAdvanced Certificate Program in GenerativeAI
  • 4 Months
New
IIIT BangaloreIIIT BangalorePost Graduate Certificate in Machine Learning & Deep Learning (Executive)
  • 8 Months
Bestseller
Jindal Global UniversityJindal Global UniversityMaster of Design in User Experience
  • 12 Months
New
Liverpool Business SchoolLiverpool Business SchoolMBA with Marketing Concentration
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA with Marketing Concentration
  • 15 Months
Popular
MICAMICAAdvanced Certificate in Digital Marketing and Communication
  • 6 Months
Bestseller
MICAMICAAdvanced Certificate in Brand Communication Management
  • 5 Months
Popular
upGradupGradDigital Marketing Accelerator Program
  • 05 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Corporate & Financial Law
  • 12 Months
Bestseller
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in AI and Emerging Technologies (Blended Learning Program)
  • 12 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Intellectual Property & Technology Law
  • 12 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Dispute Resolution
  • 12 Months
upGradupGradContract Law Certificate Program
  • Self paced
New
ESGCI, ParisESGCI, ParisDoctorate of Business Administration (DBA) from ESGCI, Paris
  • 36 Months
Golden Gate University Golden Gate University Doctor of Business Administration From Golden Gate University, San Francisco
  • 36 Months
Rushford Business SchoolRushford Business SchoolDoctor of Business Administration from Rushford Business School, Switzerland)
  • 36 Months
Edgewood CollegeEdgewood CollegeDoctorate of Business Administration from Edgewood College
  • 24 Months
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with Concentration in Generative AI
  • 36 Months
Golden Gate University Golden Gate University DBA in Digital Leadership from Golden Gate University, San Francisco
  • 36 Months
Liverpool Business SchoolLiverpool Business SchoolMBA by Liverpool Business School
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA (Master of Business Administration)
  • 15 Months
Popular
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Business Administration (MBA)
  • 12 Months
New
Deakin Business School and Institute of Management Technology, GhaziabadDeakin Business School and IMT, GhaziabadMBA (Master of Business Administration)
  • 12 Months
Liverpool John Moores UniversityLiverpool John Moores UniversityMS in Data Science
  • 18 Months
Bestseller
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Science in Artificial Intelligence and Data Science
  • 12 Months
Bestseller
IIIT BangaloreIIIT BangalorePost Graduate Programme in Data Science (Executive)
  • 12 Months
Bestseller
O.P.Jindal Global UniversityO.P.Jindal Global UniversityO.P.Jindal Global University
  • 12 Months
WoolfWoolfMaster of Science in Computer Science
  • 18 Months
New
Liverpool John Moores University Liverpool John Moores University MS in Machine Learning & AI
  • 18 Months
Popular
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with concentration in Generative AI
  • 3 Years
Rushford, GenevaRushford Business SchoolDoctorate of Business Administration (AI/ML)
  • 36 Months
Ecole Supérieure de Gestion et Commerce International ParisEcole Supérieure de Gestion et Commerce International ParisDBA Specialisation in AI & ML
  • 36 Months
Golden Gate University Golden Gate University Doctor of Business Administration (DBA)
  • 36 Months
Bestseller
Ecole Supérieure de Gestion et Commerce International ParisEcole Supérieure de Gestion et Commerce International ParisDoctorate of Business Administration (DBA)
  • 36 Months
Rushford, GenevaRushford Business SchoolDoctorate of Business Administration (DBA)
  • 36 Months
Liverpool Business SchoolLiverpool Business SchoolMBA with Marketing Concentration
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA with Marketing Concentration
  • 15 Months
Popular
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Corporate & Financial Law
  • 12 Months
Bestseller
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Intellectual Property & Technology Law
  • 12 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Dispute Resolution
  • 12 Months
IIITBIIITBExecutive Program in Generative AI for Leaders
  • 4 Months
New
IIIT BangaloreIIIT BangaloreExecutive Post Graduate Programme in Machine Learning & AI
  • 13 Months
Bestseller
upGradupGradData Science Bootcamp with AI
  • 6 Months
New
upGradupGradAdvanced Certificate Program in GenerativeAI
  • 4 Months
New
KnowledgeHut upGradKnowledgeHut upGradSAFe® 6.0 Certified ScrumMaster (SSM) Training
  • Self-Paced
upGrad KnowledgeHutupGrad KnowledgeHutCertified ScrumMaster®(CSM) Training
  • 16 Hours
upGrad KnowledgeHutupGrad KnowledgeHutLeading SAFe® 6.0 Certification
  • 16 Hours
KnowledgeHut upGradKnowledgeHut upGradPMP® certification
  • Self-Paced
upGrad KnowledgeHutupGrad KnowledgeHutAWS Solutions Architect Certification
  • 32 Hours
upGrad KnowledgeHutupGrad KnowledgeHutAzure Administrator Certification (AZ-104)
  • 24 Hours
KnowledgeHut upGradKnowledgeHut upGradAWS Cloud Practioner Essentials Certification
  • 1 Week
KnowledgeHut upGradKnowledgeHut upGradAzure Data Engineering Training (DP-203)
  • 1 Week
MICAMICAAdvanced Certificate in Digital Marketing and Communication
  • 6 Months
Bestseller
MICAMICAAdvanced Certificate in Brand Communication Management
  • 5 Months
Popular
IIM KozhikodeIIM KozhikodeProfessional Certification in HR Management and Analytics
  • 6 Months
Bestseller
Duke CEDuke CEPost Graduate Certificate in Product Management
  • 4-8 Months
Bestseller
Loyola Institute of Business Administration (LIBA)Loyola Institute of Business Administration (LIBA)Executive PG Programme in Human Resource Management
  • 11 Months
Popular
Goa Institute of ManagementGoa Institute of ManagementExecutive PG Program in Healthcare Management
  • 11 Months
IMT GhaziabadIMT GhaziabadAdvanced General Management Program
  • 11 Months
Golden Gate UniversityGolden Gate UniversityProfessional Certificate in Global Business Management
  • 6-8 Months
upGradupGradContract Law Certificate Program
  • Self paced
New
IU, GermanyIU, GermanyMaster of Business Administration (90 ECTS)
  • 18 Months
Bestseller
IU, GermanyIU, GermanyMaster in International Management (120 ECTS)
  • 24 Months
Popular
IU, GermanyIU, GermanyB.Sc. Computer Science (180 ECTS)
  • 36 Months
Clark UniversityClark UniversityMaster of Business Administration
  • 23 Months
New
Golden Gate UniversityGolden Gate UniversityMaster of Business Administration
  • 20 Months
Clark University, USClark University, USMS in Project Management
  • 20 Months
New
Edgewood CollegeEdgewood CollegeMaster of Business Administration
  • 23 Months
The American Business SchoolThe American Business SchoolMBA with specialization
  • 23 Months
New
Aivancity ParisAivancity ParisMSc Artificial Intelligence Engineering
  • 24 Months
Aivancity ParisAivancity ParisMSc Data Engineering
  • 24 Months
The American Business SchoolThe American Business SchoolMBA with specialization
  • 23 Months
New
Aivancity ParisAivancity ParisMSc Artificial Intelligence Engineering
  • 24 Months
Aivancity ParisAivancity ParisMSc Data Engineering
  • 24 Months
upGradupGradData Science Bootcamp with AI
  • 6 Months
Popular
upGrad KnowledgeHutupGrad KnowledgeHutData Engineer Bootcamp
  • Self-Paced
upGradupGradFull Stack Software Development Bootcamp
  • 6 Months
Bestseller
KnowledgeHut upGradKnowledgeHut upGradBackend Development Bootcamp
  • Self-Paced
upGradupGradUI/UX Bootcamp
  • 3 Months
upGradupGradCloud Computing Bootcamp
  • 7.5 Months
PwCupGrad CampusCertification Program in Financial Modelling & Analysis in association with PwC India
  • 5 Months
upGrad KnowledgeHutupGrad KnowledgeHutSAFe® 6.0 POPM Certification
  • 16 Hours
upGradupGradDigital Marketing Accelerator Program
  • 05 Months
upGradupGradAdvanced Certificate Program in GenerativeAI
  • 4 Months
New
upGradupGradData Science Bootcamp with AI
  • 6 Months
Popular
upGradupGradFull Stack Software Development Bootcamp
  • 6 Months
Bestseller
upGradupGradUI/UX Bootcamp
  • 3 Months
PwCupGrad CampusCertification Program in Financial Modelling & Analysis in association with PwC India
  • 4 Months
upGradupGradCertificate Course in Business Analytics & Consulting in association with PwC India
  • 06 Months
upGradupGradDigital Marketing Accelerator Program
  • 05 Months

Bucketing in Hive: Create Bucketed Table in Hive

Updated on 26 October, 2022

9.34K+ views
11 min read

Working with a big dataset can be challenging. There’s a lot to keep track of and one small error can disturb your entire workflow. One of the most prominent tools for managing large datasets is bucketing. 

This article will tell you about how you can perform bucketing in Hive. We’ll explore multiple implementations of this function through examples.

What is Bucketing in Hive?

Bucketing is a data organization technique. While partitioning and bucketing in Hive are quite similar concepts, bucketing offers the additional functionality of dividing large datasets into smaller and more manageable sets called buckets. 

With bucketing in Hive, you can decompose a table data set into smaller parts, making them easier to handle. Bucketing allows you to group similar data types and write them to one single file, which enhances your performance while joining tables or reading data. This is a big reason why we use bucketing with partitioning most of the time. 

When Do We Use Bucketing? 

Bucketing is a very useful functionality. If you haven’t used it before, you should keep the following points in mind to determine when to use this function:

  • When a column has a high cardinality, we can’t perform partitioning on it. A very high number of partitions will generate too many Hadoop files which would increase the load on the node. That’s because the node will have to keep the metadata of every partition, and that would affect the performance of that node.
  • You should use bucketing if your queries have several map-side joins. A map-side join is a process where you join two tables by only using the map function without using the reduce function. 

Highlights of Bucketing in Hive

Bucketing is based on the hashing function so it has the following highlights:

  • The hash_function depends on the kind of the bucketing column you have.
  • You should keep in mind that the Records with the same bucketed column would be stored in the same bucket. 
  • This function requires you to use the Clustered By clause to divide a table into buckets. 
  • In the table directory, the Bucket numbering is 1-based and every bucket is a file.
  • Bucketing is a standalone function. This means you can perform bucketing without performing partitioning on a table.
  • A bucketed table creates nearly equally distributed data file sections. 
  • Note that bucketing doesn’t ensure your table would be properly populated. So you’ll have to manage the Data Loading into the buckets yourself, which can be cumbersome. 

Read: Hive Vs Spark

Bucketing in Hive: Example #1

It’d be best to understand bucketing in Hive by using an example. We’ll use the following data for our example:

EMPID FIRSTNAME LASTNAME SPORTS CITY COUNTRY
1001 Emerry Blair Basketball Qutubullapur San Marino
1002 Zephr Stephenson Cricket Neerharen Dominican Republic
1003 Autumn Bean Basketball Neerharen Dominican Republic
1004 Kasimir Vance Badminton Neerharen Dominican Republic
1005 Mufutau Flores   Qutubullapur San Marino
1006 Ayanna Banks Football Neerharen Dominican Republic
1007 Selma Ball Tennis Qutubullapur San Marino
1008 Berk Fuller Badminton Neerharen Dominican Republic
1009 Imogene Terrell   Qutubullapur San Marino
1010 Colorado Hutchinson Tennis Qutubullapur San Marino

Our sample data contains employee information for a sports team. However, some of the employees are not a part of any team. 

Here’s the sample data you can copy-paste to follow along with this example:

id,FirstName,LastName,Sports,City,Country

1001,Emerry, Blair, Basketball, Qutubullapur, San Marino

1002, Zephr, Stephenson, Cricket, Neerharen, Dominican Republic

1003, Autumn, Bean, Basketball, Neerharen, Dominican Republic

1004, Kasimir, Vance, Badminton, Neerharen, Dominican Republic

1005, Mufutau, Flores, Qutubullapur, San Marino

1006, Ayanna, Banks, Football, Neerharen, Dominican Republic

1007,Selma,Ball,Tennis,Qutubullapur,San Marino

1008, Berk, Fuller, Badminton, Neerharen, Dominican Republic

1009,Imogene,Terrell,,Qutubullapur,San Marino

1010,Colorado,Hutchinson,Tennis,Qutubullapur,San Marino

We already know that bucketing allows us to cluster datasets into smaller sections for optimization. Let’s now discuss how one completes this process:

Creating the Base Table

First, we’ll create a table called employee_base:

CREATE TABLE db_bdpbase.employee_base (

    emplid     INT,

    firstname  STRING,

    lastname   STRING,

     sports STRING,

    city       STRING,

    country    STRING

)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ‘,’

STORED AS TEXTFILE

TBLPROPERTIES(“skip.header.line.count”=”1”);

Our sample data has a header which is not needed for bucketing, so we’ll remove it by adding the ‘skip header’ property.

Loading the data into the Base Table

We’ll use the location ‘/usr/bdp/hive/sample_data.csv’ for our sample data and use the following command for loading it into the table:

LOAD DATA INPATH ‘/user/bdp/hive/sample_data.csv’ INTO TABLE  db_bdpbase.employee_base;

Creating the Bucketed Table

In this section, we’ll create a bucketed table. Now we can either make a bucketed table with a partition or without partition. 

Bucketed Table With Partition

In this case, the country is the partition column and we have bucketed the empid column that we sorted in ascending order:

CREATE TABLE db_bdpbase.bucketed_partition_tbl (

empid     INT,

firstname  STRING,

lastname   STRING,

sports STRING,

city       STRING

) PARTITIONED BY(country STRING)

CLUSTERED BY (empid)

SORTED BY (empid ASC) INTO 4 BUCKETS;

Bucketed Table Without Partition

Alternatively, we can create a bucketed table without partition: 

CREATE TABLE db_bdpbase.bucketed_tbl_only (

empid     INT,

firstname  STRING,

lastname   STRING,

city       STRING,

Country STRING

)

CLUSTERED BY (empid)

SORTED BY (empid ASC) INTO 4 BUCKETS;

Here, we have bucketed the table on the same column empid. 

Setting the Property

The default setting for bucketing in Hive is disabled so we enabled it by setting its value to true. The following property would select the number of the clusters and reducers according to the table:

SET hive.enforce.bucketing=TRUE; (NOT needed IN Hive 2.x onward)

Loading Data Into the Bucketed Table

So far, we have created two bucketed tables and a base table with our sample data. Now we’ll load the data into the bucketed table from the base table by using the following command in the bucketed table with partition:

INSERT OVERWRITE TABLE db_bdpbase.bucketed_partition_tbl PARTITION (country) SELECT * FROM db_bdpbase.employee_base;

To load data into the bucketed table without any partition, we’ll use the following command:

INSERT OVERWRITE TABLE db_bdpbase.bucketed_tbl_only SELECT * FROM db_bdpbase.employee_base;

Checking the Bucketed Table Data

After loading the data into the bucketed table, we will check how it is stored in the HDFS. We’ll use the following code to check the bucketed table with partition:

hadoop fs -ls hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/db_bdpbase.db/bucketed_partition_tbl

Data Storage in Bucketed Tables

Every data point gets mapped to a specific according to the following formula:

hash_function(bucket_column) mode num_bucket

Now, consider the first table which we partitioned based on the country, our sample data will get divided into the following sections:

EMPID FIRSTNAME LASTNAME SPORTS CITY COUNTRY
1002 Zephr Stephenson Cricket Neerharen Dominican Republic
1003 Autumn Bean Basketball Neerharen Dominican Republic
1004 Kasimir Vance Badminton Neerharen Dominican Republic
1006 Ayanna Banks Football Neerharen Dominican Republic
1008 Berk Fuller Badminton Neerharen Dominican Republic
EMPID FIRSTNAME LASTNAME SPORTS CITY COUNTRY
1001 Emerry Blair Basketball Qutubullapur San Marino
1005 Mufutau Flores   Qutubullapur San Marino
1007 Selma Ball Tennis Qutubullapur San Marino
1009 Imogene Terrell   Qutubullapur San Marino
1010 Colorado Hutchinson Tennis Qutubullapur San Marino

For Domincan Republic, every row will be stored in the bucket:

hash_function(1002) mode 4 = 2 (Representing index of bucket)

hash_function(1003) mode 4 = 3

hash_function(1004) mode 4 = 0

hash_function(1006) mode 4 = 2

hash_function(1008) mode 4 = 0

Note that the hash_function of INT value will give you the same result. You can check the data in every file at the HDFS location. If you want, you can repeat this process for other countries present in the database. 

Bucketing in Hive: Example #2

As we have already covered the various steps and procedures present in implementing this function, we can try it out easily. The following is a simple example of bucketing in Hive. Here, we have only bucketed the available data into different parts so we can manage it more easily: 

0: jdbc:hive2://cdh-vm.dbaglobe.com:10000/def> create table monthly_taxi_fleet6  

. . . . . . . . . . . . . . . . . . . . . . .> (month char(7),fleet smallint,company varchar(50))  

. . . . . . . . . . . . . . . . . . . . . . .> clustered by (company) into 3 buckets  

. . . . . . . . . . . . . . . . . . . . . . .> stored as avro;

Example using Apache Hive version 1.1.0-cdh5.13.1, hive.enforce.bucketing=false by default  

0: jdbc:hive2://cdh-vm.dbaglobe.com:10000/def> insert into monthly_taxi_fleet6  

. . . . . . . . . . . . . . . . . . . . . . .> select month,fleet,company from monthly_taxi_fleet;

[upgrad@cdh-vm ~]$ hdfs dfs -ls -R /user/hive/warehouse/monthly_taxi_fleet6  

-rwxrwxrwt   1 upgrad hive  25483 2017-12-26 10:40 /user/hive/warehouse/monthly_taxi_fleet6/000000_0

— hive.enforce.bucketing: Whether bucketing is enforced. If true, while inserting into the table, bucketing is enforced.  

— Default Value: Hive 0.x: false, Hive 1.x: false, Hive 2.x: removed, which effectively makes it always true (HIVE-12331)

0: jdbc:hive2://cdh-vm.dbaglobe.com:10000/def> set hive.enforce.bucketing=true;

0: jdbc:hive2://cdh-vm.dbaglobe.com:10000/def> insert into monthly_taxi_fleet6  

. . . . . . . . . . . . . . . . . . . . . . .> select month,fleet,company from monthly_taxi_fleet;

[upgrad@cdh-vm ~]$ hdfs dfs -ls -R /user/hive/warehouse/monthly_taxi_fleet6  

-rwxrwxrwt   1 upgrad hive  13611 2017-12-26 10:43 /user/hive/warehouse/monthly_taxi_fleet6/000000_0  

-rwxrwxrwt   1 upgrad hive   6077 2017-12-26 10:43 /user/hive/warehouse/monthly_taxi_fleet6/000001_0  

-rwxrwxrwt   1 upgrad hive   6589 2017-12-26 10:43 /user/hive/warehouse/monthly_taxi_fleet6/000002_0

0: jdbc:hive2://cdh-vm.dbaglobe.com:10000/def> describe extended monthly_taxi_fleet6;  

+—————————–+—————————————————-+———-+–+  

|      col_name       |                 data_type                  | comment  |  

+—————————–+—————————————————-+———-+–+  

| month                   | char(7)                                        |      |  

| fleet                   | int                                            |      |  

| company                 | varchar(50)                                    |      |  

|                         | NULL                                           | NULL |  

| Detailed Table Information  | Table(tableName:monthly_taxi_fleet6, dbName:default, owner:upgrad, createTime:1514256031, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:month, type:char(7), comment:null), FieldSchema(name:fleet, type:smallint, comment:null), FieldSchema(name:company, type:varchar(50), comment:null)], location:hdfs://cdh-vm.dbaglobe.com:8020/user/hive/warehouse/monthly_taxi_fleet6, inputFormat:org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat, compressed:false, numBuckets:3, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.avro.AvroSerDe, parameters:{serialization.format=1}), bucketCols:[company], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{totalSize=26277, numRows=1128, rawDataSize=0, COLUMN_STATS_ACCURATE=true, numFiles=3, transient_lastDdlTime=1514256192}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE) |      |  

+—————————–+—————————————————-+———-+–+  

5 rows selected (0.075 seconds)

Checkout: Basic Hive Interview Questions

Bucketing in Hive: Example #3

Below is a little advanced example of bucketing in Hive. Here, we have performed partitioning and used the Sorted By functionality to make the data more accessible. This is among the biggest advantages of bucketing. You can use it with other functions to manage large datasets more efficiently and effectively. 

0: jdbc:hive2://cdh-vm.dbaglobe.com:10000/def> create table monthly_taxi_fleet7  

. . . . . . . . . . . . . . . . . . . . . . .> (month char(7),fleet smallint)  

. . . . . . . . . . . . . . . . . . . . . . .> partitioned by (company varchar(50))  

. . . . . . . . . . . . . . . . . . . . . . .> clustered by (month) sorted by (month)into 3 buckets

. . . . . . . . . . . . . . . . . . . . . . .> stored as avro;

0: jdbc:hive2://cdh-vm.dbaglobe.com:10000/def> insert into monthly_taxi_fleet7  

. . . . . . . . . . . . . . . . . . . . . . .> partition (company)  

. . . . . . . . . . . . . . . . . . . . . . .> select month,fleet,company from monthly_taxi_fleet;

[upgrad@cdh-vm ~]$ hdfs dfs -ls -R /user/hive/warehouse/monthly_taxi_fleet7  

drwxrwxrwt   – upgrad hive      0 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=CityCab  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=CityCab/000000_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=CityCab/000001_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=CityCab/000002_0  

drwxrwxrwt   – upgrad hive      0 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Comfort  

-rwxrwxrwt   1 upgrad hive    913 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Comfort/000000_0  

-rwxrwxrwt   1 upgrad hive    913 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Comfort/000001_0  

-rwxrwxrwt   1 upgrad hive    913 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Comfort/000002_0  

drwxrwxrwt   – upgrad hive      0 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Individual Yellow- Top  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Individual Yellow- Top/000000_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Individual Yellow- Top/000001_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Individual Yellow- Top/000002_0  

drwxrwxrwt   – upgrad hive      0 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Premier  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Premier/000000_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Premier/000001_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Premier/000002_0  

drwxrwxrwt   – upgrad hive      0 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Prime  

-rwxrwxrwt   1 upgrad hive    765 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Prime/000000_0  

-rwxrwxrwt   1 upgrad hive    765 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Prime/000001_0  

-rwxrwxrwt   1 upgrad hive    766 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Prime/000002_0  

drwxrwxrwt   – upgrad hive      0 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=SMRT  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=SMRT/000000_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=SMRT/000001_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=SMRT/000002_0  

drwxrwxrwt   – upgrad hive      0 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Smart  

-rwxrwxrwt   1 upgrad hive    720 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Smart/000000_0  

-rwxrwxrwt   1 upgrad hive    719 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Smart/000001_0  

-rwxrwxrwt   1 upgrad hive    719 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=Smart/000002_0  

drwxrwxrwt   – upgrad hive      0 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=TransCab  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=TransCab/000000_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=TransCab/000001_0  

-rwxrwxrwt   1 upgrad hive    865 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=TransCab/000002_0  

drwxrwxrwt   – upgrad hive      0 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=YTC  

-rwxrwxrwt   1 upgrad hive    432 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=YTC/000000_0  

-rwxrwxrwt   1 upgrad hive    432 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=YTC/000001_0  

-rwxrwxrwt   1 upgrad hive    432 2017-12-26 11:05 /user/hive/warehouse/monthly_taxi_fleet7/company=YTC/000002_0

Learn More About Partitioning and Bucketing in Hive

In the examples we shared before, we performed partitioning and bucketing in Hive in multiple ways and learned about how you can implement them in Hive. However, Apache Hive has many other functionalities and learning about all of them can be quite daunting. 

That’s why we recommend taking a data engineering course. It would allow you to study from industry experts who have spent years in this industry. A course provides you with a structured curriculum where you learn everything steps by step. At upGrad, we offer dedicated data engineering courses.

With our courses, you get access to upGrad’s Student Success Corner where you get personalized resume feedback, interview preparation, career counselling, and many other advantages. 

After the course completion, you’ll be a skilled data engineering professional. 

Conclusion

Bucketing in Hive is very simple and easy to perform. It is certainly a useful function for large datasets. However, when you perform both partitioning and bucketing in Hive together, you can manage quite humongous datasets very easily. 

If you are interested to know more about Big Data, check out our Advanced Certificate Programme in Big Data from IIIT Bangalore.

If you have any questions or thoughts regarding bucketing, do share them in the comments below. We’d love to hear from you. 

Check our other Software Engineering Courses at upGrad.

Frequently Asked Questions (FAQs)

1. What is the variance between Bucketing and Partitioning in Hive?

Bucketing decomposes data into manageable parts. Multiple small partitions based on column values can be created with partitioning. In bucketing, you restrict the number of buckets to store the data. Partitioning data is often used for distributing load horizontally; this has performance benefits and helps in organising data in a logical way. Both Partitioning and Bucketing in Hive deal with a large data set and are used to improve performance by eliminating table scans. Bucketing is considered useful in situations where the field has high cardinality, while partitioning is suitable when the cardinality is not very high.

2. What are the advantages of Bucketing in Hive?

Bucketing in Hive is the concept of breaking data down into ranges known as buckets. Hive Bucketing provides a faster query response. Due to equal volumes of data in each partition, joins at the Map side will be quicker. Bucketed tables allow faster execution of map side joins, as data is stored in equal-sized buckets. Also, efficient sampling happens for bucketed tables when compared to non-bucketed ones. Bucketing also improves performance by shuffling and sorting data prior to downstream operations such as table joins.

3. What is the Hive Metastore?

The Hive Metastore is simply a relational database. It stores metadata related to the tables you create to easily query Big Data stored in the Hadoop Distributed File System (HDFS). When a new Hive table is created, the information related to the schema (column names, data types) is stored in the Hive metastore relational database. The Hive metastore acts as a central schema repository which can be used by other access tools for Hive metadata. It is the central repository. The metadata is stored for Hive tables and partitions.