BUCKETING IN HIVE

What is Bucketing in Hive?

Besides partitioning, Apache Hive offers another technique for decomposing table data sets into more manageable parts. That technique is called Bucketing.

Why Bucketing?

Hive Partitioning provides a way of segregating Hive table data into multiple files/directories. However, it gives effective results only in a few scenarios, such as:

– When there is a limited number of partitions.
– Or, when the partitions are of comparatively equal size.

However, these conditions do not hold in all scenarios. For example, suppose we partition our tables by geographic location, such as country. A few large countries will then produce very large partitions (e.g. 4-5 countries alone contributing 70-80% of the total data).

Meanwhile, the data for small countries will create many small partitions (all the remaining countries in the world may contribute just 20-30% of the total data). In such a case, Partitioning is not ideal.

To solve this problem of over-partitioning, Hive offers the Bucketing concept. It is another effective technique for decomposing table data sets into more manageable parts.

Features of Bucketing in Hive

Bucketing applies a hash function to the bucketed column and takes the result mod the total number of buckets, i.e. bucket = hash_function(bucketing_column) mod num_buckets.

i. The hash_function depends on the type of the bucketing column.
ii. Records with the same value in the bucketed column will always be stored in the same bucket.
iii. To divide the table into buckets, we use the CLUSTERED BY clause.
iv. Physically, each bucket is just a file in the table directory. Bucket numbering is 1-based when buckets are referenced in a TABLESAMPLE clause.
v. Bucketing can be done on Hive tables along with Partitioning, or even without partitioning.
vi. Bucketed tables produce data file parts that are almost equally distributed.
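
The hash-then-mod rule above can be sketched in a few lines of Python. This is only an illustration: toy_hash is a hypothetical stand-in and not Hive's actual hash implementation, and the choice to hash integers to themselves and strings with a 31-based rolling hash is an assumption made for the example.

```python
def toy_hash(value):
    """Hypothetical type-dependent hash; NOT Hive's real implementation."""
    if isinstance(value, int):
        # Assumption for this sketch: an integer hashes to itself.
        return value
    # Assumption for this sketch: 31-based rolling hash over UTF-8 bytes.
    h = 0
    for b in str(value).encode("utf-8"):
        h = (31 * h + b) & 0xFFFFFFFF
    return h


def bucket_for(value, num_buckets):
    """Bucket index = non-negative hash mod the total number of buckets."""
    return (toy_hash(value) & 0x7FFFFFFF) % num_buckets


NUM_BUCKETS = 4

# The same key always lands in the same bucket (101 and 102 appear twice).
user_ids = [101, 102, 103, 104, 101, 205, 102]
assignments = [(uid, bucket_for(uid, NUM_BUCKETS)) for uid in user_ids]

# Evenly spread keys give buckets of almost equal size.
bucket_sizes = [0] * NUM_BUCKETS
for key in range(1000):
    bucket_sizes[bucket_for(key, NUM_BUCKETS)] += 1

print(assignments)
print(bucket_sizes)
```

Note that the bucket index in this sketch is 0-based, which is simply what the mod operation produces. The even spread also depends on the key values themselves being reasonably well distributed.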

Advantages of Bucketing in Hive

i. Compared with non-bucketed tables, bucketed tables offer more efficient sampling.
ii. Map-side joins are faster on bucketed tables than on non-bucketed tables, because the data files are parts of nearly equal size and, when both tables are bucketed on the join key, matching buckets can be joined with each other directly.
iii. Similar to partitioning, bucketed tables offer faster query responses than non-bucketed tables.
iv. Bucketing offers the flexibility to keep the records in each bucket sorted by one or more columns.
v. When the buckets are sorted, the join of each pair of buckets becomes an efficient merge-sort, which makes map-side joins even more efficient.
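
To see why sorted buckets help joins, here is a minimal Python sketch of a bucket-by-bucket merge join. It is a simplified model and not how Hive executes the join internally: the function names (bucketize, merge_join, bucketed_join) and the sample rows are hypothetical, and integer keys are assigned to buckets with a plain key mod num_buckets rule as an assumption.

```python
def bucketize(rows, key, num_buckets):
    """Split rows into buckets by key mod num_buckets, each sorted on key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[row[key] % num_buckets].append(row)
    for bucket in buckets:
        bucket.sort(key=lambda r: r[key])
    return buckets


def merge_join(left, right, key):
    """Join two lists that are already sorted on key, in one forward pass."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-side rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row with that key against the whole run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out


def bucketed_join(left_rows, right_rows, key, num_buckets):
    """Only bucket n of one table is ever compared with bucket n of the other."""
    left_buckets = bucketize(left_rows, key, num_buckets)
    right_buckets = bucketize(right_rows, key, num_buckets)
    result = []
    for lb, rb in zip(left_buckets, right_buckets):
        result.extend(merge_join(lb, rb, key))
    return result


users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 5, "name": "c"}]
orders = [
    {"id": 5, "amount": 30},
    {"id": 1, "amount": 10},
    {"id": 1, "amount": 20},
    {"id": 7, "amount": 5},
]
joined = bucketed_join(users, orders, "id", 4)
print(joined)
```

Because equal keys always hash to the same bucket, no match can be missed by comparing only same-numbered buckets, and because each bucket is sorted, each pair is joined without building a lookup table.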

Limitations of Bucketing in Hive

i. Declaring buckets on a table does not by itself ensure that the table is properly populated.
ii. So, we need to handle loading the data into the buckets ourselves.
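
The following Python sketch illustrates what "not properly populated" can mean. It assumes integer keys and a simplified key mod num_buckets rule, and the function name misplaced_rows and the sample data are hypothetical. In Hive itself, the usual way to avoid this situation is to populate a bucketed table with an INSERT ... SELECT from a staging table, so that rows are routed through the hash, rather than copying files into place as-is.

```python
def misplaced_rows(bucket_files, key, num_buckets):
    """Return (bucket_index, row) pairs for rows sitting in the wrong bucket."""
    bad = []
    for idx, rows in enumerate(bucket_files):
        for row in rows:
            # Simplified rule (assumption): an integer key belongs in
            # bucket number key mod num_buckets.
            if row[key] % num_buckets != idx:
                bad.append((idx, row))
    return bad


NUM_BUCKETS = 2

# Files copied in unchanged: rows stay wherever the input files had them.
loaded_as_is = [[{"id": 1}, {"id": 2}], [{"id": 3}, {"id": 4}]]

# Rows routed through the hash before being written.
loaded_by_hash = [[{"id": 2}, {"id": 4}], [{"id": 1}, {"id": 3}]]

print(misplaced_rows(loaded_as_is, "id", NUM_BUCKETS))
print(misplaced_rows(loaded_by_hash, "id", NUM_BUCKETS))
```

A table in the first state still looks bucketed from its definition, but sampling and bucket-based joins that trust the layout could return incomplete or wrong results.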

