HIVE INTRO

 

What is Hive?

Hive is an ETL and data warehousing tool built on top of the Hadoop Distributed File System (HDFS). Hive makes it easy to perform operations such as:

  • Data encapsulation
  • Ad-hoc queries
  • Analysis of huge datasets

Important characteristics of Hive

  • In Hive, tables and databases are created first and then data is loaded into these tables.
  • As a data warehouse, Hive is designed for managing and querying only structured data that is stored in tables.
  • When dealing with structured data, plain MapReduce lacks optimization and usability features such as UDFs, whereas the Hive framework provides them. Query optimization here means executing a query in a way that is efficient in terms of performance.
  • Hive’s SQL-inspired language shields the user from the complexity of MapReduce programming. It reuses familiar concepts from the relational database world, such as tables, rows, columns and schemas, for ease of learning.
  • Hadoop’s programming model works on flat files, so Hive can use directory structures to “partition” data and improve performance on certain queries.
  • An important component of Hive is the Metastore, which stores schema information. The Metastore typically resides in a relational database. We can interact with Hive using methods such as:
    • Web GUI 
    • Java Database Connectivity (JDBC) interface
  • Most interactions tend to take place over a command line interface (CLI). Hive provides a CLI for writing queries in Hive Query Language (HQL).
  • Generally, HQL syntax is similar to the SQL syntax that most data analysts are familiar with. The sample query below displays all the records present in the named table.
    • Sample query: SELECT * FROM <TableName>;
  • Hive supports several file formats, including TEXTFILE, SEQUENCEFILE, ORC and RCFILE (Record Columnar File).
  • For single-user metadata storage, Hive uses the embedded Derby database; for multi-user or shared metadata, Hive is typically configured to use MySQL.
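The directory-based partitioning mentioned above can be illustrated with a small Python sketch. This is not Hive code and does not use any Hive API; it only mimics the idea that each partition value gets its own `key=value` sub-directory, so a query that filters on the partition column needs to open only one directory. The function names, the `country` column and the sample rows are all hypothetical.

```python
import os
import tempfile


def write_partitioned(base_dir, rows, partition_key):
    """Write each row into a sub-directory named <partition_key>=<value>."""
    for row in rows:
        part_dir = os.path.join(base_dir, f"{partition_key}={row[partition_key]}")
        os.makedirs(part_dir, exist_ok=True)
        # The partition column is encoded in the directory name,
        # so it is left out of the data file itself.
        values = [str(v) for k, v in row.items() if k != partition_key]
        with open(os.path.join(part_dir, "data.txt"), "a") as f:
            f.write(",".join(values) + "\n")


def read_partition(base_dir, partition_key, wanted_value):
    """Read only the directory that matches the filter; other partitions are never opened."""
    path = os.path.join(base_dir, f"{partition_key}={wanted_value}", "data.txt")
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [line.rstrip("\n").split(",") for line in f]


# Hypothetical sample data: (id, amount) rows partitioned by country.
rows = [
    {"id": 1, "amount": 250, "country": "IN"},
    {"id": 2, "amount": 100, "country": "US"},
    {"id": 3, "amount": 75, "country": "IN"},
]

base = tempfile.mkdtemp()
write_partitioned(base, rows, "country")

partitions = sorted(os.listdir(base))          # one directory per country
in_rows = read_partition(base, "country", "IN")  # touches only country=IN
print(partitions)
print(in_rows)
```

A filter on `country` here maps directly to a directory lookup, which is the reason partitioning helps only for queries that filter on the partition column.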

To set up MySQL as the database for storing metadata, see the tutorial “Installation and Configuration of HIVE and MYSQL”.

Some of the key points about Hive:

  • The major difference between HQL and SQL is that a Hive query executes on Hadoop’s infrastructure rather than on a traditional database.
  • A Hive query executes as a series of automatically generated MapReduce jobs.
  • Hive supports partitioning and bucketing for easier retrieval of data when the client executes a query.
  • Hive supports custom UDFs (User Defined Functions) for data cleansing, filtering, etc. Programmers can define their own Hive UDFs according to their requirements.
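As a rough illustration of the last two points, the Python sketch below shows the general idea behind bucketing (a stable hash of a column value, taken modulo the number of buckets) together with a UDF-style cleansing function applied to each value. It is a conceptual sketch only: `zlib.crc32` is a stand-in and not the hash function Hive uses, and `NUM_BUCKETS`, `bucket_for`, `clean_name` and the sample records are hypothetical names chosen for this example.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket number using a stable hash (crc32 is a stand-in)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def clean_name(value):
    """UDF-style helper: trim whitespace and normalise the case of one value."""
    return value.strip().title()


# Hypothetical (user_id, raw_name) records.
records = [(101, "  alice "), (102, "BOB"), (103, "carol"), (101, "alice")]

buckets = defaultdict(list)
for user_id, raw_name in records:
    # Rows are assigned to a bucket by the hash of the bucketing column.
    buckets[bucket_for(user_id)].append((user_id, clean_name(raw_name)))

for bucket_id in sorted(buckets):
    print(bucket_id, buckets[bucket_id])
```

Because the hash is stable, rows with the same `user_id` always land in the same bucket, which is what makes lookups and joins on the bucketing column cheaper.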

Hive vs Relational Databases

Hive offers some capabilities that relational databases do not provide. For very large datasets, in the range of petabytes, being able to query the data and get results quickly is important. Hive does this efficiently by running queries across the nodes of a Hadoop cluster.

Let us now see what makes Hive so fast.

Relational databases follow “schema on write”: a table is created first, and data is checked against the table’s schema as it is inserted. On relational database tables, operations such as insertions, updates, and modifications can be performed.

Hive follows “schema on read” only: the schema is applied when the data is queried, not when it is loaded. A Hive query in a typical cluster runs on multiple Data Nodes, so updating and modifying data across multiple nodes is not possible in Hive versions before 0.14.

Hive also follows a “write once, read many” pattern: data is typically loaded once and then queried many times. In later Hive versions, a table can also be updated after data has been inserted.
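The schema-on-read idea described above can be sketched in a few lines of Python. The raw lines are stored untouched, and the schema is applied only when the data is read; a field that does not fit its declared type becomes `None`, similar in spirit to the NULL that Hive returns for such values in text tables. This is an illustration under those assumptions, not Hive's implementation, and `RAW_LINES`, `SCHEMA` and `read_with_schema` are hypothetical names.

```python
# Raw data as it sits in storage: nothing was validated when it was written.
RAW_LINES = ["1,alice,30", "2,bob,not_a_number", "3,carol,41"]

# The schema lives apart from the data and is applied only at read time.
SCHEMA = [("id", int), ("name", str), ("age", int)]


def read_with_schema(lines, schema, delimiter=","):
    """Parse raw lines against a schema; values that do not fit become None."""
    rows = []
    for line in lines:
        fields = line.split(delimiter)
        row = {col: None for col, _ in schema}  # missing fields stay None
        for (col, cast), raw in zip(schema, fields):
            try:
                row[col] = cast(raw)
            except ValueError:
                row[col] = None  # bad value surfaces at read time, not load time
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, SCHEMA)
for row in rows:
    print(row)
```

A schema-on-write system would have rejected the second line at insert time; here the load succeeds and the problem appears only when the data is queried.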
