Posts

Showing posts from July, 2022

BUCKETING IN HIVE

What is Bucketing in Hive ? Basically, for decomposing table data sets into more manageable parts, Apache Hive offers another technique. That technique is what we call Bucketing in Hive. Why Bucketing? Basically, the concept of  Hive Partitioning  provides a way of segregating hive table data into multiple files/directories. However, it only gives effective results in few scenarios. Such as: – When there is the limited number of partitions. – Or, while partitions are of comparatively equal size. Although, it is not possible in all scenarios.  For example when are partitioning our tables based geographic locations like country. Hence, some bigger countries will have large partitions (ex: 4-5 countries itself contributing 70-80% of total data). While small countries data will create small partitions (remaining all countries in the world may contribute to just 20-30 % of total data). Hence, at that time Partitioning will not be ideal. Then, to solve that problem of over part...

HAVING VS WHERE

Image
  Difference between WHERE and HAVING The WHERE and HAVING clauses are discussed in depth in this article. They're also used to filter records in SQL queries. The difference between the WHERE and HAVING clause is the most common question posed during an interview time.  The main difference between them is that the WHERE clause is used to specify a condition for filtering records before any groupings are made, while the HAVING clause is used to specify a condition for filtering values from a group.  Before making the comparison, we will first know these sql clauses. WHERE Clause The WHERE clause in MySQL is used with select, Insert, update, and delete queries to filter data from the table or relation. It describes a specific condition when retrieving records from a single table or multiple tables using the join. If the specified condition is satisfied, it returns the particular value from the table. The where conditions on the s...

Hive Managed table vs external table

 Managed Tables VS External Tables Managed tables A managed table is stored under the  hive.metastore.warehouse.dir  path property, by default in a folder path similar to  /user/hive/warehouse/databasename.db/tablename/ . The default location can be overridden by the  location  property during table creation. If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. If the PURGE option is not specified, the data is moved to a trash folder for a defined duration. Use managed tables when Hive should manage the lifecycle of the table, or when generating temporary tables. External tables An external table describes the metadata / schema on external files. External table files can be accessed and managed by processes outside of Hive. External tables can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations. If the structure or partitioning of an external table is ch...

HIVE INTRO

  What is Hive? Hive is an ETL and Data warehousing tool developed on top of Hadoop Distributed File System (HDFS). Hive makes job easy for performing operations like  Data encapsulation Ad-hoc queries Analysis of huge datasets Important characteristics of Hive In Hive, tables and databases are created first and then data is loaded into these tables. Hive as data warehouse designed for managing and querying only structured data that is stored in tables. While dealing with structured data, Map Reduce doesn’t have optimization and usability features like UDFs but Hive framework does. Query optimization refers to an effective way of query execution in terms of performance. Hive’s SQL-inspired language separates the user from the complexity of Map Reduce programming. It reuses familiar concepts from the relational database world, such as tables, rows, columns and schema, etc. for ease of learning. Hadoop’s programming works on flat files. So, Hive can use directory structures to “...

SQOOP

  What is Apache Sqoop? Many of us still wonder what Apache Sqoop is, its architecture, features, uses, and how it is related to big data. In this Sqoop write up, we will talk about everything along with its requirements. Let’s get started! Apache Sqoop is a big data tool for transferring data between Hadoop and relational database servers. Sqoop is used to transfer data from RDBMS (relational database management system) like MySQL and Oracle to  HDFS  (Hadoop Distributed File System). Big Data Sqoop can also be used to transform data in Hadoop MapReduce and then export it into RDBMS. Sqoop is a data collection and ingestion tool used to import and export data between RDBMS and HDFS. SQOOP = SQL + HADOOP Why do we need Big Data Sqoop? Sqoop Big Data Tool is primarily used for bulk data transfer to and from relational databases or mainframes. Sqoop in Big Data can import from entire tables or allow the user to specify predicates to restrict data selection. You can write di...