SPARK - Optimization Techniques
Apache Spark is one of the best known Big Data processing engines on the market right now. It covers a wide range of use cases, from real-time processing (Spark Streaming) to graph processing (GraphX). For many organizations, investing in Spark has become an inevitable move. The craze for adoption comes down to two major factors:

- A big community and good integration with other well known projects
- Ease of use (the APIs are pretty simple and elegant to use)

Optimizations

Let's jump to the optimizations:

- Prefer narrow transformations over wide transformations
- Use columnar data formats for structured data
- Partition the data at the source for better overall performance
- Avoid converting a DataFrame to an RDD; it is expensive
- Use the broadcast hint for joins against smaller tables
- Use well informed Spark config parameters for your program

A short illustrative sketch for each of these follows below.

Prefer narrow transformations over wide transformations

Shuffling means moving data across partitions, and sometimes across the nodes as well. Narrow transformations such as map and filter work entirely within a partition, while wide transformations such as groupByKey, join and repartition force a shuffle, which is usually the most expensive step in a job.
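As a minimal sketch of the difference, the word-count style snippet below contrasts groupByKey, which shuffles every (word, 1) pair across the cluster, with reduceByKey, which pre-aggregates inside each partition before the shuffle. The input path is a placeholder, not a path from this article.

```scala
import org.apache.spark.sql.SparkSession

object NarrowVsWide {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("narrow-vs-wide").getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.textFile("hdfs:///tmp/input.txt")   // placeholder input path
      .flatMap(_.split("\\s+"))                        // narrow: stays within each partition
      .map(word => (word, 1))                          // narrow: no data movement

    // Wide transformation: groupByKey ships every single (word, 1) pair over the network.
    val countsWithGroupBy = pairs.groupByKey().mapValues(_.sum)

    // Better: reduceByKey combines values inside each partition first,
    // so far less data crosses the network during the shuffle.
    val countsWithReduceBy = pairs.reduceByKey(_ + _)

    countsWithReduceBy.take(10).foreach(println)
    spark.stop()
  }
}
```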
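Use columnar data formats for structured data

Columnar formats such as Parquet and ORC store data column by column, which lets Spark prune unread columns and push filters down to the scan. The sketch below shows one way this is typically done; the paths and column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object ColumnarFormats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("columnar-formats").getOrCreate()

    // One-time conversion of raw CSV data into Parquet (paths are placeholders).
    val rawCsv = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/events_csv")
    rawCsv.write.mode("overwrite").parquet("hdfs:///data/events_parquet")

    // Queries on the Parquet copy benefit from column pruning and predicate
    // pushdown: only the `country` and `amount` columns are read here.
    val totals = spark.read.parquet("hdfs:///data/events_parquet")
      .select("country", "amount")
      .where("country = 'US'")
      .groupBy("country")
      .sum("amount")

    totals.show()
    spark.stop()
  }
}
```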
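Partition the data at the source for better overall performance

Writing the data partitioned by a column that downstream queries filter on lets Spark skip whole directories instead of scanning the full dataset (partition pruning). A rough sketch, again with placeholder paths and a hypothetical event_date column:

```scala
import org.apache.spark.sql.SparkSession

object PartitionAtSource {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partition-at-source").getOrCreate()

    val events = spark.read.parquet("hdfs:///data/events_parquet")

    // Each distinct event_date value becomes its own directory on disk.
    events.write
      .mode("overwrite")
      .partitionBy("event_date")
      .parquet("hdfs:///data/events_by_date")

    // Reads that filter on event_date only touch the matching directories.
    val oneDay = spark.read
      .parquet("hdfs:///data/events_by_date")
      .where("event_date = '2020-01-01'")

    println(oneDay.count())
    spark.stop()
  }
}
```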
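Avoid converting a DataFrame to an RDD

Calling .rdd on a DataFrame drops out of the Catalyst optimizer and Tungsten's compact binary format, and forces row-by-row deserialization into JVM objects. Most per-row logic can be expressed with built-in column functions instead. A small sketch of the contrast, with a hypothetical users table and name column:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper}

object StayInDataFrames {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stay-in-dataframes").getOrCreate()

    val users = spark.read.parquet("hdfs:///data/users_parquet")

    // Expensive pattern: dropping to the RDD API loses the optimizer and
    // deserializes every row into a JVM object.
    val upperNamesRdd = users.rdd.map(row => row.getAs[String]("name").toUpperCase)

    // Cheaper: the same logic expressed with column functions stays inside
    // the optimized execution engine end to end.
    val upperNamesDf = users.select(upper(col("name")).as("name_upper"))

    upperNamesDf.show(5)
    spark.stop()
  }
}
```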
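Use the broadcast hint for joins against smaller tables

Broadcasting a small lookup table to every executor lets the large table be joined in place, instead of both sides being shuffled across the cluster. The broadcast function in org.apache.spark.sql.functions asks the planner for a broadcast hash join; the table and column names below are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-join").getOrCreate()

    val orders    = spark.read.parquet("hdfs:///data/orders_parquet")     // large fact table
    val countries = spark.read.parquet("hdfs:///data/countries_parquet")  // small lookup table

    // The broadcast hint ships the small table to every executor, so the
    // large table never needs to be shuffled for this join.
    val enriched = orders.join(broadcast(countries), Seq("country_code"))

    enriched.explain()  // the physical plan should show a BroadcastHashJoin
    enriched.show(5)
    spark.stop()
  }
}
```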
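Use well informed Spark config parameters for your program

Defaults such as 200 shuffle partitions rarely match a specific job's data volume. The sketch below sets a few commonly tuned parameters when building the session; the values shown are examples only, not recommendations from this article, and need to be sized to your own data and cluster.

```scala
import org.apache.spark.sql.SparkSession

object TunedSession {
  def main(args: Array[String]): Unit = {
    // Example values only; tune them to your data volume and cluster size.
    val spark = SparkSession.builder()
      .appName("tuned-session")
      .config("spark.sql.shuffle.partitions", "400")   // default is 200
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.sql.autoBroadcastJoinThreshold", "52428800")  // 50 MB, in bytes
      .getOrCreate()

    // A tiny sanity check that the tuned session works end to end.
    spark.range(1000000)
      .selectExpr("id % 100 as k", "id as v")
      .groupBy("k").count()
      .show(5)

    spark.stop()
  }
}
```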