Posts

SPARK - Optimization Techniques

Apache Spark is a well-known big data processing engine in the market right now. It covers a wide range of use cases, from real-time processing (Spark Streaming) to graph processing (GraphX). For many organizations, investing in Spark has become an inevitable move. The rush for adoption comes down to two major factors: a big community with good integration into other well-known projects, and ease of use (the APIs are simple and elegant to use).

Optimizations
Let's jump to the optimizations:
- Prefer narrow transformations over wide transformations
- Use columnar data formats for structured data
- Partition the data at the source for better overall performance
- Converting a DataFrame to an RDD is expensive; avoid it wherever possible
- Use the broadcast hint for joins against smaller tables
- Use well-informed Spark config parameters for your program

Narrow transformations over wide transformations
Shuffling means moving data across partitions, and sometimes across the nodes as well...
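To make a few of these points concrete (columnar formats, the broadcast hint, narrow versus wide transformations, and config parameters), here is a minimal PySpark sketch; the file paths, table contents, and column names are assumptions made up purely for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Hypothetical app name and config value, used only for illustration.
spark = (SparkSession.builder
         .appName("optimization-examples")
         # Well-informed config parameters: size shuffle partitions to the data volume.
         .config("spark.sql.shuffle.partitions", "200")
         .getOrCreate())

# Columnar format: keep structured data in Parquet rather than CSV/JSON.
orders = spark.read.parquet("/data/orders")          # large fact table (hypothetical path)
countries = spark.read.parquet("/data/countries")    # small dimension table (hypothetical path)

# Broadcast hint: ship the small table to every executor and avoid a shuffle join.
joined = orders.join(broadcast(countries), "country_code")

# filter is a narrow transformation (no shuffle); groupBy is a wide one (shuffles).
result = joined.filter(joined.amount > 0).groupBy("country_name").count()

result.write.mode("overwrite").parquet("/data/orders_by_country")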

MongoDB - Introduction

MongoDB is a cross-platform, document-oriented database that provides high performance, high availability, and easy scalability. MongoDB works on the concepts of collections and documents.

Database
A database is a physical container for collections. Each database gets its own set of files on the file system. A single MongoDB server typically has multiple databases.

Collection
A collection is a group of MongoDB documents. It is the equivalent of an RDBMS table. A collection exists within a single database, and collections do not enforce a schema: documents within a collection can have different fields. Typically, all documents in a collection serve a similar or related purpose.

Document
A document is a set of key-value pairs. Documents have a dynamic schema, which means that documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data. The following table shows the relationship of RDB...
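To make the database / collection / document hierarchy concrete, here is a minimal sketch using the PyMongo driver; the connection string, database name, collection name, and document fields are assumptions chosen only for illustration.

from pymongo import MongoClient

# Hypothetical connection string, database, and collection names.
client = MongoClient("mongodb://localhost:27017")
db = client["blogdb"]        # database: physical container for collections
posts = db["posts"]          # collection: roughly an RDBMS table, but schema-less

# Documents in the same collection may carry different fields (dynamic schema).
posts.insert_one({"title": "SPARK - Optimization Techniques", "tags": ["spark", "bigdata"]})
posts.insert_one({"title": "MongoDB - Introduction", "author": "admin", "views": 42})

# Query by a field the documents have in common.
for doc in posts.find({"title": {"$regex": "^SPARK"}}):
    print(doc)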

SPARK - Core Programming

Spark Core is the base of the whole project. It provides distributed task dispatching, scheduling, and basic I/O functionality. Spark uses a specialized fundamental data structure known as the RDD (Resilient Distributed Dataset), which is a logical collection of data partitioned across machines. RDDs can be created in two ways: by referencing datasets in external storage systems, or by applying transformations (e.g. map, filter, reduce, join) to existing RDDs. The RDD abstraction is exposed through a language-integrated API, which reduces programming complexity because the way applications manipulate RDDs is similar to manipulating local collections of data.

Spark Shell
Spark provides an interactive shell, a powerful tool to analyze data interactively. It is available in either Scala or Python. Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop Input Formats (such as H...
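A minimal PySpark sketch of the two ways to create RDDs and of chaining transformations on them; the input path and the lambdas are assumptions made up for illustration. In the interactive shell, the SparkContext is already provided as sc, so the SparkSession setup below would not be needed there.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

# Way 1: reference a dataset in an external storage system (hypothetical path).
lines = sc.textFile("/data/sample.txt")

# Way 2: derive new RDDs by applying transformations to existing ones,
# much like manipulating a local collection.
words = lines.flatMap(lambda line: line.split())
long_words = words.filter(lambda w: len(w) > 3)
total_chars = long_words.map(len).reduce(lambda a, b: a + b)  # action triggers execution

print(total_chars)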