1. What is Apache Spark and what are the benefits of Spark over MapReduce?
Spark is very fast. For in-memory workloads it can be up to 100x faster than Hadoop MapReduce.
In Hadoop MapReduce, you write many MapReduce jobs and then tie these jobs together using Oozie or shell scripts. This mechanism is very time consuming, and MapReduce jobs have high latency. Between two consecutive MapReduce jobs, the data has to be written to HDFS and read back from HDFS, which is slow. In Spark, this is avoided by using RDDs and keeping intermediate data in memory (RAM). Quite often, translating the output of one MapReduce job into the input of the next also requires extra glue code, because Oozie alone may not suffice.
In Spark, you can do basically everything from a single program or console (the PySpark or Scala shell) and get the results immediately. Switching between ‘running something on the cluster’ and ‘doing something locally’ is fairly easy and straightforward. This also means less context switching for the developer and more productivity.
In short, Spark is roughly equivalent to MapReduce and Oozie put together.
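As a rough illustration, here is a minimal PySpark sketch (the input path and the filter condition are made up) of a multi-stage pipeline that would otherwise take several chained MapReduce jobs, with HDFS writes in between:

from pyspark.sql import SparkSession

# A hypothetical multi-stage pipeline: read, filter, aggregate and inspect the
# result without writing intermediate output to HDFS between the stages.
spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/events.log")  # assumed input path
errors = lines.filter(lambda line: "ERROR" in line)             # stage 1
counts = (errors.map(lambda line: (line.split(" ")[0], 1))      # stage 2
                .reduceByKey(lambda a, b: a + b))
print(counts.take(10))  # the action triggers the whole chain at once

spark.stop()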
2. Is there any point of learning MapReduce, then?
MapReduce is a paradigm used by many big data tools, including Spark. So understanding the MapReduce paradigm and how to convert a problem into a series of MapReduce tasks is very important.
Many organizations have already written a lot of code in MapReduce, so for legacy reasons it is still required.
Almost every other tool, such as Hive or Pig, converts its queries into MapReduce phases. If you understand MapReduce, you will be able to optimize your queries better.
3. What are the downsides of Spark?
Spark utilizes memory heavily. So, in a shared environment, it might consume a little more memory for longer durations.
The developer has to be careful. A casual developer might make the following mistakes:
She may end up running everything on the local node instead of distributing the work over the cluster.
She might hit some web service too many times, because the same call now runs in parallel from many cluster nodes.
The first problem is well tackled by the Hadoop MapReduce paradigm: it ensures that the data your code is churning at any point of time is fairly small, so you cannot make the mistake of trying to handle the whole dataset on a single node.
The second mistake is possible in MapReduce too. While writing MapReduce, a user may hit a service from inside map() or reduce() too many times. This overloading of a service is also possible while using Spark.
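To make the first mistake concrete, here is a hedged PySpark sketch (the dataset path is hypothetical): calling collect() pulls the entire dataset onto the driver, effectively running everything on a single node, whereas keeping the computation in transformations leaves it distributed.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-mistakes").getOrCreate()
sc = spark.sparkContext

big = sc.textFile("hdfs:///data/large_dataset")  # assumed path to a large dataset

# Anti-pattern: collect() ships the whole dataset to the driver's memory.
# rows = big.collect()
# total = sum(len(r) for r in rows)

# Better: keep the computation distributed and bring back only the small result.
total = big.map(lambda r: len(r)).reduce(lambda a, b: a + b)
print(total)

spark.stop()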
4. Explain briefly: what is the architecture of Spark?
At the architecture level, from a macro perspective, Spark might look like this:
Spark Architecture
5) Interactive Shells or Job Submission Layer
4) API Binding: Python, Java, Scala, R, SQL
3) Libraries: MLLib, GraphX, Spark Streaming
2) Spark Core (RDD & Operations on it)
1) Spark Driver -> Executor
0) Scheduler or Resource Manager
0) Scheduler or Resource Manager:
At the bottom is the resource manager. This resource manager could be external, such as YARN or Mesos, or it could be internal if Spark is running in standalone mode. The role of this layer is to provide a playground in which the program can run in a distributed manner. For example, YARN (Yet Another Resource Negotiator) would create an application master and executor containers for any Spark application.
1) Spark Driver -> Executor:
One level above the scheduler is the actual Spark code that talks to the scheduler to get itself executed. This layer does the real work of execution. The Spark Driver, which would run inside the application master (in cluster mode), is part of this layer. The Spark Driver dictates what to execute, and the executors execute the logic.
2) Spark Core (RDD & Operations on it):
Spark Core is the layer which provides the maximum functionality. It provides abstractions such as the RDD and handles the execution of transformations and actions on it.
3) Libraries:
Additional, domain-specific functionality on top of Spark Core is provided by various libraries such as MLlib, GraphX, Spark Streaming, and DataFrames/Spark SQL.
4) API Bindings:
The API bindings for Python, Java, Scala, R, and SQL internally call the same core API from different languages.
5) Interactive Shells or Job Submission Layer:
The job submission APIs provide a way to submit bundled code. Spark also provides interactive shells (PySpark, SparkR, etc.), also called REPLs (Read-Eval-Print Loops), to process data interactively.
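To tie layers 2 and 5 together, here is what a short session in the PySpark shell (layer 5) might look like. It is only a minimal sketch with arbitrary sample data: the transformations build an RDD lineage in Spark Core (layer 2), and the final action makes the driver schedule tasks on the executors.

# Typed into the PySpark REPL, where `sc` (the SparkContext) is already available.
nums = sc.parallelize(range(1, 101))          # create an RDD (the Spark Core abstraction)
squares = nums.map(lambda x: x * x)           # transformation: only recorded, not executed
evens = squares.filter(lambda x: x % 2 == 0)  # another transformation
print(evens.count())                          # action: the driver schedules tasks on executors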
5. On which platforms can Apache Spark run?
Spark can run on the following platforms:
YARN (Hadoop): Since YARN can handle any kind of workload, Spark can run on YARN. There are two modes of execution: one in which the Spark driver is executed inside a container on a cluster node, and another in which the Spark driver is executed on the client machine. This is the most common way of running Spark.
Apache Mesos: Mesos is an open-source resource manager, and Spark can run on it.
EC2: If you do not want to manage the hardware yourself, you can run Spark on top of Amazon EC2. This makes Spark attractive for organizations that prefer not to maintain their own cluster hardware.
Standalone: If you have no resource manager installed in your organization, you can use the standalone mode, in which Spark provides its own resource manager. All you have to do is install Spark on all nodes of the cluster, tell each node about the others, and start the cluster. The nodes then start communicating with each other and can run jobs.
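For example (the hostname below is a placeholder; 7077 is the default standalone master port), once the standalone master and workers are running, an application can attach to this built-in resource manager simply by pointing its master URL at it:

from pyspark.sql import SparkSession

# "spark://master-host:7077" is a placeholder for your standalone master's URL.
spark = (SparkSession.builder
         .master("spark://master-host:7077")
         .appName("standalone-example")
         .getOrCreate())

print(spark.sparkContext.parallelize([1, 2, 3]).sum())
spark.stop()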
6. What are the various programming languages supported by Spark?
Though Spark is written in Scala, it lets users code in various languages, such as:
Scala
Java
Python
R (Using SparkR)
SQL (Using SparkSQL)
Also, by piping data through external commands, we can use practically any programming language or binary.
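For example, RDD.pipe hands each partition's records to an external command over stdin/stdout, so any language or binary can take part in the pipeline (the Unix tr command below is chosen only as an illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipe-example").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "hadoop", "yarn", "mesos"])

# Pipe each element through an external binary and read its stdout back as an RDD.
upper = words.pipe("tr '[:lower:]' '[:upper:]'")
print(upper.collect())

spark.stop()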
7. What are the various modes in which Spark runs on YARN? (Local vs Client vs Cluster Mode)
Apache Spark has two basic parts:
Spark Driver: which controls what to execute and where
Executor: Which actually executes the logic
While running Spark on YARN, the executors obviously run inside YARN containers, but the driver can run either on the machine the user is working from or inside one of the containers. The first option is known as YARN client mode, and the second as cluster mode. The following diagrams should give you a good idea:
YARN client mode: The driver runs on the client machine from which the job is submitted.
YARN cluster mode: The driver runs inside the cluster.
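As a hedged sketch of the difference: client mode is what you get when a session is started against YARN directly from your own machine or an edge node, while cluster mode is normally chosen at submit time. (Running this requires a working YARN configuration, e.g. HADOOP_CONF_DIR pointing at the cluster.)

from pyspark.sql import SparkSession

# YARN client mode: the driver is this very process, running wherever the code is
# launched; the executors run inside YARN containers.
spark = (SparkSession.builder
         .master("yarn")
         .appName("yarn-client-example")
         .getOrCreate())

print(spark.sparkContext.parallelize(range(10)).count())
spark.stop()

# YARN cluster mode is normally selected when submitting a packaged application,
# e.g. spark-submit --master yarn --deploy-mode cluster app.py; the driver then
# runs inside a container (the application master) rather than on the client machine.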