Spark Interview Questions - Part 1

 1. What is Apache Spark and what are the benefits of Spark over MapReduce?

  • Spark is very fast. When the data fits in memory, it can be up to 100x faster than Hadoop MapReduce.
  • In Hadoop MapReduce, you write many MapReduce jobs and then tie them together using Oozie or shell scripts. This mechanism is very time-consuming, and MapReduce tasks have heavy latency: between two consecutive MapReduce jobs, the data has to be written to HDFS and read back from it. In Spark, this is avoided by keeping intermediate results as RDDs in memory (RAM), as illustrated in the sketch after this list. Moreover, translating the output of one MapReduce job into the input of the next often requires extra code, because Oozie alone may not suffice.
  • In Spark, you can basically do everything from a single program or console (PySpark or Scala console) and get the results immediately. Switching between ‘running something on a cluster’ and ‘doing something locally’ is fairly easy and straightforward. This also means less context switching for the developer and more productivity.
  • Spark is, roughly speaking, MapReduce and Oozie put together.
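
For example, a pipeline that would otherwise need two chained MapReduce jobs (a word count followed by a filter on the counts) fits into a single PySpark program, with the intermediate result kept as an RDD instead of being written to HDFS. This is only a minimal sketch; the input path and the threshold are placeholders:

    # Minimal PySpark sketch: two "MapReduce-like" stages chained in one program.
    from pyspark import SparkContext

    sc = SparkContext(appName="word-count-pipeline")

    lines = sc.textFile("hdfs:///tmp/input.txt")            # placeholder input path

    counts = (lines.flatMap(lambda line: line.split())      # "map" phase
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))        # "reduce" phase

    frequent = counts.filter(lambda kv: kv[1] > 100)        # second stage reuses counts directly

    print(frequent.take(10))                                # action: triggers the whole pipeline
    sc.stop()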

2. Is there any point of learning MapReduce, then?

  • MapReduce is a paradigm used by many big data tools, including Spark. So, understanding the MapReduce paradigm and how to convert a problem into a series of MapReduce tasks is very important.
  • Many organizations have already written a lot of code in MapReduce, so it is still needed for legacy reasons.
  • Almost every other tool, such as Hive or Pig, converts its queries into MapReduce phases. If you understand MapReduce, you will be able to optimize your queries better.

3. What are the downsides of Spark?

Spark relies heavily on memory, so in a shared environment it might consume a little more memory for longer durations.

The developer has to be careful. A casual developer might make the following mistakes:

  • She may end up running everything on the local node instead of distributing work over to the cluster.
  • She might hit some external web service too many times, because the same call is issued in parallel from every executor.

The first problem is well tackled by the Hadoop MapReduce paradigm, because it ensures that the data your code is churning at any point of time is fairly small, so you cannot make the mistake of trying to handle the whole data set on a single node.

The second mistake is possible in MapReduce too: while writing MapReduce, a user may hit a service from inside map() or reduce() too many times. This overloading of a service is also possible while using Spark. Both mistakes are sketched below.
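
A hedged sketch of both mistakes, assuming a PySpark application and using a hypothetical web-service client call_service() (stubbed out here); paths are placeholders:

    from pyspark import SparkContext

    sc = SparkContext(appName="common-mistakes")
    records = sc.textFile("hdfs:///tmp/records.txt")        # placeholder path

    # Mistake 1: collect() pulls the whole dataset onto the driver (one machine).
    # bad = records.collect()
    records.saveAsTextFile("hdfs:///tmp/records-out")       # stays distributed instead

    # Mistake 2: one web-service call per record, issued in parallel from every
    # executor, can overload the service; batching per partition bounds the calls.
    def call_service(batch):
        return batch                                        # stand-in for a real client

    def enrich_partition(rows):
        return call_service(list(rows))                     # one call per partition

    enriched = records.mapPartitions(enrich_partition)
    sc.stop()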

4. Explain in brief: what is the architecture of Spark?

At the architecture level, from a macro perspective, Spark looks like this:

Spark Architecture

5) Interactive Shells or Job Submission Layer
4) API Binding: Python, Java, Scala, R, SQL
3) Libraries: MLLib, GraphX, Spark Streaming
2) Spark Core (RDD & Operations on it)
1) Spark Driver -> Executor
0) Scheduler or Resource Manager

0) Scheduler or Resource Manager:

At the bottom is the resource manager. This resource manager can be external, such as YARN or Mesos, or it can be Spark's own internal one if Spark is running in standalone mode. The role of this layer is to provide a playground in which the program can run in a distributed manner. For example, YARN (Yet Another Resource Negotiator) would create an application master and executors for any process.

1) Spark Driver -> Executor:

One level above the scheduler is the actual Spark code that talks to the scheduler to get work executed. This piece of code does the real work of execution. The Spark driver, which would run inside the application master, is part of this layer. The driver dictates what to execute, and the executors execute the logic.

2) Spark Core (RDD & Operations on it):

Spark Core is the layer which provides the maximum functionality. It provides abstractions such as the RDD and the execution of transformations and actions on it.
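
For illustration (assuming a SparkContext named sc, e.g. from the PySpark shell), transformations only build up the lineage; nothing executes until an action is called:

    numbers = sc.parallelize(range(1, 1001))         # create an RDD
    squares = numbers.map(lambda x: x * x)           # transformation: lazy, nothing runs yet
    evens   = squares.filter(lambda x: x % 2 == 0)   # another lazy transformation
    print(evens.count())                             # action: the job actually executes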

3) Libraries: MLlib, GraphX, Spark Streaming, DataFrames:

Additional domain-specific functionality on top of Spark Core is provided by various libraries such as MLlib, Spark Streaming, GraphX, and DataFrames/Spark SQL.

4) API Bindings: The Python, Java, Scala, R, and SQL bindings internally call the same core API from different languages.

5) Interactive Shells or Job Submission Layer:

The job submission APIs provide a way to submit bundled code. This layer also provides the interactive shells (PySpark, SparkR, etc.), also called REPLs (Read-Eval-Print Loops), to process data interactively.
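
A minimal sketch of both paths: a bundled application submitted through spark-submit (the file name my_job.py is just a placeholder), versus the PySpark REPL, where a SparkContext named sc is already created for you:

    # my_job.py -- could be submitted with, for example:
    #   spark-submit --master yarn my_job.py
    from pyspark import SparkContext

    if __name__ == "__main__":
        sc = SparkContext(appName="my-job")
        print(sc.parallelize(range(100)).sum())      # 4950
        sc.stop()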

5. On which platforms can Apache Spark run?

Spark can run on the following platforms:

  • YARN (Hadoop): Since YARN can handle any kind of workload, Spark can run on YARN. There are two modes of execution: one in which the Spark driver is executed inside a container on a node, and another in which the Spark driver is executed on the client machine. This is the most common way of running Spark.
  • Apache Mesos: Mesos is an open-source resource manager, and Spark can run on it.
  • EC2: If you do not want to manage the hardware yourself, you can run Spark on top of Amazon EC2. This makes Spark suitable for a wide range of organizations.
  • Standalone: If you have no resource manager installed in your organization, you can use the standalone mode, in which Spark provides its own resource manager. All you have to do is install Spark on all nodes of the cluster, inform each node about all the other nodes, and start the cluster; the nodes then start communicating with each other and running jobs. A sketch of the corresponding master URLs follows this list.
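
The platform is usually selected through the master URL passed to spark-submit, but as a sketch it can also be set in code (hostnames and ports below are illustrative):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("platform-demo")
            .setMaster("local[*]"))                    # single machine, all cores
    # .setMaster("yarn")                               # YARN (Hadoop)
    # .setMaster("mesos://mesos-master:5050")          # Apache Mesos
    # .setMaster("spark://spark-master:7077")          # Spark standalone (including on EC2)

    sc = SparkContext(conf=conf)
    print(sc.master)
    sc.stop()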

6. What are the various programming languages supported by Spark?

Though Spark itself is written in Scala, it lets users code in various languages, such as:

  • Scala
  • Java
  • Python
  • R (Using SparkR)
  • SQL (Using SparkSQL)

Also, by piping the data through external commands, we can effectively use all kinds of programming languages or binaries, as sketched below.
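
For instance, RDD.pipe() streams each element through an external command that must be available on every worker node (here the Unix tr utility, assuming a SparkContext sc from the shell):

    words = sc.parallelize(["spark", "hadoop", "mapreduce"])
    upper = words.pipe("tr 'a-z' 'A-Z'")       # every element passes through the external binary
    print(upper.collect())                     # ['SPARK', 'HADOOP', 'MAPREDUCE']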

7. What are the various modes in which Spark runs on YARN? (Local vs Client vs Cluster Mode)

Apache Spark has two basic parts:

  1. Spark Driver: controls what to execute and where
  2. Executor: actually executes the logic

While running Spark on YARN, the executors always run inside containers, but the driver can run either on the machine the user is using or inside one of the containers. The first is known as YARN client mode, the second as cluster mode. The following descriptions should give you a good idea:

YARN client mode: The driver runs on the machine from which the client submits the job.

YARN cluster mode: The driver runs inside the cluster.

Local mode: This is only for the case when you do not want to use a cluster and instead want to run everything on a single machine. The driver application and the Spark application both run on the same machine as the user. A small sketch follows.
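
A sketch of local mode, with the YARN modes shown as the spark-submit invocations through which they are normally chosen (my_job.py is a placeholder):

    # Local mode: driver and execution all happen in a single process on this machine.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("local-demo")
             .getOrCreate())
    print(spark.sparkContext.master)           # local[*]
    spark.stop()

    # On YARN, the mode is picked at submission time:
    #   spark-submit --master yarn --deploy-mode client  my_job.py   # driver on your machine
    #   spark-submit --master yarn --deploy-mode cluster my_job.py   # driver inside the cluster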

8. What are the various storages from which Spark can read data?

Spark has been designed to process data from a variety of sources, whether the data is stored in HDFS, Cassandra, Amazon S3, Hive, HBase, or Alluxio (previously Tachyon). It can also read data from any system that exposes a Hadoop-compatible data source.
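
A few illustrative reads (paths and table names are placeholders; sources such as Cassandra or HBase additionally require their connector packages on the classpath):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sources-demo").getOrCreate()

    hdfs_rdd = spark.sparkContext.textFile("hdfs:///data/events/part-*")
    local_df = spark.read.json("file:///tmp/sample.json")
    # s3_df  = spark.read.parquet("s3a://my-bucket/events/")          # needs the S3A connector
    # cas_df = (spark.read.format("org.apache.spark.sql.cassandra")   # needs spark-cassandra-connector
    #                .options(keyspace="ks", table="events").load())
    spark.stop()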

9. While processing data from HDFS, does Spark execute the code near the data?

Yes, in most cases it does. Spark tries to place executors on, or close to, the nodes that hold the data, and schedules tasks with data locality in mind.

10. What are the various libraries available on top of Apache Spark?

  • MLlib: It is the machine learning library provided by Spark. It has algorithms that are internally wired to use Spark Core (RDD operations) and the data structures they require. For example, it provides ways to represent matrices as RDDs and to express recommendation algorithms as sequences of transformations and actions. MLlib thus provides machine learning algorithms that can run in parallel on many computers.
  • GraphX: GraphX provides libraries which help in manipulating huge graph data structures. It represents graphs as RDDs internally, and various algorithms such as PageRank are converted into operations on those RDDs.
  • Spark Streaming: It is a fairly simple library that listens on unbounded data sets, i.e. data sets into which data keeps flowing continuously. If the source is not providing data, the processing pauses and waits for data to arrive. The library converts the incoming stream into RDDs, each holding the data collected over an interval of “n” seconds (a micro-batch), and then runs the provided operations on those RDDs, as in the sketch after this list.
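
A hedged Spark Streaming sketch: text arriving on a socket (host and port are placeholders) is collected into 5-second micro-batches, each exposed to the user code as an RDD:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc  = SparkContext(appName="streaming-demo")
    ssc = StreamingContext(sc, 5)                        # n = 5 seconds per batch

    lines  = ssc.socketTextStream("localhost", 9999)     # unbounded input stream
    counts = (lines.flatMap(lambda l: l.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))     # applied to every batch's RDD
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()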
