Spark Interview Questions - Part 3

1. What happens to an RDD when one of the nodes on which it is distributed goes down?

Spark knows how to prepare a given dataset because it tracks its lineage, the series of transformations that led to it. When a node goes down, Spark can apply the same transformations again to the source data to recompute the lost partitions of that node.
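A minimal way to see this lineage is toDebugString, which prints the chain of transformations Spark would replay to rebuild a lost partition. This sketch assumes sc is the SparkContext provided by spark-shell, and the input path is hypothetical:

```scala
// Assumes spark-shell, where sc is the SparkContext.
// The input path is hypothetical.
val lines  = sc.textFile("hdfs:///data/input.txt")
val words  = lines.flatMap(_.split(" "))
val pairs  = words.map(w => (w, 1))
val counts = pairs.reduceByKey(_ + _)

// Prints the lineage (the DAG of transformations) that Spark
// uses to recompute lost partitions after a node failure.
println(counts.toDebugString)
```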

2. How do you save an RDD?

Spark provides a few methods (a short sketch follows the list):

  • saveAsTextFile: Writes the elements of the RDD as a text file (or set of text files) to the given directory. The directory can be in the local filesystem, HDFS, or any other filesystem. Each element of the dataset is converted to text by calling toString() on it, and each element is written on its own line, followed by the newline character “\n”.
  • saveAsSequenceFile: Writes the elements of the dataset as a Hadoop SequenceFile. This works only on RDDs of key-value pairs whose types implement Hadoop’s Writable interface. You can load a sequence file back using sc.sequenceFile().
  • saveAsObjectFile: Saves the data by serializing it with standard Java object serialization. It can be loaded back using sc.objectFile().
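A short sketch of the first and last of these methods (paths are hypothetical; sc is the spark-shell SparkContext):

```scala
val nums = sc.parallelize(1 to 100)

// Save as plain text: each element is written via toString,
// one element per line, as a set of part-files under the directory.
nums.saveAsTextFile("hdfs:///tmp/nums-text")

// Save using standard Java object serialization...
nums.saveAsObjectFile("hdfs:///tmp/nums-obj")

// ...and load it back, specifying the element type.
val restored = sc.objectFile[Int]("hdfs:///tmp/nums-obj")
```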

3. What do we mean by Parquet?

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem. It is a space-efficient format that can be used from any programming language or framework.

Apache Spark supports reading and writing data in the Parquet format.
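A minimal sketch of reading and writing Parquet with the DataFrame API (paths are hypothetical; spark is the SparkSession provided by spark-shell in Spark 2.x and later):

```scala
// Assumes spark is an existing SparkSession (as in spark-shell).
import spark.implicits._

val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

// Write the DataFrame in Parquet format (columnar, compressed).
df.write.parquet("hdfs:///tmp/users.parquet")

// Read it back; the schema is stored in the Parquet files themselves.
val loaded = spark.read.parquet("hdfs:///tmp/users.parquet")
loaded.show()
```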

4. What is meant by a Columnar Storage Format?

When serializing tabular or structured data into a stream of bytes, we can store it either row-wise or column-wise.

In row-wise storage, we store the first row, then the second row, and so on. In column-wise storage, we store the first column, then the second column, and so on.
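A plain-Scala illustration of the two layouts for a hypothetical two-column table (not Spark-specific; it only shows the ordering of values):

```scala
// A tiny two-column table: (id, name).
val rows = Seq((1, "a"), (2, "b"), (3, "c"))

// Row-wise: all values of a row are stored together.
val rowWise = rows.flatMap { case (id, name) => Seq(id.toString, name) }
// => Seq("1", "a", "2", "b", "3", "c")

// Column-wise: all values of a column are stored together.
val columnWise = rows.map(_._1.toString) ++ rows.map(_._2)
// => Seq("1", "2", "3", "a", "b", "c")
```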

[Figure: Columnar Storage Format]

5. When creating an RDD, what goes on internally?

There are two ways to create an RDD: by loading data from a source, or by operating on an existing RDD. An action then causes the computation on an RDD to yield a result. The diagram below shows the relationship between RDDs, transformations, actions, and the resulting value.

[Figure: Working with RDDs]
  • While loading data from a source – When an RDD is prepared by loading data from some source (HDFS, Cassandra, in-memory), the machines that sit closest to the data are assigned to create the partitions. These partitions hold mappings or pointers to the actual data rather than the data itself.
  • By converting an in-memory array of objects – An in-memory collection can be converted to an RDD using parallelize. In this case, the partitions hold the actual data instead of pointers or mappings.
  • By operating on an existing RDD – An RDD is immutable; we can’t change an existing RDD. We can only form a new RDD by operating on the previous one. These operations are called transformations. A transformation may also cause shuffling, that is, moving data across the nodes. Some operations that do not cause shuffling: map, flatMap, and filter. Examples of operations that can result in shuffling: groupByKey, repartition, sortByKey, aggregateByKey, reduceByKey, and distinct.

Spark maintains the relationships between RDDs in the form of a DAG (Directed Acyclic Graph). When an action such as reduce() or saveAsTextFile() is called, the whole graph is evaluated and the result is returned to the driver or saved to a location such as HDFS, as in the sketch below.
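A sketch tying these together: creating RDDs both ways, a non-shuffling and a shuffling transformation, and an action that triggers evaluation of the whole DAG (path hypothetical; sc from spark-shell):

```scala
// 1. From a source: partitions hold pointers to HDFS blocks.
val fromFile = sc.textFile("hdfs:///data/events.txt")

// 2. From an in-memory collection: partitions hold the data itself.
val fromMemory = sc.parallelize(Seq("a", "b", "a", "c"))

// Transformations build new RDDs; map does not shuffle...
val pairs = fromMemory.map(word => (word, 1))

// ...while reduceByKey may move data across nodes (shuffle).
val counts = pairs.reduceByKey(_ + _)

// Nothing has executed yet. The action evaluates the whole DAG.
counts.collect().foreach(println)
```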

6. What do we mean by Partitions or slices?

Partitions (earlier also known as slices) are the parts into which an RDD is divided. Each partition is generally located on a different machine, and Spark runs one task per partition during a computation.

If you are loading data from HDFS using textFile(), Spark creates one partition per HDFS block (typically 64 MB, or 128 MB in newer Hadoop versions). You can request more partitions by passing the desired minimum number of partitions as the second argument to textFile().

If you are creating an RDD from an in-memory collection using sc.parallelize(), you can enforce a particular number of partitions by passing it as the second argument.

You can change the number of partitions later using repartition().

If you want an operation to consume a whole partition at a time, you can use mapPartitions(), as in the sketch below.
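A sketch of controlling and inspecting partitions (path hypothetical; sc from spark-shell):

```scala
// Ask for at least 8 partitions when loading from HDFS.
val logs = sc.textFile("hdfs:///data/logs.txt", 8)
println(logs.getNumPartitions)

// Enforce 4 partitions when parallelizing an in-memory collection.
val nums = sc.parallelize(1 to 1000, 4)

// Change the number of partitions later.
val wider = nums.repartition(16)

// Consume a whole partition at a time, e.g. to compute one sum per partition.
val perPartitionSums = wider.mapPartitions(iter => Iterator(iter.sum))
println(perPartitionSums.collect().mkString(", "))
```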

7. What is meant by Transformation? Give some examples.

Transformations are functions applied to an RDD (Resilient Distributed Dataset) that produce another RDD. A transformation is not executed until an action follows it; transformations are lazy.

Some examples of transformation are:

  1. map() – applies the function passed to it to each element of the RDD, resulting in a new RDD.
  2. filter() – creates a new RDD by picking those elements of the current RDD for which the function provided as an argument returns true.
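A sketch showing that transformations are lazy; nothing runs until the action at the end (sc from spark-shell):

```scala
val nums = sc.parallelize(1 to 10)

// Transformations only describe the computation; no work happens yet.
val doubled = nums.map(_ * 2)
val bigOnes = doubled.filter(_ > 10)

// The action triggers execution of the whole chain.
println(bigOnes.collect().mkString(", "))  // 12, 14, 16, 18, 20
```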

8. What does map transformation do? Provide an example.

The map transformation produces a new RDD by translating each element of the source RDD: it applies the function provided by the user to every element.
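For instance, a minimal sketch mapping words to their lengths (sc from spark-shell):

```scala
// Translate each element: here, each word becomes its length.
val words   = sc.parallelize(Seq("spark", "rdd", "map"))
val lengths = words.map(_.length)
println(lengths.collect().mkString(", "))  // 5, 3, 3
```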

9. What is the difference between map and flatMap?

Both map and flatMap apply a function to each element of an RDD. The difference is that the function passed to map must return exactly one value, while the function passed to flatMap can return a list of values.

So, flatMap can convert one element into multiple elements of the resulting RDD, while map always results in an equal number of elements.

So, if we load an RDD from a text file, each element is a line (say, a sentence). To convert this RDD into an RDD of words, we apply, via flatMap, a function that splits a string into an array of words. If we only need to clean up each sentence or change its case, we use map instead of flatMap. See the diagram and the sketch below.

[Figure: map vs. flatMap]
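A sketch of the sentence-to-words example (sc from spark-shell):

```scala
val lines = sc.parallelize(Seq("spark is fast", "rdds are immutable"))

// flatMap: one line can become many words.
val words = lines.flatMap(_.split(" "))
println(words.count())    // 6

// map: one line stays one element (here, upper-cased).
val shouted = lines.map(_.toUpperCase)
println(shouted.count())  // 2
```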

10. What are Actions? Give some examples.

An action brings data from the RDD back to the local machine (the driver). Executing an action triggers the execution of all the previously defined transformations. Examples of actions are:

  • reduce() – applies the function passed to it to pairs of elements again and again until only one value is left. The function should take two arguments and return one value.
  • take(n) – brings the first n elements of the RDD back to the local node. (To bring back all the values, use collect().)
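A sketch of both actions (sc from spark-shell):

```scala
val nums = sc.parallelize(1 to 100)

// reduce: combine pairs of elements until one value remains.
val total = nums.reduce(_ + _)
println(total)                         // 5050

// take(n): bring only the first n elements back to the driver.
println(nums.take(5).mkString(", "))   // 1, 2, 3, 4, 5
```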
