Spark Interview Questions - Part 3

1. What happens to an RDD when one of the nodes on which it is distributed goes down?

Spark knows how to prepare a given dataset because it tracks its lineage, the series of transformations that led to it. When a node goes down, Spark can apply the same transformations again to the source data to recompute the lost partitions of that node.
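A minimal way to see this lineage is toDebugString, which prints the chain of transformations Spark would replay to rebuild a lost partition. This sketch assumes sc is the SparkContext provided by spark-shell, and the input path is hypothetical:

```scala
// Assumes spark-shell, where sc is the SparkContext.
// The input path is hypothetical.
val lines  = sc.textFile("hdfs:///data/input.txt")
val words  = lines.flatMap(_.split(" "))
val pairs  = words.map(w => (w, 1))
val counts = pairs.reduceByKey(_ + _)

// Prints the lineage (the DAG of transformations) that Spark
// uses to recompute lost partitions after a node failure.
println(counts.toDebugString)
```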

2. How do you save an RDD?

Spark provides a few methods (a short sketch follows the list):

  • saveAsTextFile: Writes the elements of the RDD as a text file (or set of text files) to the given directory. The directory can be in the local filesystem, HDFS, or any other filesystem. Each element of the dataset is converted to text by calling toString() on it, and each element is written on its own line, followed by the newline character “\n”.
  • saveAsSequenceFile: Writes the elements of the dataset as a Hadoop SequenceFile. This works only on RDDs of key-value pairs whose types implement Hadoop’s Writable interface. You can load a sequence file back using sc.sequenceFile().
  • saveAsObjectFile: Saves the data by serializing it with standard Java object serialization. It can be loaded back using sc.objectFile().
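A short sketch of the first and last of these methods (paths are hypothetical; sc is the spark-shell SparkContext):

```scala
val nums = sc.parallelize(1 to 100)

// Save as plain text: each element is written via toString,
// one element per line, as a set of part-files under the directory.
nums.saveAsTextFile("hdfs:///tmp/nums-text")

// Save using standard Java object serialization...
nums.saveAsObjectFile("hdfs:///tmp/nums-obj")

// ...and load it back, specifying the element type.
val restored = sc.objectFile[Int]("hdfs:///tmp/nums-obj")
```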

3. What do we mean by Parquet?

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem. It is a space-efficient format that can be used from any programming language or framework.

Apache Spark supports reading and writing data in the Parquet format.
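A minimal sketch of reading and writing Parquet with the DataFrame API (paths are hypothetical; spark is the SparkSession provided by spark-shell in Spark 2.x and later):

```scala
// Assumes spark is an existing SparkSession (as in spark-shell).
import spark.implicits._

val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

// Write the DataFrame in Parquet format (columnar, compressed).
df.write.parquet("hdfs:///tmp/users.parquet")

// Read it back; the schema is stored in the Parquet files themselves.
val loaded = spark.read.parquet("hdfs:///tmp/users.parquet")
loaded.show()
```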

4. What is meant by a Columnar Storage Format?

When serializing tabular or structured data into a stream of bytes, we can store it either row-wise or column-wise.

In row-wise storage, we store the first row, then the second row, and so on. In column-wise storage, we store the first column, then the second column, and so on.
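A plain-Scala illustration of the two layouts for a hypothetical two-column table (not Spark-specific; it only shows the ordering of values):

```scala
// A tiny two-column table: (id, name).
val rows = Seq((1, "a"), (2, "b"), (3, "c"))

// Row-wise: all values of a row are stored together.
val rowWise = rows.flatMap { case (id, name) => Seq(id.toString, name) }
// => Seq("1", "a", "2", "b", "3", "c")

// Column-wise: all values of a column are stored together.
val columnWise = rows.map(_._1.toString) ++ rows.map(_._2)
// => Seq("1", "2", "3", "a", "b", "c")
```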

[Figure: Columnar Storage Format]

5. When creating an RDD, what goes on internally?

There are two ways to create an RDD: by loading data from a source, or by operating on an existing RDD. An action then causes the computation on an RDD to yield a result. The diagram below shows the relationship between RDDs, transformations, actions, and the resulting value.

[Figure: Working with RDDs]
  • While loading data from a source – When an RDD is prepared by loading data from some source (HDFS, Cassandra, in-memory), the machines that sit closest to the data are assigned to create the partitions. These partitions hold mappings or pointers to the actual data rather than the data itself.
  • By converting an in-memory array of objects – An in-memory collection can be converted to an RDD using parallelize. In this case, the partitions hold the actual data instead of pointers or mappings.
  • By operating on an existing RDD – An RDD is immutable; we can’t change an existing RDD. We can only form a new RDD by operating on the previous one. These operations are called transformations. A transformation may also cause shuffling, that is, moving data across the nodes. Some operations that do not cause shuffling: map, flatMap, and filter. Examples of operations that can result in shuffling: groupByKey, repartition, sortByKey, aggregateByKey, reduceByKey, and distinct.

Spark maintains the relationships between RDDs in the form of a DAG (Directed Acyclic Graph). When an action such as reduce() or saveAsTextFile() is called, the whole graph is evaluated and the result is returned to the driver or saved to a location such as HDFS, as in the sketch below.
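A sketch tying these together: creating RDDs both ways, a non-shuffling and a shuffling transformation, and an action that triggers evaluation of the whole DAG (path hypothetical; sc from spark-shell):

```scala
// 1. From a source: partitions hold pointers to HDFS blocks.
val fromFile = sc.textFile("hdfs:///data/events.txt")

// 2. From an in-memory collection: partitions hold the data itself.
val fromMemory = sc.parallelize(Seq("a", "b", "a", "c"))

// Transformations build new RDDs; map does not shuffle...
val pairs = fromMemory.map(word => (word, 1))

// ...while reduceByKey may move data across nodes (shuffle).
val counts = pairs.reduceByKey(_ + _)

// Nothing has executed yet. The action evaluates the whole DAG.
counts.collect().foreach(println)
```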

6. What do we mean by Partitions or slices?

Partitions (earlier also known as slices) are the parts into which an RDD is divided. Each partition is generally located on a different machine, and Spark runs one task per partition during a computation.

If you are loading data from HDFS using textFile(), Spark creates one partition per HDFS block (typically 64 MB, or 128 MB in newer Hadoop versions). You can request more partitions by passing the desired minimum number of partitions as the second argument to textFile().

If you are creating an RDD from an in-memory collection using sc.parallelize(), you can enforce a particular number of partitions by passing it as the second argument.

You can change the number of partitions later using repartition().

If you want an operation to consume a whole partition at a time, you can use mapPartitions(), as in the sketch below.
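A sketch of controlling and inspecting partitions (path hypothetical; sc from spark-shell):

```scala
// Ask for at least 8 partitions when loading from HDFS.
val logs = sc.textFile("hdfs:///data/logs.txt", 8)
println(logs.getNumPartitions)

// Enforce 4 partitions when parallelizing an in-memory collection.
val nums = sc.parallelize(1 to 1000, 4)

// Change the number of partitions later.
val wider = nums.repartition(16)

// Consume a whole partition at a time, e.g. to compute one sum per partition.
val perPartitionSums = wider.mapPartitions(iter => Iterator(iter.sum))
println(perPartitionSums.collect().mkString(", "))
```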

7. What is meant by Transformation? Give some examples.

Transformations are functions applied to an RDD (Resilient Distributed Dataset) that produce another RDD. A transformation is not executed until an action follows it; transformations are lazy.

Some examples of transformation are:

  1. map() – applies the function passed to it to each element of the RDD, resulting in a new RDD.
  2. filter() – creates a new RDD by picking those elements of the current RDD for which the function provided as an argument returns true.
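A sketch showing that transformations are lazy; nothing runs until the action at the end (sc from spark-shell):

```scala
val nums = sc.parallelize(1 to 10)

// Transformations only describe the computation; no work happens yet.
val doubled = nums.map(_ * 2)
val bigOnes = doubled.filter(_ > 10)

// The action triggers execution of the whole chain.
println(bigOnes.collect().mkString(", "))  // 12, 14, 16, 18, 20
```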

8. What does map transformation do? Provide an example.

The map transformation produces a new RDD by translating each element of the source RDD: it applies the function provided by the user to every element.
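For instance, a minimal sketch mapping words to their lengths (sc from spark-shell):

```scala
// Translate each element: here, each word becomes its length.
val words   = sc.parallelize(Seq("spark", "rdd", "map"))
val lengths = words.map(_.length)
println(lengths.collect().mkString(", "))  // 5, 3, 3
```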

9. What is the difference between map and flatMap?

Both map and flatMap apply a function to each element of an RDD. The difference is that the function passed to map must return exactly one value, while the function passed to flatMap can return a list of values.

So, flatMap can convert one element into multiple elements of the resulting RDD, while map always results in an equal number of elements.

So, if we load an RDD from a text file, each element is a line (say, a sentence). To convert this RDD into an RDD of words, we apply, via flatMap, a function that splits a string into an array of words. If we only need to clean up each sentence or change its case, we use map instead of flatMap. See the diagram and the sketch below.

[Figure: map vs. flatMap]
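A sketch of the sentence-to-words example (sc from spark-shell):

```scala
val lines = sc.parallelize(Seq("spark is fast", "rdds are immutable"))

// flatMap: one line can become many words.
val words = lines.flatMap(_.split(" "))
println(words.count())    // 6

// map: one line stays one element (here, upper-cased).
val shouted = lines.map(_.toUpperCase)
println(shouted.count())  // 2
```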

10. What are Actions? Give some examples.

An action brings data from the RDD back to the local machine (the driver). Executing an action triggers the execution of all the previously defined transformations. Examples of actions are:

  • reduce() – applies the function passed to it to pairs of elements again and again until only one value is left. The function should take two arguments and return one value.
  • take(n) – brings the first n elements of the RDD back to the local node. (To bring back all the values, use collect().)
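A sketch of both actions (sc from spark-shell):

```scala
val nums = sc.parallelize(1 to 100)

// reduce: combine pairs of elements until one value remains.
val total = nums.reduce(_ + _)
println(total)                         // 5050

// take(n): bring only the first n elements back to the driver.
println(nums.take(5).mkString(", "))   // 1, 2, 3, 4, 5
```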
