Spark interview questions - part 5


1. What is Immutable?

Ans: Once a value is created and assigned, it cannot be changed; this property is called immutability. Spark RDDs are immutable by default and do not allow updates or modifications. Note that the variable holding an RDD can be pointed at a new RDD, but the data inside an RDD cannot be modified.
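
A minimal sketch of this (assuming a SparkContext named `sc`, as provided by spark-shell): a transformation never modifies the original RDD, it returns a new one.

```scala
// Assumes `sc` is an existing SparkContext (e.g. provided by spark-shell).
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// map() does not change `numbers`; it returns a brand-new RDD.
val doubled = numbers.map(_ * 2)

println(numbers.collect().mkString(","))  // 1,2,3,4,5 (unchanged)
println(doubled.collect().mkString(","))  // 2,4,6,8,10
```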

2. What is Distributed?

Ans: An RDD's data is automatically partitioned and distributed across the different nodes of the cluster so that it can be processed in parallel.

3. What is Lazy evaluated?

Ans: Spark does not evaluate an operation the moment it is declared. Transformations in particular are lazy: they are only recorded, and the actual computation is triggered when an action is called.
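
A small illustration of this laziness, assuming a SparkContext `sc` and a hypothetical file path:

```scala
// Assumes `sc` is an existing SparkContext; "data.txt" is a hypothetical path.
val lines  = sc.textFile("data.txt")            // nothing is read yet
val errors = lines.filter(_.contains("ERROR"))  // still nothing runs

// Only this action triggers the file read and the filtering.
val errorCount = errors.count()
println(s"Errors: $errorCount")
```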

4. What is Spark engine's responsibility?

Ans: The Spark engine is responsible for scheduling, distributing, and monitoring the application across the cluster.

5. What are common Spark Ecosystems?

Ans: 

  • Spark SQL (Shark) for SQL developers,
  • Spark Streaming for streaming data,
  • MLlib for machine learning algorithms,
  • GraphX for graph computation,
  • SparkR to run R on the Spark engine,
  • BlinkDB for interactive queries over massive data.

These are the common Spark ecosystem components. GraphX, SparkR, and BlinkDB are in the incubation stage.

6. What are Partitions?

Ans: A partition is a logical division of the data; the idea is derived from the MapReduce split. Data is divided into small chunks so it can be processed in parallel, which supports scalability and speeds up processing. Input data, intermediate data, and output data are all partitioned RDDs.

7.  How does spark partition the data?

Ans: Spark uses the MapReduce InputFormat API to partition the data, and the InputFormat determines the number of partitions. By default, the HDFS block size is the partition size (for best performance), but it is possible to change the partition size, just as you can change the split size.
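
For illustration, a sketch (assuming a SparkContext `sc` and a hypothetical HDFS path) of requesting a minimum number of partitions and changing the partition count afterwards:

```scala
// Assumes `sc` is an existing SparkContext; the path is hypothetical.
val logs = sc.textFile("hdfs:///data/logs", minPartitions = 8)
println(logs.getNumPartitions)   // at least 8

// repartition() / coalesce() change the partition count of an existing RDD.
val fewer = logs.coalesce(4)
val more  = logs.repartition(16)
println(fewer.getNumPartitions)  // 4
println(more.getNumPartitions)   // 16
```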

8. How does Spark store the data?

Ans: Spark is a processing engine; it has no storage engine of its own. It can retrieve data from any storage system, such as HDFS, S3, and other data sources.

9. Is it mandatory to start Hadoop to run the spark application?

Ans: No, it is not mandatory. Spark has no storage of its own, so it can use the local file system to store data. You can load data from the local file system and process it; Hadoop or HDFS is not required to run a Spark application.
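
For example, a minimal sketch in local mode (assuming spark-shell was started with a local master; the local file path is hypothetical), with no Hadoop installation at all:

```scala
// Assumes `sc` is an existing SparkContext (e.g. spark-shell --master "local[*]").
// The file:// path is hypothetical; no Hadoop/HDFS is involved.
val localData = sc.textFile("file:///tmp/sample.txt")
println(s"Lines read from local disk: ${localData.count()}")
```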

10. What is SparkContext?

Ans: SparkContext is the entry point of a Spark application; it tells Spark how to access the cluster. When a programmer creates RDDs, it is the SparkContext that connects to the Spark cluster. A SparkConf object, which holds the application's configuration, is the key ingredient for creating a SparkContext.
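
A sketch of the usual pattern (the application name and master URL are illustrative choices):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// SparkConf holds the application's configuration;
// SparkContext uses it to connect to the cluster (here: local mode).
val conf = new SparkConf()
  .setAppName("MyFirstSparkApp")   // hypothetical application name
  .setMaster("local[2]")           // or e.g. "yarn", "spark://host:7077"

val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 10)  // RDDs are created through the SparkContext
println(rdd.sum())                 // 55.0

sc.stop()
```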

11. What are SparkCore functionalities?

Ans: Spark Core is the base engine of the Apache Spark framework. Memory management, fault tolerance, scheduling and monitoring jobs, and interacting with storage systems are its primary functionalities.

12. How SparkSQL is different from HQL and SQL?

Ans: Spark SQL is a component on top of the Spark Core engine that supports SQL and the Hive Query Language (HQL) without requiring any syntax changes. It is possible to join a SQL table with an HQL (Hive) table.
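
A hedged sketch of what that looks like with the newer SparkSession entry point (the table, column, and file names are hypothetical; enableHiveSupport() requires a Hive-enabled build):

```scala
import org.apache.spark.sql.SparkSession

// enableHiveSupport() lets Spark SQL read existing Hive (HQL) tables.
val spark = SparkSession.builder()
  .appName("SqlAndHql")
  .enableHiveSupport()
  .getOrCreate()

// A DataFrame registered as a temporary SQL view...
val orders = spark.read.json("orders.json")      // hypothetical input
orders.createOrReplaceTempView("orders")

// ...can be joined directly with a Hive table in a single query.
val joined = spark.sql(
  """SELECT o.id, c.name
    |FROM orders o
    |JOIN customers_hive c ON o.customer_id = c.id""".stripMargin)
joined.show()
```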

13. How Spark Streaming API works?

Ans: The programmer sets a batch interval in the configuration; whatever data arrives in Spark within that interval is grouped into a batch. The input stream (DStream) goes into Spark Streaming, and the framework breaks it up into small chunks called batches, which are then fed into the Spark engine for processing.

Spark Streaming passes those batches to the core engine, which generates the final results, also in the form of streaming batches. The output is likewise produced in batches. This allows streaming data and batch data to be processed with the same engine.
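
A sketch of the classic DStream API (the 5-second batch interval, host, and port are illustrative assumptions):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))    // batch interval = 5 seconds

val lines  = ssc.socketTextStream("localhost", 9999) // input DStream
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                       // output, one result per batch

ssc.start()             // start receiving data and processing batches
ssc.awaitTermination()
```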

14. What is Spark MLlib?

Ans: Mahout is a machine learning library for Hadoop; similarly, MLlib is Spark's machine learning library. MLlib provides different algorithms that scale out on the cluster for data processing. Most data scientists use this MLlib library.
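
As a small illustration, a sketch using the RDD-based MLlib API (assuming a SparkContext `sc`; the toy data and k = 2 are arbitrary):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Four 2-D points that form two obvious clusters.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
))

val model = KMeans.train(points, k = 2, maxIterations = 10)
model.clusterCenters.foreach(println)   // the two learned cluster centres
```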

15. Why Partitions are immutable?

Ans: Every transformation generates a new partition. Partitions use the HDFS API, so they are immutable, distributed, and fault-tolerant. Partitions are also aware of data locality.

16. What is Transformation in spark?

Ans: Spark provides two kinds of operations on RDDs: transformations and actions. A transformation is a lazy operation; it holds off on computing until an action is called. Each transformation generates/returns a new RDD. Common Spark transformations include map, flatMap, groupByKey, reduceByKey, filter, cogroup, join, sortByKey, union, distinct, and sample.
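
A short sketch (assuming a SparkContext `sc`) chaining a few of these transformations; each step returns a new RDD and nothing is computed until an action runs:

```scala
// Assumes `sc` is an existing SparkContext.
val nums    = sc.parallelize(1 to 10)
val evens   = nums.filter(_ % 2 == 0)        // filter
val squares = evens.map(n => n * n)          // map
val pairs   = squares.map(n => (n % 3, n))   // build key-value pairs
val byKey   = pairs.reduceByKey(_ + _)       // reduceByKey

println(byKey.collect().mkString(", "))      // collect() is the action that runs the chain
```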

17. What is Action in Spark?

Ans: Actions are RDD operations that return a value back to the Spark driver program; they kick off a job to execute on the cluster. A transformation's output is the input of an action. Common actions in Apache Spark include reduce, collect, takeSample, take, first, saveAsTextFile, saveAsSequenceFile, countByKey, and foreach.
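
A few of these actions in a sketch (assuming a SparkContext `sc`; the output path is hypothetical):

```scala
// Assumes `sc` is an existing SparkContext.
val nums = sc.parallelize(1 to 5)

println(nums.count())                 // 5
println(nums.first())                 // 1
println(nums.take(3).mkString(","))   // 1,2,3
println(nums.reduce(_ + _))           // 15
nums.saveAsTextFile("/tmp/nums-out")  // hypothetical output directory
```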

18. What is RDD Lineage?

Ans: Lineage is the mechanism an RDD uses to reconstruct lost partitions. Spark does not replicate the data in memory; if data is lost, the RDD uses its lineage to rebuild the lost partitions. Each RDD remembers how it was built from other datasets.
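
You can inspect an RDD's lineage with toDebugString, as in this sketch (assuming a SparkContext `sc`):

```scala
// Each transformation adds a step to the lineage graph.
val base     = sc.parallelize(1 to 100)
val filtered = base.filter(_ % 2 == 0)
val mapped   = filtered.map(_ * 10)

// Prints the lineage Spark would use to recompute lost partitions.
println(mapped.toDebugString)
```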

19. What is Map and flatMap in Spark?

Ans: map processes each input item (for example, a line or row) and produces exactly one output item. With flatMap, each input item can be mapped to zero or more output items (so the function should return a Seq rather than a single item), and the results are flattened into a single RDD. flatMap is most frequently used to flatten collections such as arrays of elements.
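
The classic word-splitting sketch (assuming a SparkContext `sc`) makes the difference clear:

```scala
val lines = sc.parallelize(Seq("hello world", "hi spark"))

// map: exactly one output element per input element (here, an Array per line).
val arrays = lines.map(_.split(" "))
println(arrays.count())                 // 2

// flatMap: each input can produce many outputs, flattened into one RDD of words.
val words = lines.flatMap(_.split(" "))
println(words.collect().mkString(","))  // hello,world,hi,spark
```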

20. What are broadcast variables?

Ans: Broadcast variables let the programmer keep a read-only variable cached on each machine, rather than shipping a copy of it with tasks. Spark supports two types of shared variables: broadcast variables (like the Hadoop distributed cache) and accumulators (like Hadoop counters). Broadcast variables are stored as array buffers, which send read-only values to the worker nodes.
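
A sketch of broadcasting a small lookup table (assuming a SparkContext `sc`; the table contents are illustrative):

```scala
// The lookup map is shipped to each executor once, not with every task.
val countryNames = Map("IN" -> "India", "US" -> "United States")
val bcNames = sc.broadcast(countryNames)

val codes    = sc.parallelize(Seq("IN", "US", "IN"))
val resolved = codes.map(code => bcNames.value.getOrElse(code, "Unknown"))
println(resolved.collect().mkString(", "))  // India, United States, India
```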

21. What are Accumulators in Spark?

Ans: Accumulators can be thought of as Spark's offline debuggers. Spark accumulators are similar to Hadoop counters: you can use them to count the number of events and see what is happening during a job. Only the driver program can read an accumulator's value, not the tasks.
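
A sketch of counting malformed records with an accumulator (assuming a SparkContext `sc`; the data is illustrative):

```scala
// Spark 2.x+ accumulator API: created on the driver, written to by tasks.
val badRecords = sc.longAccumulator("badRecords")

val records = sc.parallelize(Seq("1", "2", "oops", "4"))
val parsed = records.flatMap { r =>
  if (r.nonEmpty && r.forall(_.isDigit)) Some(r.toInt)
  else { badRecords.add(1L); None }           // tasks can only add to it
}

parsed.count()                                // the action triggers the job
println(s"Bad records: ${badRecords.value}")  // only the driver reads the value
```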

22. How RDD persist the data?

Ans: There are two methods to persist the data: persist() and cache(). cache() keeps the data in memory only (it is equivalent to persist with the default MEMORY_ONLY level), while persist() lets you choose a storage level such as MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, and many more. Which method and which storage level to use depends on the task.
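
A short sketch (assuming a SparkContext `sc`; the paths are hypothetical):

```scala
import org.apache.spark.storage.StorageLevel

val clicks = sc.textFile("hdfs:///data/clicks").cache()   // same as MEMORY_ONLY
val views  = sc.textFile("hdfs:///data/views")
  .persist(StorageLevel.MEMORY_AND_DISK)                  // spills to disk if memory is tight

println(clicks.count())   // the first action materializes and stores the data
println(views.count())    // later actions reuse the persisted data
```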

23. When do you use apache spark? OR  What are the benefits of Spark over Mapreduce?

Ans: 

  • Spark is really fast. As per its claims, it runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. It aptly utilizes RAM to produce faster results.
  • In the MapReduce paradigm, you write many MapReduce tasks and then tie them together using Oozie or shell scripts. This mechanism is very time-consuming, and MapReduce tasks have heavy latency.
  • Quite often, translating the output of one MR job into the input of another MR job requires writing additional code, because Oozie may not suffice.
  • In Spark, you can basically do everything from a single application/console (the pyspark or Scala console) and get the results immediately. Switching between 'running something on a cluster' and 'doing something locally' is fairly easy and straightforward. This also means less context switching for the developer and more productivity.
  • Spark is, in effect, MapReduce and Oozie put together.
