Posts

Showing posts from August, 2022

Spark interview questions part-5

1. What is Immutable?
Ans: Once created and assigned a value, it cannot be changed; this property is called immutability. Spark RDDs are immutable by default: they do not allow updates or modifications. Note that the data collection is not immutable, but the data value is.
2. What is Distributed?
Ans: An RDD automatically distributes its data across the different parallel computing nodes of the cluster.
3. What is Lazy evaluated?
Ans: Transformations are not evaluated immediately when they are written. Spark only records them, and the computation is triggered when an action is called (see the sketch after this list).
4. What is the Spark engine's responsibility?
Ans: The Spark engine is responsible for scheduling, distributing, and monitoring the application across the cluster.
5. What are common Spark ecosystems?
Ans: Spark SQL (Shark) for SQL developers, Spark Streaming for streaming data, MLlib for machine learning algorithms, GraphX for graph computation, SparkR to run R on the Spark engine, and BlinkDB for interactive queries over massive data.
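A minimal Scala sketch of this lazy-evaluation behaviour, assuming a local SparkContext; the application name and the numbers are illustrative. The two transformations only record lineage, and nothing is computed until the count() action runs.

import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-eval-demo").setMaster("local[*]"))

    val nums = sc.parallelize(1 to 10)       // an RDD is immutable once created
    val doubled = nums.map(_ * 2)            // transformation: only recorded in the lineage
    val evens = doubled.filter(_ % 4 == 0)   // still nothing has been computed

    println(evens.count())                   // the action triggers the whole computation

    sc.stop()
  }
}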

Spark interview questions part-4

1. What does reduce action do?
A reduce action converts an RDD to a single value by recursively applying the provided function (passed as an argument) to the elements of the RDD until only one value is left. The provided function must be commutative and associative: the order of the arguments, and the order in which the function is applied, should not make a difference. The following diagram shows the process of applying a "sum" reduce function to an RDD containing 1, 2, 3, 4.
2. What is a broadcast variable?
Quite often we have to send certain data, such as a machine learning model, to every node. The most efficient way of sending the data to all of the nodes is by using broadcast variables. You could instead refer to an ordinary variable, which would get copied with every task, but a broadcast variable is far more efficient: it is loaded into memory only on the nodes where it is required, and only when it is required. It is a sort of read-only cache, similar to the distributed cache provided by Hadoop (see the sketch below).
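A minimal Scala sketch of both answers, assuming a local SparkContext; the country-code lookup map is an illustrative stand-in for the data (for example a model) that would be broadcast.

import org.apache.spark.{SparkConf, SparkContext}

object ReduceBroadcastDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reduce-broadcast-demo").setMaster("local[*]"))

    // reduce: a commutative and associative function collapses the RDD to a single value
    val sum = sc.parallelize(Seq(1, 2, 3, 4)).reduce(_ + _)   // 10

    // broadcast: ship a read-only lookup table to the executors once,
    // instead of copying it inside every task closure
    val bcNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

    val named = sc.parallelize(Seq("IN", "US", "IN"))
      .map(code => bcNames.value.getOrElse(code, "unknown"))
      .collect()

    println(s"sum = $sum, named = ${named.mkString(", ")}")
    sc.stop()
  }
}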

Spark interview questions part-3

1. What happens to an RDD when one of the nodes on which it is distributed goes down?
Since Spark knows how to prepare a certain dataset, because it is aware of the various transformations and actions that have led to that dataset, it is able to apply the same transformations and actions again to rebuild the lost partitions of the node that has gone down.
2. How to save an RDD?
There are a few methods provided by Spark (see the sketch after this list):
saveAsTextFile: writes the elements of the RDD as a text file (or set of text files) to the provided directory. The directory can be in the local filesystem, HDFS, or any other file system. Each element of the dataset is converted to text using its toString() method and is followed by a newline character "\n".
saveAsSequenceFile: writes the elements of the dataset as a Hadoop SequenceFile. This works only on key-value pair RDDs whose types implement Hadoop's Writable interface. You can load a sequence file back with sc.sequenceFile().
saveAsObjectFile: writes the elements of the dataset using Java serialization; the file can be loaded back with sc.objectFile().
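A minimal Scala sketch of the three save methods and of reading the files back, assuming a local SparkContext; the /tmp output paths are illustrative and must not already exist.

import org.apache.spark.{SparkConf, SparkContext}

object SaveRddDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("save-rdd-demo").setMaster("local[*]"))

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))

    pairs.saveAsTextFile("/tmp/demo-text")      // one line per element, via toString()
    pairs.saveAsSequenceFile("/tmp/demo-seq")   // key-value pairs as a Hadoop SequenceFile
    pairs.saveAsObjectFile("/tmp/demo-obj")     // Java-serialized objects

    // reading the sequence file and the object file back
    val seqBack = sc.sequenceFile[String, Int]("/tmp/demo-seq").collect()
    val objBack = sc.objectFile[(String, Int)]("/tmp/demo-obj").collect()
    println(seqBack.mkString(", ") + " | " + objBack.mkString(", "))

    sc.stop()
  }
}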

Spark interview questions part-2

1. Does Spark provide the storage layer too?
No, it does not provide a storage layer, but it lets you use many data sources. It provides the ability to read from almost every popular storage system, such as HDFS, Cassandra, Hive, HBase, and SQL servers.
2. Where does the Spark driver run on YARN?
If you submit a job with --master yarn-client (or --master yarn --deploy-mode client), the Spark driver runs on the client's machine. If you submit it with --master yarn-cluster (or --master yarn --deploy-mode cluster), the Spark driver runs inside a YARN container on the cluster.
3. To use Spark on an existing Hadoop cluster, do we need to install Spark on all nodes of Hadoop?
Since Spark runs as an application on top of YARN, it uses YARN to execute its tasks on the cluster's nodes, so you do not need to install Spark on every node. When a job is submitted, the Spark libraries are shipped to the nodes on which execution is needed, for the duration of that application.
4. What is SparkContext?
SparkContext is the entry point to Spark. Using the SparkContext you create RDDs, which provide various transformations and actions to operate on the data (see the sketch after this list).
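A minimal Scala sketch of SparkContext as the entry point, assuming the application is launched with spark-submit on YARN; the class name, jar name, and HDFS path are illustrative.

// spark-submit --master yarn --deploy-mode cluster --class EntryPointDemo app.jar
import org.apache.spark.{SparkConf, SparkContext}

object EntryPointDemo {
  def main(args: Array[String]): Unit = {
    // the master and deploy mode come from spark-submit, so they are not hard-coded here
    val sc = new SparkContext(new SparkConf().setAppName("entry-point-demo"))

    // SparkContext is how RDDs are created from data sources
    val lines = sc.textFile("hdfs:///tmp/input/sample.txt")
    println(s"line count = ${lines.count()}")

    sc.stop()
  }
}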

Spark interview questions part-1

1. What is Apache Spark and what are the benefits of Spark over MapReduce?
Spark is really fast: run in-memory, it can be up to 100x faster than Hadoop MapReduce. In Hadoop MapReduce you write many MapReduce jobs and then tie them together using Oozie or shell scripts, which is very time consuming, and MapReduce tasks have heavy latency: between two consecutive MapReduce jobs the data has to be written to HDFS and read back from HDFS. In Spark this is avoided by using RDDs and keeping intermediate data in memory (RAM). Quite often, translating the output of one MapReduce job into the input of another requires writing yet more code, because Oozie may not suffice. In Spark you can do everything from a single program or console (PySpark or the Scala console) and get the results immediately (see the sketch below). Switching between "running something on a cluster" and "doing something locally" is fairly easy and straightforward, and it also means less context switching for the developer.
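As a contrast with chaining several MapReduce jobs through Oozie, here is a minimal Scala word-count pipeline written as a single program; the input path is illustrative. Every step is a transformation on the previous RDD, and the intermediate results never need to be written to HDFS.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountPipeline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount-pipeline").setMaster("local[*]"))

    // the whole pipeline is one chain of transformations followed by a single action
    val topWords = sc.textFile("/tmp/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word.toLowerCase, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)
      .take(10)

    topWords.foreach(println)
    sc.stop()
  }
}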

Hive interview questions part-6

When you point a partition of a Hive table to a new directory, what happens to the data?
The data stays in the old directory; it has to be moved to the new location manually.
Write a query to insert a new column (new_col INT) into a Hive table (htab) at a position before an existing column (x_col).
Hive has no BEFORE clause, so the column is first added and then repositioned with CHANGE COLUMN ... FIRST | AFTER, placing it after the column that precedes x_col:
ALTER TABLE htab ADD COLUMNS (new_col INT);
ALTER TABLE htab CHANGE COLUMN new_col new_col INT AFTER <column_preceding_x_col>;
Does the archiving of Hive tables give any space saving in HDFS?
No. It only reduces the number of files, which makes them easier for the namenode to manage.
How can you stop a partition from being queried?
By using the ENABLE OFFLINE clause with the ALTER TABLE statement.
While loading data into a Hive table using the LOAD DATA clause, how do you specify that it is an HDFS file and not a local file?
By omitting the LOCAL clause in the LOAD DATA statement (see the sketch after this list).
If you omit the OVERWRITE clause while loading data into a Hive table, what happens to files which are new and files which already exist?
The new incoming files are simply added to the target directory, existing files whose names clash with incoming files are overwritten, and other files whose names do not match any of the incoming files continue to exist.
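A short HiveQL sketch of the LOAD DATA behaviour described above; the table name htab comes from the questions, while the file paths are illustrative.

-- LOCAL omitted: the path is resolved on HDFS and the file is moved into the table directory
LOAD DATA INPATH '/data/incoming/part-00000' INTO TABLE htab;

-- With LOCAL: the path is on the client machine and the file is copied into the table directory
LOAD DATA LOCAL INPATH '/tmp/part-00000' INTO TABLE htab;

-- With OVERWRITE: the existing contents of the table (or partition) are removed before loading
LOAD DATA INPATH '/data/incoming/part-00001' OVERWRITE INTO TABLE htab;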