1. What happens to an RDD when one of the nodes it is distributed across goes down?
Spark records the lineage of every RDD, that is, the sequence of transformations and actions that produced it from the source data. When a node goes down, Spark re-applies those same transformations to recompute the lost partitions of that node.
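This lineage-based recovery can be illustrated with a toy model in plain Python (not the real Spark API): each partition is just a slice of the source plus the recorded transformations, so a lost partition can always be rebuilt on demand.

```python
# Toy illustration of lineage-based recovery (not the real Spark API).
source = list(range(10))                      # the original input data
lineage = [lambda x: x * 2, lambda x: x + 1]  # transformations applied so far

def compute_partition(part_index, num_parts=2):
    """Recompute one partition from the source using the recorded lineage."""
    chunk = source[part_index::num_parts]     # this partition's share of the input
    for transform in lineage:
        chunk = [transform(x) for x in chunk]
    return chunk

# The node holding partition 1 "goes down"; we simply recompute that partition:
recovered = compute_partition(1)
print(recovered)
```

Because only the lineage and the source are needed, no replica of the computed partition has to be kept around.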
2. How to save an RDD?
Spark provides a few methods:
saveAsTextFile: Writes the elements of the RDD as a text file (or a set of text files) in the given directory. The directory can be in the local filesystem, HDFS, or any other supported file system. Each element is converted to text by calling its toString() method and is followed by a newline character ("\n").
saveAsSequenceFile: Writes the elements of the dataset as a Hadoop SequenceFile. This works only on key-value pair RDDs whose keys and values implement Hadoop's Writable interface. A sequence file can be loaded back using sc.sequenceFile().
saveAsObjectFile: Saves the data by serializing each element with standard Java object serialization.
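The per-partition behavior of saveAsTextFile can be sketched in plain Python (this is a stand-in for Spark's behavior, not Spark itself; the part-file naming mirrors Spark's convention): every element is rendered with str() and terminated with a newline, with one output file per partition.

```python
import os
import tempfile

def save_as_text_file(partitions, directory):
    """Mimic RDD.saveAsTextFile: one part-NNNNN file per partition,
    each element converted via str() and terminated with '\n'."""
    os.makedirs(directory, exist_ok=True)
    for i, part in enumerate(partitions):
        with open(os.path.join(directory, f"part-{i:05d}"), "w") as f:
            for element in part:
                f.write(str(element) + "\n")

out = os.path.join(tempfile.mkdtemp(), "rdd-out")
save_as_text_file([[1, 2], ["a", "b"]], out)
print(sorted(os.listdir(out)))  # one part file per partition
```

This is why the output of saveAsTextFile is a directory of part files rather than a single file.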
3. What do we mean by Parquet?
Apache Parquet is a columnar storage format available across the Hadoop ecosystem. It is a space-efficient format that can be used from any programming language or framework, and Apache Spark supports both reading and writing data in Parquet format.
4. What does Columnar Storage Format mean?
When converting tabular or structured data into a stream of bytes, we can store it either row-wise or column-wise.
In row-wise storage, we store the first row, then the second row, and so on. In column-wise storage, we store the first column, then the second column, and so on.
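The difference between the two layouts can be sketched in plain Python (a toy serialization order, not Parquet's actual encoding):

```python
table = [("alice", 30), ("bob", 25), ("carol", 41)]  # rows of (name, age)

# Row-wise: serialize one whole row after another.
row_wise = [value for row in table for value in row]

# Column-wise: serialize the whole first column, then the whole second column.
column_wise = [row[i] for i in range(2) for row in table]

print(row_wise)     # ['alice', 30, 'bob', 25, 'carol', 41]
print(column_wise)  # ['alice', 'bob', 'carol', 30, 25, 41]
```

Grouping values of the same column together is what lets columnar formats compress well and scan only the columns a query touches.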
5. What are the data model designs in MongoDB?
Data in MongoDB has a flexible schema: documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data.
MongoDB provides two types of data models: the embedded data model and the normalized data model. Based on the requirement, you can use either model while preparing your documents.
Embedded Data Model: In this model, you have (embed) all the related data in a single document; it is also known as the de-normalized data model. For example, if the details of an employee are held in three different documents, namely Personal_details, Contact and Address, you can embed all three in a single document as shown below:
   {
      _id : ,
      Emp_ID : "10025AE336",
      Personal_details : {
         First_Name : "Kishan",
         Last_Name : "choudhary",
         Date_Of_Birth : "1995-09-26"
      },
      Contact : {
         e-mail : "
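The embedded and normalized models can be contrasted with plain Python dicts standing in for BSON documents (the contact value and reference key below are hypothetical, added only for illustration):

```python
# Embedded (de-normalized): everything about the employee in one document.
embedded = {
    "Emp_ID": "10025AE336",
    "Personal_details": {"First_Name": "Kishan", "Last_Name": "choudhary"},
    "Contact": {"e-mail": "kishan@example.com"},  # hypothetical value
}

# Normalized: related data lives in separate documents linked by a reference.
employee = {"_id": "emp1", "Emp_ID": "10025AE336"}
contact = {"emp_ref": "emp1", "e-mail": "kishan@example.com"}  # hypothetical

# The embedded model answers the query with one lookup; the normalized
# model needs a second lookup (a join) through the reference.
print(embedded["Contact"]["e-mail"])
print(contact["e-mail"])
```

Embedding favors read performance for data retrieved together; normalization avoids duplication when the related data is shared or updated independently.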
6. How is a Spark application deployed with spark-submit?
spark-submit is a shell command used to deploy a Spark application on a cluster. It uses all the respective cluster managers through a uniform interface, so you do not have to configure your application separately for each one.
Example: let us take the same word-count example used before with shell commands, this time written as a Spark application.
Sample input: the following text is the input data, in a file named in.txt.
   people are not as beautiful as they look, as they walk or as they talk. they are only as beautiful as they love, as they care as they share.
Look at the following program (SparkWordCount.scala):
   import org.apache.spark.SparkContext
   import org.apache.spark.SparkContext._
   import org.apache.spark._

   object SparkWordCount {
      def main(args: Array[String]) {
         val sc = new SparkContext("local", "Word Count", "/usr/local/spark",
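The word-count logic of that Scala program (split the text into words, map each word to a count of 1, then reduce by key) can be sketched in plain Python; in Spark these would be the flatMap, map and reduceByKey steps:

```python
from collections import Counter

text = ("people are not as beautiful as they look, as they walk or as "
        "they talk. they are only as beautiful as they love, as they "
        "care as they share.")

# flatMap: split into words; map: (word, 1); reduceByKey: sum the counts.
counts = Counter(text.split())

print(counts["as"])  # 'as' appears 8 times in the sample input
```
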
7. What is Apache Sqoop?
Apache Sqoop is a big data tool for transferring data between Hadoop and relational database servers. Sqoop is used to import data from an RDBMS (relational database management system) such as MySQL or Oracle into HDFS (the Hadoop Distributed File System), and it can also export data processed with Hadoop MapReduce back into an RDBMS. In short, Sqoop is a data collection and ingestion tool for moving data between RDBMS and HDFS: SQOOP = SQL + HADOOP.
Why do we need Sqoop? It is primarily used for bulk data transfer to and from relational databases or mainframes. Sqoop can import entire tables, or let the user specify predicates to restrict the data selected, and it can write directly to HDFS.