1. What happens to an RDD when one of the nodes on which it is distributed goes down?
Spark records how each dataset was built, that is, the sequence of transformations that led to it. When a node goes down, Spark reapplies those same transformations to the source data to rebuild only the partitions that were lost with that node.
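The recovery idea above can be sketched in plain Python. This is a toy model, not Spark's API: the class `ToyRDD` and its methods `from_source`, `lose_partition` and `partition` are hypothetical names invented for illustration, and the sketch assumes the source data is durable and can be read again.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it remembers how to build each partition."""

    def __init__(self, compute, num_partitions):
        self._compute = compute            # recipe for building partition i
        self.num_partitions = num_partitions
        self._cache = {}                   # partitions currently held "on a node"

    @classmethod
    def from_source(cls, partitions):
        # Assumption: the source can always be re-read (for example from HDFS).
        return cls(lambda i: list(partitions[i]), len(partitions))

    def map(self, fn):
        # The child's recipe is the parent's partition plus fn.
        return ToyRDD(lambda i: [fn(x) for x in self.partition(i)],
                      self.num_partitions)

    def filter(self, pred):
        return ToyRDD(lambda i: [x for x in self.partition(i) if pred(x)],
                      self.num_partitions)

    def partition(self, i):
        # Recompute the partition from its recipe if it is not available.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        # Simulate the node holding partition i going down.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


source = ToyRDD.from_source([[1, 2, 3], [4, 5, 6]])
even_squares = source.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = even_squares.collect()
even_squares.lose_partition(1)       # partition 1 is gone
after = even_squares.collect()       # rebuilt by replaying map and filter
print(before, after)
```

Only the missing partition is recomputed; partition 0 is served from what is still held.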
2. How do you save an RDD?
Spark provides a few methods:
saveAsTextFile: Writes the elements of the RDD as a text file (or set of text files) to the given directory. The directory can be on the local filesystem, HDFS, or any other Hadoop-supported file system. Spark calls toString() on each element to convert it to text, and each element is followed by a newline character ("\n").
saveAsSequenceFile: Writes the elements of the dataset as a Hadoop SequenceFile. This works only on RDDs of key-value pairs whose keys and values implement Hadoop's Writable interface (in Scala, it also works on types that are implicitly convertible to Writable, such as Int, Double and String). You can load a sequence file with sc.sequenceFile().
saveAsObjectFile: Writes the elements of the dataset using standard Java object serialization. You can load the result back with sc.objectFile().
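To make the text-file behaviour concrete, here is a plain-Python sketch that mimics what is described for saveAsTextFile: one output file per partition, each element converted to a string and followed by a newline. It does not call Spark; the helper names `save_as_text_file` and `read_text_files` are hypothetical, and the part-file naming only imitates the `part-00000` style that Spark output directories use.

```python
import os
import tempfile


def save_as_text_file(partitions, directory):
    """Write one part file per partition, one line per element."""
    # Like Spark, refuse to write into an output directory that already exists.
    os.makedirs(directory)
    for index, partition in enumerate(partitions):
        path = os.path.join(directory, f"part-{index:05d}")
        with open(path, "w", encoding="utf-8") as f:
            for element in partition:
                # str() plays the role of toString() here.
                f.write(str(element) + "\n")


def read_text_files(directory):
    """Read every part file back as a flat list of lines."""
    lines = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name), encoding="utf-8") as f:
            lines.extend(line.rstrip("\n") for line in f)
    return lines


out_dir = os.path.join(tempfile.mkdtemp(), "output")
save_as_text_file([[("a", 1), ("b", 2)], [("c", 3)]], out_dir)
part_files = sorted(os.listdir(out_dir))
loaded = read_text_files(out_dir)
print(part_files)
print(loaded)
```

Note that what comes back is text, not the original objects. That is the practical difference from the sequence-file and object-file formats, which keep enough structure to restore the elements.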
3. What do we mean by Parquet?
Apache Parquet is a columnar storage format available in the Hadoop ecosystem. It is a space-efficient format that can be used regardless of the programming language or data processing framework.
Apache Spark supports reading and writing data in Parquet format.
4. What is meant by a columnar storage format?
When converting tabular or structured data into a stream of bytes, we can store it either row-wise or column-wise.
In row-wise storage, we store the whole first row, then the whole second row, and so on. In column-wise storage, we store all the values of the first column, then all the values of the second column, and so on.
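A small plain-Python illustration of the two layouts, using a made-up three-row table (the data and function names are only for illustration, and a flat list stands in for the stream of bytes):

```python
# Each tuple is one row: (id, name, age).
rows = [
    (1, "alice", 30),
    (2, "bob", 25),
    (3, "carol", 35),
]


def row_wise(table):
    """Flatten the table one row after another."""
    return [value for row in table for value in row]


def column_wise(table):
    """Flatten the table one column after another."""
    return [value for column in zip(*table) for value in column]


row_layout = row_wise(rows)
column_layout = column_wise(rows)
print(row_layout)     # [1, 'alice', 30, 2, 'bob', 25, 3, 'carol', 35]
print(column_layout)  # [1, 2, 3, 'alice', 'bob', 'carol', 30, 25, 35]

# Reading only the "age" column (index 2):
num_rows, num_cols = len(rows), len(rows[0])
ages_from_rows = row_layout[2::num_cols]                       # strided access
ages_from_columns = column_layout[2 * num_rows:3 * num_rows]   # one contiguous run
print(ages_from_rows, ages_from_columns)
```

In the column-wise layout the values of one column sit next to each other, so a query that needs only a few columns can skip the rest. Values of the same type stored together also tend to compress better.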