SPARK - DATAFRAME FILTERS

Consider that we have students' academic data as a CSV file and we need to filter it on the MARKS_OBTAINED column, i.e., fetch only the students who scored at least the pass mark. The sample CSV file for our demo is as shown below.


The problem: filter and fetch the records of students whose MARKS_OBTAINED is greater than or equal to 35.

Apply Filter using PySpark:

Filtering data from a dataframe is a basic transformation in Apache Spark, and it can be applied in different ways:
  • A filter can be applied directly on a Spark dataframe using the filter() API.
  • We can use the where() function on a Spark dataframe.
  • We can also register the Spark dataframe as a temporary view and write a SQL query on top of it to apply the filter.
We can also combine multiple conditions with and/or in a single filter, as illustrated after the syntax below.

Syntax:

           df.filter("Filter condition")

Let us look at these methods one by one. Open a new Jupyter notebook and initiate a Spark session and Spark context.
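
If a Spark session is not already available in the notebook, it can be created as in the sketch below (the application name is just an example):

from pyspark.sql import SparkSession

#Create or reuse a Spark session; the app name is an arbitrary example
spark = SparkSession.builder.appName('filter_demo').getOrCreate()

#The Spark context is available from the session
sc = spark.sparkContext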

Coding:

Method 1: Using Temp view

Step 1: Read the input file as a dataframe.
Step 2: Register the dataframe as a temporary view using createOrReplaceTempView().
Step 3: Write a SQL query and assign the output to a dataframe as shown below.

Snippet:

df=spark.read.option('delimiter','|').csv('input.csv',header=True)

#Create temporary view on dataframe
df.createOrReplaceTempView('filter_view')

#Write a SQL query to filter and fetch records
df_filter=spark.sql("""select * from filter_view where MARKS_OBTAINED >= 35""")
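
To verify the result, the filtered dataframe can be displayed, for example:

#Display the filtered records
df_filter.show()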


Method 2: Using Filter() or Where()

Step 1: Read the given raw file and create a Spark dataframe from it.
Step 2: Use the filter() or where() API on top of the Spark dataframe. Refer to the filter syntax mentioned in the section above.

The code snippet is as follows:

df=spark.read.option('delimiter','|').csv('input.csv',header=True)

#apply filter api with filter condition
df_filter=df.filter("marks_obtained >= 35")
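
As a note, where() is an alias of filter(), so the same condition works with either API; the column-expression form shown below is an equivalent alternative:

#where() is an alias of filter(); this is equivalent to the snippet above
df_filter=df.where("marks_obtained >= 35")

#Column-expression form of the same filter
from pyspark.sql.functions import col
df_filter=df.filter(col('marks_obtained') >= 35)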


Output:

The filtered output is as shown below. I would recommend you try this on your own to get a better understanding of the concept.


Full program:
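
As a minimal end-to-end sketch, the steps above can be assembled as follows (the input file name, delimiter, and column names are assumed to match the earlier snippets):

from pyspark.sql import SparkSession

#Initiate Spark session (the app name is an arbitrary example)
spark = SparkSession.builder.appName('filter_demo').getOrCreate()

#Read the input file as a dataframe
df=spark.read.option('delimiter','|').csv('input.csv',header=True)

#Method 1: Temp view + SQL query
df.createOrReplaceTempView('filter_view')
df_filter_sql=spark.sql("""select * from filter_view where MARKS_OBTAINED >= 35""")

#Method 2: filter() / where()
df_filter=df.filter("marks_obtained >= 35")

#Display the filtered records
df_filter.show()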




Hope you enjoyed learning from this article. If you have any doubts or comments, please drop them in the comment box below.
