SPARK - DATAFRAME FILTERS

Consider that we have students' academic data in a CSV file and we need to filter out a subset of the students, i.e., we need to filter on the MARKS_OBTAINED column. A sample CSV file for our demo is as shown below.
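For reference, the snippets in this article assume a pipe-delimited input.csv along the following lines; every column and value other than MARKS_OBTAINED is purely illustrative:

ROLL_NO|NAME|MARKS_OBTAINED
1|Arun|42
2|Bala|28
3|Chitra|67
4|David|19
5|Esther|35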


Problem statement: filter and fetch the records of students whose MARKS_OBTAINED is greater than or equal to 35.

Apply Filter using PySpark:

Filtering data from a dataframe is a basic transformation in Apache Spark, and it can be applied in different ways:
  • A filter can be applied directly on a Spark dataframe using the filter() API.
  • We can use the where() function on a Spark dataframe.
  • We can also register the Spark dataframe as a temporary view and write a SQL query on top of it to apply the filter.
We can also combine multiple conditions with AND / OR in a single filter, as shown below.
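For example, two conditions can be combined in one filter expression. In this sketch, the GENDER column is hypothetical and is used only to illustrate combining conditions:

#Filter on marks and a (hypothetical) gender column in one expression
df_filter=df.filter("MARKS_OBTAINED >= 35 AND GENDER = 'M'")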

Syntax:

           df.filter("Filter condition")

Let us look into these ways one by one. Open a new Jupyter notebook and initiate a Spark session and Spark context.
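A minimal initialisation, assuming a local PySpark setup (the application name is arbitrary), looks like this:

from pyspark.sql import SparkSession

#Create a Spark session; the Spark context is available from it
spark = SparkSession.builder.appName('filter_demo').getOrCreate()
sc = spark.sparkContext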

Coding:

Method 1: Using Temp view

Step 1: Read the input file as a dataframe.
Step 2: Register the dataframe as a temporary view using createOrReplaceTempView().
Step 3: Write a SQL query and assign the output to a dataframe as below.

Snippet:

df=spark.read.option('delimiter','|').csv('input.csv',header=True)

#Create a temporary view on the dataframe
df.createOrReplaceTempView('filter_view')

#Write a SQL query to filter and fetch the records
df_filter=spark.sql("""select * from filter_view where MARKS_OBTAINED >= 35""")
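
To verify the result, you can display the filtered dataframe:

#Display the filtered records
df_filter.show()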


Method 2: Using filter() or where()

Step 1: Read the given raw file and create a Spark dataframe from it.
Step 2: Use the filter() or where() API on top of the Spark dataframe. Refer to the filter syntax mentioned in the section above.

The code snippet is as follows:

df=spark.read.option('delimiter','|').csv('input.csv',header=True)

#Apply the filter API with the filter condition
df_filter=df.filter("MARKS_OBTAINED >= 35")
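
filter() also accepts a Column expression instead of a SQL string. One point to note: columns read from a CSV default to string type, so adding inferSchema ensures a numeric comparison. A minimal sketch:

from pyspark.sql.functions import col

#Read with inferSchema so MARKS_OBTAINED becomes a numeric column
df=spark.read.option('delimiter','|').csv('input.csv',header=True,inferSchema=True)

#Equivalent filter using a Column expression
df_filter=df.filter(col('MARKS_OBTAINED') >= 35)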


Output:

The filtered output contains only the rows where MARKS_OBTAINED is greater than or equal to 35. I would recommend you try this on your own to get a better understanding of the concept.


Full program:
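
Combining the snippets above, the full program would look roughly like this (a sketch, assuming the input.csv file and the MARKS_OBTAINED column described earlier):

from pyspark.sql import SparkSession

#Initiate the Spark session
spark = SparkSession.builder.appName('filter_demo').getOrCreate()

#Read the pipe-delimited input file as a dataframe
df=spark.read.option('delimiter','|').csv('input.csv',header=True)

#Method 1: register a temporary view and filter with a SQL query
df.createOrReplaceTempView('filter_view')
df_filter_sql=spark.sql("""select * from filter_view where MARKS_OBTAINED >= 35""")
df_filter_sql.show()

#Method 2: filter directly on the dataframe
df_filter=df.filter("MARKS_OBTAINED >= 35")
df_filter.show()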




Hope you enjoyed learning from this article. If you have any doubts or comments, please drop them in the comment box below.
