SPARK - DATAFRAME FILTERS

Consider that we have students' academic data in a CSV file and we need to filter out a subset of the students, i.e., we need to filter on the MARKS_OBTAINED column. A sample CSV file for our demo is as shown below.
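For reference, the snippets in this article assume a pipe-delimited input.csv along the following lines; every column and value other than MARKS_OBTAINED is purely illustrative:

ROLL_NO|NAME|MARKS_OBTAINED
1|Arun|42
2|Bala|28
3|Chitra|67
4|David|19
5|Esther|35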


Problem statement: filter and fetch the records of students whose MARKS_OBTAINED is greater than or equal to 35.

Apply Filter using PySpark:

Filtering data from a dataframe is a basic transformation in Apache Spark, and it can be applied in different ways:
  • A filter can be applied directly on a Spark dataframe using the filter() API.
  • We can use the where() function on a Spark dataframe.
  • We can also register the Spark dataframe as a temporary view and write a SQL query on top of it to apply the filter.
We can also combine multiple conditions with AND / OR in a single filter, as shown below.
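For example, two conditions can be combined in one filter expression. In this sketch, the GENDER column is hypothetical and is used only to illustrate combining conditions:

#Filter on marks and a (hypothetical) gender column in one expression
df_filter=df.filter("MARKS_OBTAINED >= 35 AND GENDER = 'M'")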

Syntax:

           df.filter("Filter condition")

Let us look into these ways one by one. Open a new Jupyter notebook and initiate a Spark session and Spark context.
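A minimal initialisation, assuming a local PySpark setup (the application name is arbitrary), looks like this:

from pyspark.sql import SparkSession

#Create a Spark session; the Spark context is available from it
spark = SparkSession.builder.appName('filter_demo').getOrCreate()
sc = spark.sparkContext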

Coding:

Method 1: Using Temp view

Step 1: Read the input file as a dataframe.
Step 2: Register the dataframe as a temporary view using createOrReplaceTempView().
Step 3: Write a SQL query and assign the output to a dataframe as below.

Snippet:

df=spark.read.option('delimiter','|').csv('input.csv',header=True)

#Create a temporary view on the dataframe
df.createOrReplaceTempView('filter_view')

#Write a SQL query to filter and fetch the records
df_filter=spark.sql("""select * from filter_view where MARKS_OBTAINED >= 35""")
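
To verify the result, you can display the filtered dataframe:

#Display the filtered records
df_filter.show()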


Method 2: Using filter() or where()

Step 1: Read the given raw file and create a Spark dataframe from it.
Step 2: Use the filter() or where() API on top of the Spark dataframe. Refer to the filter syntax mentioned in the section above.

The code snippet is as follows:

df=spark.read.option('delimiter','|').csv('input.csv',header=True)

#Apply the filter API with the filter condition
df_filter=df.filter("MARKS_OBTAINED >= 35")
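
filter() also accepts a Column expression instead of a SQL string. One point to note: columns read from a CSV default to string type, so adding inferSchema ensures a numeric comparison. A minimal sketch:

from pyspark.sql.functions import col

#Read with inferSchema so MARKS_OBTAINED becomes a numeric column
df=spark.read.option('delimiter','|').csv('input.csv',header=True,inferSchema=True)

#Equivalent filter using a Column expression
df_filter=df.filter(col('MARKS_OBTAINED') >= 35)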


Output:

The filtered output contains only the rows where MARKS_OBTAINED is greater than or equal to 35. I would recommend you try this on your own to get a better understanding of the concept.


Full program:
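
Combining the snippets above, the full program would look roughly like this (a sketch, assuming the input.csv file and the MARKS_OBTAINED column described earlier):

from pyspark.sql import SparkSession

#Initiate the Spark session
spark = SparkSession.builder.appName('filter_demo').getOrCreate()

#Read the pipe-delimited input file as a dataframe
df=spark.read.option('delimiter','|').csv('input.csv',header=True)

#Method 1: register a temporary view and filter with a SQL query
df.createOrReplaceTempView('filter_view')
df_filter_sql=spark.sql("""select * from filter_view where MARKS_OBTAINED >= 35""")
df_filter_sql.show()

#Method 2: filter directly on the dataframe
df_filter=df.filter("MARKS_OBTAINED >= 35")
df_filter.show()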




Hope you enjoyed learning from this article. If you have any doubts or comments, please drop them in the comment box below.
