SPARK -DATAFRAME FILTERS
Consider we have a Students academic data as a CSV file and we need to filter the data of male students alone, i.e, we need to filter on Marks_obtained column. The Sample CSV file for our demo is as show below.
Problem is, Filter and fetch the records of students, whose MARKS_OBTAINED is more than or equal to 35.
Problem is, Filter and fetch the records of students, whose MARKS_OBTAINED is more than or equal to 35.
Apply Filter using PySpark:
Filtering data from dataframe is a basic transformation in Apache Spark, which can be applied in different ways.- Filter can be applied directly on a Spark dataframe using filter() API
- we can use where() function on Spark dataframe
- we can also register the Spark dataframe as a temporary view and write a SQL query on top of it to apply filter.
We can also use multiple condition separated by and, or to apply filter conditions.
Syntax:
df.filter("Filter condition")Let us look into both the ways one by one. Open new Jupyter notebook and initiate Spark Session and Spark Context.
Coding:
Method 1: Using Temp view
Step 1: Read the input file as a dataframe.Step 2: Register the dataframe as a temporary view using createOrReplaceTempView().
Step 3: Write a sql query and assign the output to dataframe as below.
Snippet;
df=spark.read.option('delimiter','|').csv('input.csv',header=True)
#Create temporary view on dataframe
df.createOrReplaceTempView('filter_view')
#Write a SQL query to filter and fetch record
df_filter=spark.sql("""select * from filter_view where MARKS_OBTAINED >= 35""")
Method 2: Using Filter() or Where()
Step 1: Read the give raw file and create a Spark Dataframe to it.Step 2: Use Filter() or Where() api, on top of Spark Dataframe. Refer the filter syntax mentioned in the above session.
The code snippet will be as follows,
df=spark.read.option('delimiter','|').csv('input.csv',header=True)
#apply filter api with filter condition
df_filter=df.filter("marks_obtained >= 35")
Output:
The filtered output is as shown below. I would recommend you to try this on your own to have a better understanding of concept.out[]:
Full program:
Hope you enjoyed learning from this article. If you have any doubts or comments, please drop down in the below comment box.
Comments
Post a Comment