If you’re new to PySpark, or even if you’ve been using it for a while, you’ve probably seen both where() and filter() in code examples. These two methods are frequently used for filtering rows based on specific conditions. But which one should you use? And what exactly do they do? In this blog, we’ll break down the similarities and subtle differences between where() and filter() in PySpark.
The Basics: where() vs. filter()
PySpark’s where() and filter() are essentially two sides of the same coin. They both let you keep only those rows in your DataFrame that satisfy a given condition. In fact, they are interchangeable: you can use either where() or filter() and get the same result.
Key Takeaway
In PySpark, where() and filter() do the same thing; they are synonyms for each other. You can choose whichever makes your code more readable.
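If you want to confirm this from the API itself, the where() docstring notes that it is an alias for filter(). A quick way to check in your own environment:
# Inspect the docstring for DataFrame.where
from pyspark.sql import DataFrame
help(DataFrame.where)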
How to Use where() and filter()
Let’s jump into some examples to see how they work.
Example 1: Filtering with filter()
Let’s say you have a DataFrame of customer data and you want to select only those customers who are over the age of 30.
Create a DataFrame
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
# Set up your Spark session
spark = SparkSession.builder.appName("FilterExample").getOrCreate()
# Sample DataFrame
data = [
("Alice", 25),
("Bob", 35),
("Charlie", 30)
]
columns = ["name", "age"]
df = spark.createDataFrame(data, schema=columns)
df.show()
# Output
+-------+---+
| name|age|
+-------+---+
| Alice| 25|
| Bob| 35|
|Charlie| 30|
+-------+---+
Filter using filter()
# Using filter to select customers over 30
filtered_df = df.filter(F.col("age") > 30)
filtered_df.show()
# output
+----+---+
|name|age|
+----+---+
| Bob| 35|
+----+---+
Example 2: Filtering with where()
# Using where to select customers over 30
where_df = df.where(F.col("age") > 30)
where_df.show()
# output
+----+---+
|name|age|
+----+---+
| Bob| 35|
+----+---+
As you can see, both filter() and where() produce identical outputs.
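Both methods also accept a SQL expression string in place of a Column expression, which can be handy if you’re coming from SQL. A minimal sketch using the same df as above:
# Same filter expressed as a SQL string; both lines return the same rows
df.filter("age > 30").show()
df.where("age > 30").show()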
Using Multiple Conditions
Whether you use where() or filter(), you can specify multiple conditions by combining them with & for “and” logic or | for “or” logic. Just remember to wrap each condition in parentheses, because & and | bind more tightly than comparison operators in Python.
Example 3: Using Multiple Conditions with where()
Suppose you want to select customers who are over 30 years old and whose names start with the letter “B”.
# Using where with multiple conditions
where_multiple_df = df.where((F.col("age") > 30) & (F.col("name").startswith("B")))
where_multiple_df.show()
# output
+----+---+
|name|age|
+----+---+
| Bob| 35|
+----+---+
You could achieve the same result with filter() as well. Just replace where() with filter(), and you’ll get identical results.
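For an “or” condition, swap & for | and keep the parentheses around each condition. A quick sketch on the same df, selecting customers who are under 30 or whose names start with “C”:
# Using | for an "or" condition; parentheses around each condition are required
or_df = df.where((F.col("age") < 30) | (F.col("name").startswith("C")))
or_df.show()
# Expected rows: Alice (25) and Charlie (30)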
When to Use where() vs. filter()
Since they are functionally identical, it’s up to you! Many PySpark users prefer where() because it feels more SQL-like, and SQL users are accustomed to filtering with WHERE clauses. Others prefer filter() because it feels more Pythonic and may be familiar from other data manipulation libraries.
Performance
The choice between where() and filter() does not impact performance in PySpark. Both methods compile down to the same execution plan, so feel free to choose based on readability alone.
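If you want to see this for yourself, compare the query plans with explain(), which prints the plan Spark will actually execute. A minimal sketch:
# Both calls should print the same physical plan
df.where(F.col("age") > 30).explain()
df.filter(F.col("age") > 30).explain()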
Final Thoughts
To sum it up, PySpark’s where() and filter() are your go-to methods for selecting rows based on conditions. Since they are identical, choosing between them is mostly a matter of preference and readability. Use them to make your data selection easier and keep your PySpark code clean and clear!
Happy filtering! 🚀