dropna & fillna - Handling missing values in DataFrames
In PySpark DataFrames, missing values are represented as `NULL` (`None` in Python). Here, I will show you how to handle these missing values using various functions in PySpark.
Dropping Rows with Null Values
- Drop Rows with Any Null Values:

```python
df.dropna()  # or df.na.drop()
```

This drops any row that contains at least one null value.
- Drop Rows Where All Values Are Null:

```python
df.dropna(how='all')  # or df.na.drop("all")
```

This drops only the rows in which every value is null.
- Drop Rows with Null Values in Specific Columns:

```python
df = df.dropna(subset=["country", "region"])  # Alternative: df.na.drop(subset=["country", "region"])
```

This drops rows where `country` OR `region` has a null value; all three variants are shown together in the sketch after this list.
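To see the three variants side by side, here is a minimal, self-contained sketch; the `spark` session, the column names, and the sample rows are illustrative assumptions, not part of the examples above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data: None marks a missing value.
df = spark.createDataFrame(
    [("DE", "EU"), (None, "EU"), (None, None)],
    ["country", "region"],
)

df.dropna().show()           # keeps only ("DE", "EU"): one null is enough to drop a row
df.dropna(how="all").show()  # drops only (None, None): every value must be null
df.dropna(subset=["country", "region"]).show()  # only these columns are checked for nulls
```

Note that each call returns a new DataFrame; `df` itself only changes if you reassign it, as the `subset` example in the list above does.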
Filling Missing Values
- Fill Null Values in Specific Columns:

```python
df.fillna({"price": 0, "country": "unknown"})
```

If the `price` column has null values, replace them with `0`; if the `country` column has null values, replace them with `"unknown"`.
- Using a Dictionary (a runnable version follows this list):

```python
replacements = {
    "age": 0,
    "country": "Unknown",
    "region": "Unknown",
    "income": 0,
    # Add more columns as needed
    "population": 0,
}
df = df.fillna(replacements)  # Alternative: df.na.fill(replacements)
```
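As a quick, runnable illustration of the dictionary form, here is a sketch with assumed sample data (the `spark` session and the rows are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(9.99, "DE"), (None, None)],
    ["price", "country"],
)

# Each listed column is filled with its own default; unlisted columns are untouched.
df.fillna({"price": 0, "country": "unknown"}).show()
```

One detail worth knowing: `fillna` only applies a replacement whose type matches the column's type, so a string default will never overwrite nulls in a numeric column.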
Replacing Specific Values
- Using replace:

```python
df.replace("N/A", "godknows", subset=["country"])
```

`replace` swaps one concrete value for another in the `country` column (the `"N/A"` here is just a stand-in example). It cannot match `None` (null) values: comparisons against null never succeed, and passing `None` as the value to replace raises an error. To fill nulls with `"godknows"`, use `fillna` as above or `when`/`otherwise` as below.
- Using withColumn, when & otherwise:

```python
from pyspark.sql.functions import when

df = df.withColumn("country", when(df["country"].isNull(), "godknows").otherwise(df["country"]))
```

This replaces null values in the `country` column with `"godknows"`.
- Using Filter:

```python
df.filter(df["age"] > 30)  # Alternative: df.where(df["age"] > 30)
```

`filter` keeps only the rows where the condition is true; rows where it evaluates to null (for example, because `age` is missing) are dropped as well, making this another way to discard rows with missing values. The sketch after this list contrasts all three approaches.
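The following sketch contrasts the three approaches on assumed toy data (the `spark` session, the sample rows, and the `"GB"` replacement value are illustrative, not from the original):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import when

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("UK", 25), (None, 40), ("DE", None)],
    ["country", "age"],
)

# replace: swaps concrete values only; the null country stays null.
df.replace("UK", "GB", subset=["country"]).show()

# when/otherwise: isNull() is a null-safe test, so missing values are caught.
df.withColumn(
    "country",
    when(df["country"].isNull(), "godknows").otherwise(df["country"]),
).show()

# filter: the row with a null age is dropped, since null > 30 is not true.
df.filter(df["age"] > 30).show()
```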
Imputation
- Fill Null Values with Mean of the Column:
```python
from pyspark.sql.functions import mean

mean_price = df.select(mean("price")).collect()[0][0]
df = df.na.fill({"price": mean_price})
```

This replaces null values in the `price` column with the mean of the column's non-null values.
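Putting it together, here is an end-to-end sketch of mean imputation on a tiny assumed dataset (the `spark` session and the values are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import mean

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(10.0,), (20.0,), (None,)],
    ["price"],
)

# mean() skips nulls: (10.0 + 20.0) / 2 = 15.0
mean_price = df.select(mean("price")).collect()[0][0]

df = df.na.fill({"price": mean_price})
df.show()  # the formerly-null row now holds 15.0
```

If the column is entirely null, `mean_price` comes back as `None`, so guard against that case before filling in a real pipeline.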