dropna & fillna - Handling missing values in dfs¶

In PySpark dataframes, missing values are represented as NULL or None. Here, I will show you how to handle these missing values using various functions in PySpark.

Dropping Rows with Null Values¶

Drop Rows with Any Null Values:
```
df.dropna()  # or df.na.drop()
```

This will drop rows that have even one null value.

Drop Rows Where All Values Are Null:
```
df.dropna(how='all')  # or df.na.drop("all")
```
This will drop rows where all values are null.
Drop Rows with Null Values in Specific Columns:

Drop rows if country OR region have null values

df = df.dropna(subset=["country", "region"])
# Alternative: df.na.drop(subset=["country", "region"])

Filling Missing Values¶

Fill Null Values in Specific Columns:
```
df.fillna({"price": 0, "country": "unknown"})
```
If the price column has null values, replace them with 0. If the country column has null values, replace them with "unknown".

Using a Dictionary

replacements = {
    "age": 0,
    "country": "Unknown",
    "region": "Unknown",
    "income": 0,  # Adding more columns as needed
    "population": 0
}
df = df.fillna(replacements)
# Alternative: df.na.fill(replacements)

Replacing Specific Values¶

Using replace:
```
df.replace({None: "godknows"}, subset=["country"])
```
This will replace None (null) values in the country column with "godknows".

Using withColumn, when & otherwise:

from pyspark.sql.functions import when

df = df.withColumn("country", when(df["country"].isNull(), "godknows").otherwise(df["country"]))

This will replace null values in the country column with "godknows".

Using Filter

df.filter(df["age"] > 30)
# Alternative: df.where(df["age"] > 30)

Imputation¶

Fill Null Values with Mean of the Column:

from pyspark.sql.functions import mean

mean_price = df.select(mean("price")).collect()[0][0]
df = df.na.fill({"price": mean_price})

This will replace null values in the price column with the mean value of that column.