Table of contents
  1. dropna & fillna - Handling missing values in DataFrames
    1. Dropping Rows with Null Values
    2. Filling Missing Values
    3. Replacing Specific Values
    4. Imputation

dropna & fillna - Handling missing values in DataFrames

In PySpark DataFrames, missing values are represented as NULL (which surfaces in Python as None). Here, I will show you how to handle these missing values using various PySpark functions.
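
For the snippets below, assume a small example DataFrame with a few missing values (hypothetical data, purely for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical sample data; some fields are deliberately None
    df = spark.createDataFrame(
        [
            ("Kenya", "Nairobi", 100.0, 34),
            ("Kenya", None, None, 41),
            (None, None, None, None),
        ],
        ["country", "region", "price", "age"],
    )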

Dropping Rows with Null Values

  • Drop Rows with Any Null Values:
    df.dropna()  # or df.na.drop()
    

    This drops every row that contains at least one null value (how='any' is the default).

  • Drop Rows Where All Values Are Null:
    df.dropna(how='all')  # or df.na.drop("all")
    

    This will drop rows where all values are null.

  • Drop Rows with Null Values in Specific Columns:
    df = df.dropna(subset=["country", "region"])
    # Alternative: df.na.drop(subset=["country", "region"])

    This drops a row if either country or region is null; with the default how='any', a single null among the subset columns is enough. A short demonstration of all three variants follows this list.
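
A quick sketch of how the variants differ, assuming the hypothetical df from the introduction; dropna also takes a thresh parameter that keeps only rows with at least that many non-null values:

    df.dropna().show()            # drops the partially- and the fully-null row
    df.dropna(how="all").show()   # drops only the fully-null row
    df.dropna(thresh=2).show()    # keeps rows with at least 2 non-null values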
    

Filling Missing Values

  • Fill Null Values in Specific Columns:
    df.fillna({"price": 0, "country": "unknown"})
    

    Nulls in price become 0 and nulls in country become "unknown"; columns not named in the dictionary are left unchanged.

  • Using a Replacement Dictionary:
    replacements = {
        "age": 0,
        "country": "Unknown",
        "region": "Unknown",
        "income": 0,  # Add more columns as needed
        "population": 0
    }
    df = df.fillna(replacements)
    # Alternative: df.na.fill(replacements)

    This fills several columns in a single call, each with its own default; a quick before/after sketch follows this list.
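
Note that fillna and na.fill return new DataFrames rather than modifying df in place, so assign the result if you want to keep it. A minimal before/after sketch, assuming the hypothetical df from the introduction:

    df.show()   # nulls still present

    filled = df.fillna({"price": 0, "country": "unknown"})
    filled.show()   # price and country nulls filled; region untouched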
    

Replacing Specific Values

  • Using replace:
    df.replace({"N/A": "godknows"}, subset=["country"])

    This replaces the literal placeholder "N/A" (an example value) in the country column with "godknows". Note that replace matches concrete values only; it cannot match NULL, so use fillna (above) or the when/otherwise pattern (below) for nulls.

  • Using withColumn, when & otherwise:
    from pyspark.sql.functions import when
    
    df = df.withColumn("country", when(df["country"].isNull(), "godknows").otherwise(df["country"]))
    

    This will replace null values in the country column with "godknows".

  • Using Filter:
    df.filter(df["age"] > 30)
    # Alternative: df.where(df["age"] > 30)

    This keeps only rows where age is greater than 30. A comparison against NULL evaluates to NULL, which filter treats as false, so rows with a null age are dropped as well; see the sketch after this list for filtering on nulls explicitly.
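
Two related sketches, assuming the hypothetical df from the introduction: coalesce returns the first non-null of its arguments and expresses the same null replacement as the when/otherwise pattern above, and isNotNull makes null filtering explicit:

    from pyspark.sql.functions import coalesce, col, lit

    # Same effect as the when/otherwise example above
    df = df.withColumn("country", coalesce(col("country"), lit("godknows")))

    # Explicitly drop rows where age is null
    df_known_age = df.filter(col("age").isNotNull())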

Imputation

  • Fill Null Values with Mean of the Column:
    from pyspark.sql.functions import mean
    
    mean_price = df.select(mean("price")).collect()[0][0]
    df = df.na.fill({"price": mean_price})
    

    This replaces null values in the price column with that column's mean, computed via collect() on the driver. For several columns at once, see the Imputer sketch below.
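
To impute several columns in one pass, or to use the median instead of the mean, pyspark.ml.feature.Imputer avoids the manual collect(). A minimal sketch, assuming numeric price and age columns; the output column names here are illustrative:

    from pyspark.ml.feature import Imputer

    # Fill nulls in price and age with each column's mean
    # (strategy="median" is also supported)
    imputer = Imputer(
        inputCols=["price", "age"],
        outputCols=["price_filled", "age_filled"],
        strategy="mean",
    )
    df = imputer.fit(df).transform(df)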