Creating a SparkSession¶
from pyspark.sql import SparkSession

# Reuses an existing session if one is running, otherwise creates one
spark = SparkSession.builder.appName("xxx").getOrCreate()

Creating a DataFrame¶
df = spark.read.csv("abc.csv", header=True, inferSchema=True)
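A DataFrame can also be built from in-memory data; a minimal sketch, with made-up column names and rows:

data = [("Alice", 34), ("Bob", 45)]
df2 = spark.createDataFrame(data, ["name", "age"])
df2.show()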
Showing DataFrames and describing tables¶
df.show()

display(df)   # notebook helper available in Databricks; not part of plain PySpark

df.describe().show()

describe() computes summary statistics for the DataFrame's numeric and string columns: count, mean, stddev, min, and max. It returns a new DataFrame, so call .show() to display the result.

DESCRIBE FORMATTED tableName

DESCRIBE FORMATTED lists a table's columns along with detailed metadata such as the storage location, provider, and creation time.
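It can also be run from PySpark; a quick sketch (tableName is a placeholder):

spark.sql("DESCRIBE FORMATTED tableName").show(truncate=False)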

SQL¶
CREATE OR REPLACE VIEW viewName AS SELECT ...

Dropping a table¶
DROP TABLE IF EXISTS tableName
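Both statements can be issued through spark.sql; a minimal sketch with hypothetical table and view names:

# "orders" and "recent_orders" are made-up names for illustration
spark.sql("CREATE OR REPLACE VIEW recent_orders AS SELECT * FROM orders WHERE year = 2023")
spark.sql("DROP VIEW IF EXISTS recent_orders")
spark.sql("DROP TABLE IF EXISTS orders")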
Small dataframe¶
df.limit(100)   # returns a new DataFrame containing at most the first 100 rows
- Selecting Columns
- Filtering Data
- Adding Columns
- Renaming Columns
- Dropping Columns
- Grouping and Aggregating
  df.groupBy("column").count().show()
  df.groupBy("column").agg({"column2": "avg", "column3": "sum"}).show()
- Sorting Data

One short example of each operation is sketched below.
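A minimal sketch of each operation, reusing the hypothetical column names "column", "column2", and "column3" from the aggregation example above:

from pyspark.sql.functions import col

df.select("column", "column2").show()                    # selecting columns
df.filter(col("column2") > 10).show()                    # filtering rows
df.withColumn("col2_doubled", col("column2") * 2).show() # adding a column
df.withColumnRenamed("column", "new_name").show()        # renaming a column
df.drop("column3").show()                                # dropping a column
df.orderBy(col("column2").desc()).show()                 # sorting data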
RDD Operations¶
- Creating an RDD
- Transformations
- Actions

A short end-to-end sketch follows below.
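A minimal sketch covering all three, using a throwaway list of numbers:

rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])  # creating an RDD
squared = rdd.map(lambda x: x * x)                     # transformation (lazy)
evens = squared.filter(lambda x: x % 2 == 0)           # transformation (lazy)
print(evens.collect())                                 # action: [4, 16]
print(rdd.count())                                     # action: 5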
SQL Operations¶
- Creating Temp View
- Running SQL Queries

Both steps are sketched below.
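A minimal sketch, with "my_view" as a made-up view name and the hypothetical "column" from earlier examples:

df.createOrReplaceTempView("my_view")
spark.sql("SELECT column, COUNT(*) AS n FROM my_view GROUP BY column").show()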
Saving Data¶
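A minimal sketch with hypothetical output paths; mode("overwrite") replaces any existing data at the target:

df.write.mode("overwrite").parquet("/tmp/output_parquet")
df.write.mode("overwrite").option("header", True).csv("/tmp/output_csv")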
Miscellaneous¶
- Caching and Unpersisting DataFrames
- Explain Plan
- Repartitioning Data

Each is sketched below.
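A minimal sketch of all three; the partition counts are arbitrary examples:

df.cache()        # keep df in memory once an action materializes it
df.count()        # action that populates the cache
df.unpersist()    # release the cached data

df.explain()      # print the physical plan; df.explain(True) adds the logical plans

df_more = df.repartition(8)   # redistribute into 8 partitions (full shuffle)
df_one = df_more.coalesce(1)  # reduce partition count without a full shuffle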
PySpark when(condition).otherwise(default)¶
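when/otherwise builds a conditional column, much like SQL's CASE WHEN. A minimal sketch with a made-up threshold on the hypothetical "column2":

from pyspark.sql.functions import when, col

df = df.withColumn("label", when(col("column2") > 10, "high").otherwise("low"))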
Remember¶
In SQL, every non-aggregated column in the SELECT list must also appear in the GROUP BY clause.
Both RANK() and DENSE_RANK() rank rows within a result-set partition. After a tie, RANK() skips rank values, leaving gaps (1, 1, 3), while DENSE_RANK() continues without gaps (1, 1, 2).
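A minimal sketch of the difference, reusing the hypothetical "column" and "column2" columns:

from pyspark.sql.functions import rank, dense_rank, col
from pyspark.sql.window import Window

w = Window.partitionBy("column").orderBy(col("column2").desc())
df.withColumn("rank", rank().over(w)) \
  .withColumn("dense_rank", dense_rank().over(w)) \
  .show()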