Creating a SparkSession¶
from pyspark.sql import SparkSession

# Reuses an existing session if one is running, otherwise creates one
spark = SparkSession.builder.appName("xxx").getOrCreate()

Creating a DataFrame¶
df = spark.read.csv("abc.csv", header=True, inferSchema=True)
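A DataFrame can also be built from in-memory data; a minimal sketch, with made-up column names and rows:

data = [("Alice", 34), ("Bob", 45)]
df2 = spark.createDataFrame(data, ["name", "age"])
df2.show()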
Showing DataFrames and describing tables¶
df.show()

display(df)   # notebook helper available in Databricks; not part of plain PySpark

df.describe().show()

describe() computes summary statistics for the DataFrame's numeric and string columns: count, mean, stddev, min, and max. It returns a new DataFrame, so call .show() to display the result.

DESCRIBE FORMATTED tableName

DESCRIBE FORMATTED lists a table's columns along with detailed metadata such as the storage location, provider, and creation time.
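It can also be run from PySpark; a quick sketch (tableName is a placeholder):

spark.sql("DESCRIBE FORMATTED tableName").show(truncate=False)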

SQL¶
CREATE OR REPLACE VIEW viewName AS SELECT ...

Dropping a table¶
DROP TABLE IF EXISTS tableName
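Both statements can be issued through spark.sql; a minimal sketch with hypothetical table and view names:

# "orders" and "recent_orders" are made-up names for illustration
spark.sql("CREATE OR REPLACE VIEW recent_orders AS SELECT * FROM orders WHERE year = 2023")
spark.sql("DROP VIEW IF EXISTS recent_orders")
spark.sql("DROP TABLE IF EXISTS orders")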
Small dataframe¶
df.limit(100)   # returns a new DataFrame containing at most the first 100 rows
- Selecting Columns
- Filtering Data
- Adding Columns
- Renaming Columns
- Dropping Columns
- Grouping and Aggregating
  df.groupBy("column").count().show()
  df.groupBy("column").agg({"column2": "avg", "column3": "sum"}).show()
- Sorting Data

One short example of each operation is sketched below.
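A minimal sketch of each operation, reusing the hypothetical column names "column", "column2", and "column3" from the aggregation example above:

from pyspark.sql.functions import col

df.select("column", "column2").show()                    # selecting columns
df.filter(col("column2") > 10).show()                    # filtering rows
df.withColumn("col2_doubled", col("column2") * 2).show() # adding a column
df.withColumnRenamed("column", "new_name").show()        # renaming a column
df.drop("column3").show()                                # dropping a column
df.orderBy(col("column2").desc()).show()                 # sorting data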
RDD Operations¶
- Creating an RDD
- Transformations
- Actions

A short end-to-end sketch follows below.
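A minimal sketch covering all three, using a throwaway list of numbers:

rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])  # creating an RDD
squared = rdd.map(lambda x: x * x)                     # transformation (lazy)
evens = squared.filter(lambda x: x % 2 == 0)           # transformation (lazy)
print(evens.collect())                                 # action: [4, 16]
print(rdd.count())                                     # action: 5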
SQL Operations¶
- Creating Temp View
- Running SQL Queries

Both steps are sketched below.
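A minimal sketch, with "my_view" as a made-up view name and the hypothetical "column" from earlier examples:

df.createOrReplaceTempView("my_view")
spark.sql("SELECT column, COUNT(*) AS n FROM my_view GROUP BY column").show()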
Saving Data¶
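A minimal sketch with hypothetical output paths; mode("overwrite") replaces any existing data at the target:

df.write.mode("overwrite").parquet("/tmp/output_parquet")
df.write.mode("overwrite").option("header", True).csv("/tmp/output_csv")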
Miscellaneous¶
- Caching and Unpersisting DataFrames
- Explain Plan
- Repartitioning Data

Each is sketched below.
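A minimal sketch of all three; the partition counts are arbitrary examples:

df.cache()        # keep df in memory once an action materializes it
df.count()        # action that populates the cache
df.unpersist()    # release the cached data

df.explain()      # print the physical plan; df.explain(True) adds the logical plans

df_more = df.repartition(8)   # redistribute into 8 partitions (full shuffle)
df_one = df_more.coalesce(1)  # reduce partition count without a full shuffle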
PySpark when(condition).otherwise(default)¶
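when/otherwise builds a conditional column, much like SQL's CASE WHEN. A minimal sketch with a made-up threshold on the hypothetical "column2":

from pyspark.sql.functions import when, col

df = df.withColumn("label", when(col("column2") > 10, "high").otherwise("low"))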
Remember¶
In SQL, every non-aggregated column in the SELECT list must also appear in the GROUP BY clause.
Both RANK() and DENSE_RANK() rank rows within a result-set partition. After a tie, RANK() skips rank values, leaving gaps (1, 1, 3), while DENSE_RANK() continues without gaps (1, 1, 2).
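A minimal sketch of the difference, reusing the hypothetical "column" and "column2" columns:

from pyspark.sql.functions import rank, dense_rank, col
from pyspark.sql.window import Window

w = Window.partitionBy("column").orderBy(col("column2").desc())
df.withColumn("rank", rank().over(w)) \
  .withColumn("dense_rank", dense_rank().over(w)) \
  .show()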