Creating a SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("xxx").getOrCreate()
Creating a DataFrame
df = spark.read.csv("abc.csv", header=True, inferSchema=True)
Showing DataFrames and describing tables
df.show()
df.show(n=10, truncate=False)
display(df)           # Databricks notebooks only
display(df.limit(3))
df.describe()
describe() generates summary statistics for the DataFrame. For numeric columns the result includes count, mean, stddev, min, and max; for string columns it returns count, min, and max. (The top, unique, and freq statistics belong to pandas' describe, not PySpark's.) describe() returns a DataFrame, so call .show() to display it:
df.describe().show()
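A minimal sketch on a tiny made-up DataFrame (names and ages are placeholders):
data = [("Alice", 25), ("Bob", 30), ("Cara", 35)]
demo = spark.createDataFrame(data, ["name", "age"])
demo.describe().show()   # summary rows: count, mean, stddev, min, max
# for the string column "name", mean and stddev are null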
df.printSchema()
df.columns
DESCRIBE FORMATTED tableName
spark.sql("DESCRIBE FORMATTED tableName")
SQL
CREATE OR REPLACE VIEW view_name AS SELECT ...
Dropping a table (T-SQL / SQL Server syntax)
IF EXISTS (SELECT * FROM sys.sysobjects WHERE name = 'apple')
    DROP TABLE dbo.apple;
GO
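In Spark SQL the conditional drop is a single statement; a minimal sketch (the table name is a placeholder):
spark.sql("DROP TABLE IF EXISTS apple")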
Small DataFrame
df.limit(100)   # returns a new DataFrame with at most 100 rows
DataFrame Operations
- Selecting Columns
df.select("column1", "column2").show()
- Filtering Data
df.filter(df["column"] > value).show()
df.filter(df["column"] == "value").show()
- Adding Columns
df.withColumn("new_column", df["existing_column"] * 2).show()
- Renaming Columns
df.withColumnRenamed("old_name", "new_name")
- Dropping Columns
df.drop("column_name")
- Grouping and Aggregating
df.groupBy("column").count().show()
df.groupBy("column").agg({"column2": "avg", "column3": "sum"}).show()
- Sorting Data
df.orderBy("column").show()
df.orderBy(df["column"].desc()).show()
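A combined sketch tying these operations together, using a small made-up DataFrame (the columns "dept" and "salary" are placeholders):
from pyspark.sql import functions as F
emp = spark.createDataFrame(
    [("IT", 4000), ("IT", 5000), ("HR", 3500)],
    ["dept", "salary"],
)
(emp.filter(F.col("salary") > 3000)               # keep matching rows
    .withColumn("bonus", F.col("salary") * 0.1)   # derive a new column
    .groupBy("dept")                              # group and aggregate
    .agg(F.avg("salary").alias("avg_salary"))
    .orderBy(F.col("avg_salary").desc())          # sort descending
    .show())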
RDD Operations
- Creating an RDD
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
- Transformations
rdd2 = rdd.map(lambda x: x * 2)
rdd3 = rdd.filter(lambda x: x % 2 == 0)
- Actions
rdd.collect()
rdd.count()
rdd.first()
rdd.take(3)
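Transformations are lazy; Spark runs nothing until an action is called. A minimal sketch:
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
doubled = rdd.map(lambda x: x * 2)   # lazy: builds the lineage, runs no job
print(doubled.collect())             # action: triggers execution -> [2, 4, 6, 8, 10]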
SQL Operations
- Creating Temp View
df.createOrReplaceTempView("table_name")
- Running SQL Queries
spark.sql("SELECT * FROM table_name").show()
Saving Data
- Saving as CSV
df.write.csv("path/to/save.csv")
- Saving as Parquet
df.write.parquet("path/to/save.parquet")
- Saving to Hive
df.write.saveAsTable("table_name")
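By default these writes fail if the target already exists; a save mode can be set explicitly. A minimal sketch (paths and table name are placeholders):
df.write.mode("overwrite").option("header", True).csv("path/to/save.csv")
df.write.mode("overwrite").parquet("path/to/save.parquet")
df.write.mode("overwrite").saveAsTable("table_name")   # writes to the session catalog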
Miscellaneous
- Caching and Unpersisting DataFrames
df.cache()       # lazy; materialized by the first action on df
df.unpersist()
- Explain Plan
df.explain()
- Repartitioning Data
df.repartition(10)
df.coalesce(5)
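repartition() performs a full shuffle and can increase or decrease the partition count; coalesce() avoids a shuffle and can only decrease it, which makes it the cheaper choice before writing out a small result. A minimal sketch:
print(df.rdd.getNumPartitions())   # current partition count
df10 = df.repartition(10)          # full shuffle, 10 partitions
df2 = df10.coalesce(2)             # narrow dependency, no shuffle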
PySpark when(condition).otherwise(default)
from pyspark.sql.functions import col, when
result = when(col("Age") > 16, True).otherwise(False)   # builds a Column expression
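when(...).otherwise(...) only builds a Column expression; attach it to a DataFrame with withColumn to take effect. A minimal sketch (the "Age" column is an assumption):
df = df.withColumn("is_over_16", when(col("Age") > 16, True).otherwise(False))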
Remember
In SQL, every non-aggregated column in the SELECT list must also appear in the GROUP BY clause.
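A minimal sketch against the hypothetical employees view from above:
# Valid: dept is grouped; salary appears only inside an aggregate
spark.sql("SELECT dept, MAX(salary) AS max_salary FROM employees GROUP BY dept").show()
# Invalid: salary is neither grouped nor aggregated -> AnalysisException
# spark.sql("SELECT dept, salary FROM employees GROUP BY dept")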
RANK() and DENSE_RANK() both rank rows within a result set partition. After a tie, RANK() skips the next rank value(s), leaving gaps (1, 1, 3, ...), while DENSE_RANK() leaves no gaps (1, 1, 2, ...).
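A minimal sketch of both functions over a window, reusing the hypothetical emp DataFrame:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window.partitionBy("dept").orderBy(F.col("salary").desc())
emp.withColumn("rank", F.rank().over(w)) \
   .withColumn("dense_rank", F.dense_rank().over(w)) \
   .show()
# on a tie: rank gives 1, 1, 3 while dense_rank gives 1, 1, 2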