Suboptimal File Format Choices¶
Performance Impact
5-20x slower queries - Using row-based formats for analytical workloads instead of columnar formats.
The Problem¶
Using CSV or JSON for large analytical datasets forces Spark to read and parse entire rows even when a query only needs a few columns, and these row-based formats provide no columnar compression, column statistics, or predicate pushdown.
❌ Problematic Code
# BAD: Large analytical datasets in row-based formats
from pyspark.sql.functions import col, sum as spark_sum

df = spark.read.csv("10TB_dataset.csv", header=True)
# No compression, no predicate pushdown: every row is parsed and every column is read
result = df.filter(col("date") > "2023-01-01").agg(spark_sum("revenue"))
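For contrast, the same query against a columnar copy of the data lets Spark read only the columns it needs and skip row groups that fail the filter. This is a minimal sketch that assumes a Parquet copy of the dataset exists at a hypothetical path:

# GOOD: columnar format enables column pruning and predicate pushdown
from pyspark.sql.functions import col, sum as spark_sum

df = spark.read.parquet("s3://bucket/10TB_dataset_parquet/")  # hypothetical path
result = (
    df.filter(col("date") > "2023-01-01")   # pushed down to the Parquet reader
      .select("revenue")                    # only this column is read from disk
      .agg(spark_sum("revenue"))
)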
Format Comparison¶
| Format | Compression | Predicate Pushdown | Schema Evolution | ACID | Best For |
|---|---|---|---|---|---|
| CSV | ❌ | ❌ | ❌ | ❌ | Small datasets, exports |
| JSON | ❌ | ❌ | ✅ | ❌ | Semi-structured data |
| Parquet | ✅ | ✅ | ✅ | ❌ | Analytics, data lakes |
| Delta | ✅ | ✅ | ✅ | ✅ | Production data lakes |
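To make the comparison concrete, here is a minimal sketch of writing the same DataFrame in each columnar format. The paths are illustrative, and the Delta write assumes the delta-spark package is configured on the cluster:

# Parquet: columnar, compressed, read-optimized
df.write.mode("overwrite").parquet("s3://bucket/events_parquet/")

# Delta: Parquet files plus a transaction log for ACID guarantees
df.write.mode("overwrite").format("delta").save("s3://bucket/events_delta/")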
Solutions¶
✅ Format Migration Utility
def migrate_to_optimized_format(source_path, target_path, format_type="delta"):
    """Migrate data to an optimized format with partitioning"""
    from pyspark.sql.functions import col, year, month

    # Read source data; with inference disabled every column arrives as a string,
    # so supply an explicit schema or cast columns as needed
    df = spark.read.option("inferSchema", "false").csv(source_path, header=True)

    # Add partitioning columns if a date column exists
    if "date" in df.columns:
        df = df.withColumn("year", year(col("date"))) \
               .withColumn("month", month(col("date")))
        partition_cols = ["year", "month"]
    else:
        partition_cols = None

    # Write in the optimized format
    writer = df.write.mode("overwrite")
    if partition_cols:
        writer = writer.partitionBy(*partition_cols)

    if format_type == "delta":
        writer.format("delta").save(target_path)
    else:
        writer.parquet(target_path)

    print(f"Migration complete: {format_type} format at {target_path}")

    # Report compression ratio (get_directory_size is a helper; see the sketch below)
    original_size = get_directory_size(source_path)
    new_size = get_directory_size(target_path)
    compression_ratio = original_size / new_size
    print(f"Compression ratio: {compression_ratio:.1f}x smaller")
# Usage
migrate_to_optimized_format(
    source_path="s3://bucket/csv-data/",
    target_path="s3://bucket/delta-data/",
    format_type="delta"
)
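The utility above calls get_directory_size, which is not a Spark built-in. A minimal sketch using Hadoop's FileSystem API through Spark's JVM gateway could look like this; it relies on internal _jsc/_jvm handles and assumes the path lives on a Hadoop-compatible filesystem (S3A, HDFS, ABFS, etc.):

def get_directory_size(path):
    """Total size in bytes of all files under path (Hadoop-compatible filesystems)."""
    jvm = spark.sparkContext._jvm
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_path = jvm.org.apache.hadoop.fs.Path(path)
    fs = hadoop_path.getFileSystem(hadoop_conf)
    return fs.getContentSummary(hadoop_path).getLength()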
✅ Choose the Right Format
def recommend_file_format(use_case, data_size_gb, update_frequency):
    """Recommend an optimal file format based on requirements"""
    if use_case == "analytics" and data_size_gb > 1:
        if update_frequency == "frequent":
            return "delta"    # ACID transactions needed
        else:
            return "parquet"  # Read-optimized
    elif use_case == "streaming":
        return "delta"        # Schema evolution, ACID
    elif use_case == "ml_training":
        return "parquet"      # Fast column access
    elif data_size_gb < 0.1:
        return "csv"          # Small data, simplicity matters
    else:
        return "parquet"      # Safe default for most cases
# Usage
format_choice = recommend_file_format(
    use_case="analytics",
    data_size_gb=100,
    update_frequency="daily"
)
print(f"Recommended format: {format_choice}")
Key Takeaways¶
Format Selection Rules
- Analytics workloads: Use Parquet or Delta
- Production systems: Use Delta for ACID guarantees
- Small datasets (< 100MB): CSV is acceptable
- Schema evolution needs: Avoid CSV
Measuring Impact¶
Storage & Performance Comparison
# Illustrative results for a 1TB dataset:
# Storage:      CSV 1TB → Parquet ~200GB → Delta ~180GB
# Query speed:  CSV baseline → Parquet ~10x faster → Delta ~12x faster
# Storage cost: CSV baseline → Parquet ~80% savings → Delta ~82% savings
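To measure the impact on your own data rather than rely on the figures above, a rough benchmark sketch like the following times the same aggregation against each copy of the dataset. The paths are illustrative, and wall-clock results vary with cluster size, caching, and data layout:

import time
from pyspark.sql.functions import col, sum as spark_sum

def time_revenue_query(path, fmt):
    """Run the same filter + aggregation against one copy of the data and time it."""
    reader = spark.read.format(fmt)
    if fmt == "csv":
        reader = reader.option("header", True).option("inferSchema", True)
    df = reader.load(path)
    start = time.time()
    df.filter(col("date") > "2023-01-01").agg(spark_sum("revenue")).collect()
    return time.time() - start

for fmt, path in [("csv", "s3://bucket/csv-data/"),
                  ("parquet", "s3://bucket/parquet-data/"),
                  ("delta", "s3://bucket/delta-data/")]:
    print(f"{fmt}: {time_revenue_query(path, fmt):.1f}s")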
File format choice is one of the highest-impact optimizations you can make: the one-time migration effort typically pays for itself quickly through faster queries and lower storage costs.