
Suboptimal File Format Choices

Performance Impact

Queries can run 5-20x slower when analytical workloads are stored in row-based formats (CSV, JSON) instead of columnar formats (Parquet, Delta).

The Problem

Using CSV or JSON for large analytical datasets forces Spark to read entire rows even when you only need specific columns, and provides none of the compression or scan optimizations (column pruning, predicate pushdown) that columnar formats offer.

❌ Problematic Code
# BAD: Large analytical datasets in row-based formats
from pyspark.sql.functions import col, sum as spark_sum

df = spark.read.csv("10TB_dataset.csv", header=True, inferSchema=True)
# No compression, no predicate pushdown: every row is read in full,
# and schema inference alone forces an extra pass over the file
result = df.filter(col("date") > "2023-01-01").agg(spark_sum("revenue"))

Format Comparison

Format    Compression    Predicate Pushdown    Schema Evolution    ACID    Best For
CSV       No             No                    No                  No      Small datasets, exports
JSON      No             No                    No                  No      Semi-structured data
Parquet   Yes            Yes                   Limited             No      Analytics, data lakes
Delta     Yes            Yes                   Yes                 Yes     Production data lakes
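
One way to see these differences in practice is to compare query plans. A minimal sketch, assuming the same data is stored once as CSV and once as Parquet (the paths are hypothetical):

from pyspark.sql.functions import col

# Hypothetical paths: the same data stored once as CSV and once as Parquet
csv_df = spark.read.csv("s3://bucket/events-csv/", header=True, inferSchema=True)
parquet_df = spark.read.parquet("s3://bucket/events-parquet/")

def recent_revenue(df):
    return df.filter(col("date") > "2023-01-01").select("revenue")

# CSV plan: a FileScan csv that must parse every row in full
recent_revenue(csv_df).explain()

# Parquet plan: FileScan parquet with PushedFilters and a ReadSchema
# limited to the columns actually used
recent_revenue(parquet_df).explain()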

Solutions

✅ Columnar Format
# GOOD: Columnar format with compression
df = spark.read.parquet("10TB_dataset.parquet")

# Benefits:
# - 70-90% compression ratio
# - Column pruning (only read needed columns)
# - Predicate pushdown (filter at file level)
# - Fast aggregations
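
For contrast with the problematic snippet above, the same aggregation against a Parquet copy touches only the date and revenue columns and pushes the date filter down to the files; a sketch using the same column names:

from pyspark.sql.functions import col, sum as spark_sum

result = (
    spark.read.parquet("10TB_dataset.parquet")
         .filter(col("date") > "2023-01-01")                 # pushed down to row groups
         .agg(spark_sum("revenue").alias("total_revenue"))   # only two columns are read
)
result.show()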
✅ ACID Transactions
# BETTER: Delta Lake for production
df = spark.read.format("delta").load("delta-table/")

# Additional benefits:
# - Time travel (access historical versions)
# - ACID transactions (concurrent reads/writes)
# - Schema enforcement and evolution
# - Automatic file optimization
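
Time travel and table history come from the Delta Lake APIs; a minimal sketch, assuming the delta-spark package is installed and delta-table/ already has more than one version:

from delta.tables import DeltaTable

# Read the table as of an earlier version (time travel)
old_df = (
    spark.read.format("delta")
         .option("versionAsOf", 0)
         .load("delta-table/")
)

# Inspect the commit history: version, timestamp, operation, ...
DeltaTable.forPath(spark, "delta-table/").history().show(truncate=False)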
✅ Format Migration Utility
def migrate_to_optimized_format(source_path, target_path, format_type="delta"):
    """Migrate data to optimized format with partitioning"""

    # Read source data (schema inference disabled, so every column arrives as a
    # string; supply an explicit schema in production to get proper types)
    df = spark.read.option("inferSchema", "false").csv(source_path, header=True)

    # Add partitioning columns if a date column exists
    if "date" in df.columns:
        from pyspark.sql.functions import col, year, month
        df = df.withColumn("year", year(col("date"))) \
               .withColumn("month", month(col("date")))
        partition_cols = ["year", "month"]
    else:
        partition_cols = None

    # Write in optimized format
    writer = df.write.mode("overwrite")

    if partition_cols:
        writer = writer.partitionBy(*partition_cols)

    if format_type == "delta":
        writer.format("delta").save(target_path)
    else:
        writer.parquet(target_path)

    print(f"Migration complete: {format_type} format at {target_path}")

    # Report compression ratio (get_directory_size is a helper; see the sketch after the usage example)
    original_size = get_directory_size(source_path)
    new_size = get_directory_size(target_path)
    compression_ratio = original_size / new_size
    print(f"Compression ratio: {compression_ratio:.1f}x smaller")

# Usage
migrate_to_optimized_format(
    source_path="s3://bucket/csv-data/",
    target_path="s3://bucket/delta-data/",
    format_type="delta"
)
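
get_directory_size above is not a Spark built-in. One possible implementation goes through the Hadoop FileSystem API exposed by Spark's JVM gateway; treat this as a sketch, since _jvm and _jsc are internal interfaces:

def get_directory_size(path):
    """Total size in bytes of all files under path (Hadoop FileSystem API)."""
    jvm = spark._jvm
    hadoop_conf = spark._jsc.hadoopConfiguration()
    fs_path = jvm.org.apache.hadoop.fs.Path(path)
    fs = fs_path.getFileSystem(hadoop_conf)
    return fs.getContentSummary(fs_path).getLength()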
✅ Choose Right Format
def recommend_file_format(use_case, data_size_gb, update_frequency):
    """Recommend optimal file format based on requirements"""

    if use_case == "analytics" and data_size_gb > 1:
        if update_frequency == "frequent":
            return "delta"  # ACID transactions needed
        else:
            return "parquet"  # Read-optimized

    elif use_case == "streaming":
        return "delta"  # Schema evolution, ACID

    elif use_case == "ml_training":
        return "parquet"  # Fast column access

    elif data_size_gb < 0.1:
        return "csv"  # Small data, simplicity matters

    else:
        return "parquet"  # Safe default for most cases

# Usage
format_choice = recommend_file_format(
    use_case="analytics",
    data_size_gb=100,
    update_frequency="daily"
)
print(f"Recommended format: {format_choice}")

Key Takeaways

Format Selection Rules

  • Analytics workloads: Use Parquet or Delta
  • Production systems: Use Delta for ACID guarantees
  • Small datasets (< 100MB): CSV is acceptable
  • Schema evolution needs: Avoid CSV; Delta can evolve the schema on write (see the sketch below)
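
A minimal sketch of schema evolution with Delta, assuming an existing table at delta-table/ and a new batch that adds a discount column (mergeSchema is a Delta write option):

# New batch with an extra column not yet in the table
new_batch = spark.createDataFrame(
    [("2024-01-01", 120.0, 0.10)],
    ["date", "revenue", "discount"],
)

(new_batch.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # allow the new column to be added to the schema
    .save("delta-table/"))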

Measuring Impact

Storage & Performance Comparison
# CSV (1TB) → Parquet (200GB) → Delta (180GB)
# Query Speed: CSV baseline → Parquet 10x faster → Delta 12x faster
# Storage Cost: CSV baseline → Parquet 80% savings → Delta 82% savings
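
To reproduce numbers like these on your own data, time the same query against each copy and compare directory sizes. A rough sketch (paths hypothetical; reuses the get_directory_size helper from the migration section, and wall-clock timing is noisy, so run each query several times):

import time
from pyspark.sql.functions import col

def time_query(df):
    """Rough wall-clock timing of one representative aggregation."""
    start = time.time()
    df.filter(col("date") > "2023-01-01").agg({"revenue": "sum"}).collect()
    return time.time() - start

csv_seconds = time_query(spark.read.csv("s3://bucket/csv-data/", header=True, inferSchema=True))
parquet_seconds = time_query(spark.read.parquet("s3://bucket/parquet-data/"))

print(f"Query speedup: {csv_seconds / parquet_seconds:.1f}x")
print(f"Storage ratio: {get_directory_size('s3://bucket/csv-data/') / get_directory_size('s3://bucket/parquet-data/'):.1f}x")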

File format choice is one of the highest-impact optimizations you can make. The migration effort pays for itself within days through improved performance and reduced storage costs.