Background

In Spark, transformations are classified into two types: narrow and wide.

Narrow Transformations

Definition: Each input partition contributes to only one output partition.
Example Operations: map(), filter(), select()
Characteristics:
- No data shuffling across the network.
- Fast and efficient.
- Data can be processed in parallel without dependencies on other partitions.

Remember:
When we say "each input partition contributes to only one output partition," we mean that in a narrow transformation, the data from an input partition is processed and stored in exactly one corresponding output partition without the need to shuffle or move data between different partitions.

Example

Consider a simple filter() operation that filters out rows based on some condition:

# Assume we have an RDD with 4 partitions
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8], 4)

# Apply a filter transformation
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)

Input Partitions: [ [1, 2], [3, 4], [5, 6], [7, 8] ]
Filter Condition: Keep only even numbers
Output Partitions:
- Partition 1 (Input: [1, 2]) → Partition 1 (Output: [2])
- Partition 2 (Input: [3, 4]) → Partition 2 (Output: [4])
- Partition 3 (Input: [5, 6]) → Partition 3 (Output: [6])
- Partition 4 (Input: [7, 8]) → Partition 4 (Output: [8])

Each input partition (e.g., [1, 2]) is processed and produces output that remains within the same partition (e.g., [2]). There is no data movement between partitions.

Wide Transformations

Definition: Each input partition contributes to multiple output partitions.
Example Operations: reduceByKey(), groupBy(), join()
Characteristics:
- Data is shuffled across the network.
- Slower compared to narrow transformations due to data movement.
- Requires synchronization between partitions.

Example

Now, consider a reduceByKey() operation which involves shuffling:

# Assume we have an RDD with key-value pairs
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6), (1, 4)], 2)

# Apply a reduceByKey transformation
reduced_rdd = rdd.reduceByKey(lambda x, y: x + y)

Input Partitions: [ [(1, 2), (3, 4)], [(3, 6), (1, 4)] ]
Reduce by Key: Combine values with the same key
Output Partitions:
- Partition 1 (Input: [(1, 2), (1, 4)]) → Partition 1 (Output: [(1, 6)])
- Partition 2 (Input: [(3, 4), (3, 6)]) → Partition 2 (Output: [(3, 10)])

In this case, keys with the same value (e.g., key = 1 or key = 3) need to be brought together into the same partition to perform the reduction. This requires shuffling data across the network, making it a wide transformation.

Key Differences

Data Movement: Narrow transformations do not shuffle data, while wide transformations do. Performance: Narrow transformations are faster and more efficient because there’s no data movement. Parallelism: Narrow transformations are easier to process in parallel, while wide transformations need to handle data dependencies and synchronization.