Introduction

What is Stream Processing?

In today's world, real-time data is everywhere: viral posts, online sales, stock market ticks, card transactions, and so on. Processing this data as it arrives, rather than in periodic batches, is called stream processing.

What Are the Main Components of Stream Processing?

  1. Source:
    This is where the data starts its journey. For example, data could come from Twitter hashtags, IoT devices, or transaction logs.

  2. Broker (the catch-and-hold layer):
    Why is this needed? Because data arrives faster than it can be processed, there has to be something like a dam to buffer or hold it for a while. This is where tools like Kafka, Azure Event Hubs, and Amazon Kinesis come in: they hold the data temporarily so the processing system doesn't get overwhelmed. (A minimal producer sketch follows this list.)

  3. Processing Engine:
    The next step is processing the data, and for this you use a processing engine. Think of Spark or Flink, the superstars of data engineering. These engines take the buffered data and process it in near real-time.

  4. Storage:
    Finally, the processed results need to be stored. This is a no-brainer: it's the same kind of storage you would use anywhere else, such as a database, data lake, or plain files. Nothing special or fancy, just where the results are kept; only what gets written into it changes.
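
To make the source-to-broker step concrete, here is a minimal sketch of a producer pushing simulated card-transaction events into Kafka using the kafka-python client. The broker address, topic name, and event fields are placeholders chosen for illustration, not part of any particular setup.

```python
# pip install kafka-python
import json
import time

from kafka import KafkaProducer

# Connect to a local Kafka broker (assumed address) and serialize events as JSON.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Simulate a source emitting card-transaction events into the broker.
for i in range(10):
    event = {"transaction_id": i, "amount": 42.0, "ts": time.time()}
    producer.send("transactions", event)  # the broker buffers these events
    time.sleep(0.1)

producer.flush()  # make sure buffered messages are actually delivered
```

In a real pipeline the loop would be replaced by whatever actually generates the events (an app, a device, a log shipper); the point is simply that the source writes to the broker, not directly to the processing engine.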

Spark in Stream Processing:

When Spark is used for stream processing, it is not running in its usual batch mode over static data. Instead, it runs in streaming mode (Structured Streaming in recent versions), continuously processing incoming data as it arrives.
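
Here is a rough sketch of what that looks like with Spark Structured Streaming: it reads the same hypothetical transactions topic from Kafka, aggregates it in one-minute windows, and writes the results to storage as Parquet files. The topic, server address, and paths are assumptions, and the Kafka source requires the spark-sql-kafka connector package to be available when the job is launched.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

# Assumes the job is launched with the Kafka connector, e.g.
# spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<spark version> ...
spark = SparkSession.builder.appName("transactions-stream").getOrCreate()

# 1. Read the buffered events from the broker (Kafka) as an unbounded stream.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
    .option("subscribe", "transactions")                   # assumed topic name
    .load()
)

# 2. Kafka delivers values as bytes; cast to string and keep the event timestamp.
parsed = events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")

# 3. Process in near real-time: count events per one-minute window.
#    The watermark tells Spark how long to wait for late-arriving events.
counts = (
    parsed
    .withWatermark("timestamp", "2 minutes")
    .groupBy(window(col("timestamp"), "1 minute"))
    .count()
)

# 4. Store the results: append finalized windows as Parquet files.
query = (
    counts.writeStream
    .outputMode("append")
    .format("parquet")
    .option("path", "/tmp/transaction_counts")         # assumed output location
    .option("checkpointLocation", "/tmp/checkpoints")  # required for fault tolerance
    .start()
)

query.awaitTermination()
```

The structure mirrors the four components above: read from the broker, transform, aggregate, and land the results in storage.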

Real-Time? Is It Really Real-Time?

No, true real-time processing with zero delay is technically impossible. Even if there were no Kafka or Azure Event Hubs buffering the data between the source and the processing engine, it still couldn't be zero-second real-time; even light takes time to travel. So "real-time" in stream processing means something so fast that we humans can't notice the delay. In practice, factors like latency, network speed, and processing time mean there is always a tiny bit of delay.
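
To see that delay for yourself, one common trick is to compare each event's own timestamp with the wall-clock time at which it is actually processed. The snippet below builds on the hypothetical `parsed` stream from the earlier sketch; the column names are assumptions.

```python
from pyspark.sql.functions import col, current_timestamp

# Rough per-event lag: wall-clock time at processing minus the event's Kafka timestamp.
# Casting a timestamp to double yields epoch seconds, so the difference is in seconds.
with_lag = parsed.withColumn(
    "lag_seconds",
    current_timestamp().cast("double") - col("timestamp").cast("double"),
)
```

Even on a healthy pipeline this value is never exactly zero; "real-time" just means it stays small enough that nobody cares.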