Guide

Stream processing explained

A logistics platform ingests 40,000 GPS pings per second. Fraud analysts need rolling spend totals per card within five minutes of each swipe. Dashboards must show live order counts, not yesterday's warehouse snapshot. Stream processing is the discipline of computing on unbounded event sequences as they arrive: filter, aggregate, join, and emit results continuously instead of waiting for a nightly batch. The hard parts are time (when did the event really happen?), windows (what interval are we counting?), state (how do we remember each user's running total?), and failure (how do we restart without double-counting?). This guide covers bounded vs unbounded data, event time vs processing time, window types, watermarks and late events, stateful operations and delivery semantics, a comparison of Kafka-native engines vs Flink and Spark Structured Streaming, a Harbor Fleet card-fraud scorer worked example, an engine decision table, common pitfalls, and a production checklist — the conceptual layer beneath our event-driven architecture guide.

Batch vs stream: two shapes of data

Bounded data has a known end: a CSV export, a day's Parquet partition, a database snapshot. Batch engines ( Spark, Hive, BigQuery) read the full dataset, shuffle, aggregate, and finish. Latency is measured in minutes to hours; throughput per dollar is excellent for terabyte scans.

Unbounded data never ends: clickstreams, sensor telemetry, payment authorizations, CDC changelogs from Postgres. A stream processor treats the feed as infinite: each record arrives, transforms, and may update keyed state that lives for hours or forever. Latency targets are milliseconds to low seconds. Many production pipelines are lambda or kappa hybrids: the stream path serves real-time dashboards and alerts; a slower batch path reconciles totals and backfills history.

When streaming wins

Choose stream processing when stale data has a direct cost: fraud blocks, inventory reservations, rate limiting, live ops dashboards, or personalization that must react within a session. Stay batch-first when correctness tolerates hours of delay, when queries scan years of cold storage, or when team expertise is SQL-on-Parquet only. Micro-batching (Spark Structured Streaming's default) blurs the line — treat trigger interval as a latency knob, not a category label.

Three clocks: event, ingestion, and processing time

Every event carries timestamps, and confusing them causes silent bugs.

  • Event time — when the business action occurred (the swipe timestamp from the POS terminal). This is usually the right basis for analytics and billing.
  • Ingestion time — when the broker or collector received the record. Useful when event clocks are untrusted or missing.
  • Processing time — when the stream operator runs. Simple to implement but non-deterministic: a network delay moves an event into the wrong minute bucket on replay.

Production pipelines on Kafka almost always window by event time and use watermarks to bound lateness. A watermark at 10:05:30 means "we believe no more events with event time before 10:05:30 will arrive." When the watermark passes the end of a tumbling window, that window closes and emits its aggregate. Too-aggressive watermarks drop late events; too-conservative watermarks delay alerts.

Windows: tumbling, sliding, and session

Windows group events for aggregation. Pick the type that matches the business question.

Tumbling windows

Fixed, non-overlapping slices: orders per minute, errors per hour. A 5-minute tumbling window on event time produces one count every five minutes per key. Simple and idempotent-friendly.

Sliding windows

Overlapping intervals: "last 15 minutes of revenue, updated every minute." Each event may belong to multiple windows. Higher state cost; use when the question is explicitly rolling.

Session windows

Gap-based: if no click arrives for 30 minutes, the session closes. Length varies per user. Ideal for clickstream funnels and support-chat duration; harder to reconcile with batch SQL that expects fixed buckets.

Allowed lateness (Flink terminology; analogs exist elsewhere) sends very late events to a side output instead of discarding them — essential when mobile clients buffer offline events and sync hours later.

Stateful operations and joins

Stateless map/filter scale linearly. Real pipelines need keyed state: running sums per user_id, deduplication sets, session buffers, or join state when correlating clicks with purchases.

  • AggregationsCOUNT, SUM, approximate distinct (HyperLogLog), percentile sketches.
  • Stream-table joins — enrich events with dimension rows (product catalog) loaded from JDBC or a compacted Kafka topic.
  • Stream-stream joins — match click to checkout within a time bound; both sides buffer until the watermark allows eviction.

State must be fault-tolerant. Engines snapshot keyed state to S3 or HDFS on checkpoint barriers and store Kafka offsets alongside. On restart, replay from the last checkpoint plus log compaction yields consistent totals — the foundation of exactly-once semantics when paired with idempotent sinks.

Delivery semantics: at-most, at-least, exactly once

Networks and crashes guarantee nothing unless you engineer it.

  • At-most-once — fire and forget; may lose events on failure. Acceptable for metrics where approximate is fine.
  • At-least-once — retry until ack; duplicates possible. Consumers must be idempotent (upsert by event ID, monotonic counters with dedup store).
  • Exactly-once — transactional writes align checkpointed state with sink commits (Kafka transactions, Flink two-phase commit sinks). End-to-end exactly-once requires cooperative sources and sinks; external side effects (email, HTTP) still need idempotency keys.

Most teams implement effectively-once: at-least-once delivery plus idempotent sinks and deterministic aggregations. That is simpler to operate than full Kafka EOS across dozens of microservices.

Engine landscape: Kafka Streams, Flink, Spark

All three consume from Kafka; they differ in deployment model, latency, and API style.

  • Kafka Streams — library embedded in your JVM service; no separate cluster. Great for per-partition stateful transforms, repartition topics, and lightweight enrichment. See our dedicated Kafka guide for consumer groups; Streams builds on the same log model.
  • Apache Flink — dedicated cluster (JobManager + TaskManagers); record-by-record event-time windows, large keyed state in RocksDB, CEP, async I/O. Best sub-second latency at high state cardinality. Deep dive in our Flink guide.
  • Spark Structured Streaming — micro-batch default with optional continuous processing; unified batch and stream SQL; excellent when the same team already runs nightly Spark ETL. See our Spark guide.

Managed offerings (Confluent Cloud ksqlDB, Amazon Managed Flink, Databricks) trade operational burden for cost predictability. Start with Kafka Streams inside an existing service for one topology; graduate to Flink when latency, state size, or SQL-free CEP rules outgrow embedded libraries.

Worked example: Harbor Fleet card fraud scorer

Harbor Fleet processes card authorizations on topic payments.auth (key = card_token, event time = terminal timestamp). Requirements: flag cards with more than three declined attempts in any rolling 10-minute window, or total spend above $2,000 in 60 minutes across all Harbor merchants.

  1. Ingest — Producers write JSON with status, amount_usd, merchant_id. Partition by card_token so all events for one card land on one partition.
  2. Decline counter — Flink keyBy(card_token), window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1))), count where status=DECLINED, emit alert if count > 3.
  3. Spend accumulator — separate keyed operator with 60-minute tumbling event-time window, SUM(amount_usd) for status=APPROVED.
  4. WatermarksforBoundedOutOfOrderness(45s) after measuring mobile terminal clock skew; side output for events > 2 minutes late to a reconciliation topic.
  5. Sink — alerts to fraud.alerts Kafka topic consumed by the block-list API; checkpoint to S3 every 60s; RocksDB state backend.
  6. Reconcile — nightly Spark job on Parquet archive compares stream totals to warehouse sums; drift > 0.1% pages on-call.

Decline detection must fire within seconds; spend caps tolerate slightly higher latency. Two operators with different window configs share the same source via broadcast or multi-sink topology rather than one overloaded job.

Engine decision table

EngineBest forTypical latencyOps burdenState scale
Kafka StreamsEmbedded enrichment, repartition, light aggregationsLow msLow (no cluster)Moderate per service
ksqlDB / Flink SQLDeclarative filters and tumbling SQLLow–mediumMediumHigh
Apache FlinkCEP, async I/O, large keyed state, event timeSub-secondMedium–highVery high
Spark Structured StreamingUnified batch+stream, lakehouse SQLSeconds (micro-batch)Medium (existing Spark ops)High
Managed Flink / ConfluentSmall teams, SLA-heavy productionLowLow (vendor)High

Common pitfalls

  • Processing-time windows — replay and lag shift bucket boundaries; use event time.
  • Wrong partition key — state splits across subtasks; per-user totals become wrong.
  • Watermark guesswork — no measurement of p99 lateness; either drop revenue or delay alerts.
  • Unbounded state — never evicting join buffers OOMs TaskManagers.
  • Non-idempotent sinks — at-least-once retries double-charge or duplicate alerts.
  • Ignoring backpressure — consumer lag grows until Kafka retention deletes unread data.
  • Stream-only reconciliation — no batch cross-check; silent drift erodes trust.
  • God topology — one giant job couples unrelated SLAs; split by failure domain.

Production checklist

  • Define business SLAs (latency, completeness) before picking an engine.
  • Instrument event-time skew and ingestion delay; set watermarks from data.
  • Partition Kafka topics so keyed state maps cleanly to parallelism.
  • Enable checkpointing with tested restore drills (chaos-kill TaskManagers).
  • Make sinks idempotent; use dedup keys or upsert semantics.
  • Route late events to side outputs; reconcile with batch nightly.
  • Monitor consumer lag, checkpoint duration, and backpressure per operator.
  • Version jobs with savepoints before schema or logic upgrades.
  • Document delivery semantics end-to-end; mark external side effects at-least-once.
  • Alert on lag growth, checkpoint failures, and reconciliation drift thresholds.

Key takeaways

  • Stream processing computes on unbounded feeds with low latency; batch handles bounded history.
  • Event time + watermarks produce correct windows despite network delay and skew.
  • Keyed state powers aggregations and joins; checkpoints enable recovery.
  • Exactly-once needs transactional alignment or idempotent sinks at boundaries.
  • Match engine to ops maturity: Kafka Streams for embedded, Flink for hard real-time, Spark for unified lakehouse teams.

Related reading