Message queues explained: Kafka, RabbitMQ, SQS, and delivery guarantees

When one service needs to tell another that work is ready — send a receipt email, resize an uploaded image, settle a payment — synchronous HTTP calls create tight coupling and fragile failure modes. A message queue (or message broker) sits between producers and consumers: the sender drops a message and moves on; workers pull jobs at their own pace, retry on failure, and scale horizontally. This guide explains queue vs topic semantics, the delivery guarantees brokers actually provide, how Kafka, RabbitMQ, and Amazon SQS differ in practice, and the idempotent consumer patterns that keep async systems correct when messages are delivered more than once. For the broader architectural picture, pair this with our event-driven architecture guide.

Why use a message queue?

Direct HTTP between microservices works until traffic spikes, a downstream service slows down, or a deploy takes a dependency offline. The caller blocks, threads pile up, and cascading timeouts take down services that were otherwise healthy. Queues introduce a buffer: producers enqueue work instantly; consumers drain the backlog when they have capacity.

Queues also decouple teams and deploy schedules. The billing service can publish InvoiceCreated today; the analytics team can add a new consumer next month without changing billing code. Peak shaving matters too — Black Friday order spikes land in a durable queue instead of overwhelming a fragile email API with ten thousand simultaneous connections.

When queues are the wrong tool

Not every interaction should be async. User-facing flows that need an immediate answer ("Did my payment go through?") belong in synchronous APIs or webhooks with clear status polling. Queues add latency, operational complexity, and ordering puzzles. Use them when you can tolerate seconds (or minutes) of delay, when work must survive restarts, or when multiple independent handlers should react to the same event.

Point-to-point vs publish/subscribe

In a point-to-point (work-queue) model, each message is consumed by exactly one worker. Ten enqueued thumbnail jobs are spread across whichever workers are available, and each job goes to one of them: classic load balancing. Amazon SQS standard queues and RabbitMQ work queues behave this way.

In publish/subscribe (pub/sub), one published message is delivered to every subscriber. An OrderPlaced event might trigger inventory reservation, fraud scoring, and a marketing email — three separate consumers, one copy of the event each. Kafka topics and RabbitMQ fanout/topic exchanges support this pattern.

Many systems combine both: a Kafka topic is pub/sub across consumer groups, but within a group each partition is assigned to exactly one member, so the group as a whole behaves like a work queue. Understanding which semantics your broker provides prevents bugs like "two workers both charged the customer" or "nobody processed the refund."
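
The difference between the two models can be shown with a small in-memory sketch. It does not use a real broker; `work_queue`, `pub_sub`, and the worker and subscriber names are hypothetical, and round-robin dispatch is only one of several ways a broker might distribute work.

```python
from collections import deque
from itertools import cycle


def work_queue(messages, workers):
    """Point-to-point: each message goes to exactly one worker (round-robin here)."""
    received = {w: [] for w in workers}
    pending = deque(messages)
    for worker in cycle(workers):
        if not pending:
            break
        received[worker].append(pending.popleft())
    return received


def pub_sub(messages, subscribers):
    """Publish/subscribe: every subscriber gets its own copy of every message."""
    return {s: list(messages) for s in subscribers}


jobs = ["thumb-1", "thumb-2", "thumb-3", "thumb-4"]
print(work_queue(jobs, ["worker-a", "worker-b"]))
print(pub_sub(["OrderPlaced#17"], ["inventory", "fraud", "email"]))
```

In the first call the four jobs are split between two workers; in the second, all three subscribers see the same event.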

Delivery guarantees: at-most, at-least, and exactly once

Brokers advertise delivery semantics in marketing copy; production reality is narrower. Design for what the infrastructure can prove, not what the brochure claims.

At-most-once

Messages may be lost but are never duplicated: fire-and-forget, UDP-style semantics. It is rare as an explicit choice, because you accept data loss to avoid duplicate side effects. Some telemetry pipelines use at-most-once when missing a metric point is cheaper than double-counting.

At-least-once (the common default)

Messages are persisted until acknowledged, but network failures and consumer crashes mean the same message can arrive twice. SQS, RabbitMQ with manual acks, and Kafka consumers that commit offsets after processing all operate here in practice. Your handler must be idempotent: processing the same payment_id twice must not double-charge. Store processed message IDs, use upserts with natural keys, or rely on database unique constraints.

Exactly-once (limited scope)

True end-to-end exactly-once across arbitrary services is effectively impossible without distributed transactions. Kafka offers "exactly-once" for consume-transform-produce pipelines that stay inside Kafka: idempotent producers plus transactions let a consumer's offset commit and its output records succeed or fail together. That guarantee does not extend to an external database or API, so writes outside Kafka still need idempotency or their own transactional scheme. It is powerful but operationally heavy. Most teams implement effectively-once: at-least-once delivery plus idempotent consumers and deduplication tables. The saga and transactional outbox explainer covers how to publish messages only after your database commit succeeds.
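
As a rough illustration of publishing only after the database commit succeeds, here is a minimal outbox sketch using SQLite from the Python standard library. The table names, `create_invoice`, and `relay` are hypothetical, and the `publish` callback stands in for whatever broker client you use.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE invoices (id TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL);
    CREATE TABLE outbox (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def create_invoice(invoice_id, amount_cents):
    # The business row and the outgoing event commit together, or not at all.
    with db:
        db.execute("INSERT INTO invoices VALUES (?, ?)", (invoice_id, amount_cents))
        event = {"type": "InvoiceCreated", "invoice_id": invoice_id}
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay(publish):
    # Publish unpublished rows in order; mark each one only after publish returns.
    rows = db.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, payload in rows:
        publish(json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


sent = []  # stand-in for a broker
create_invoice("inv-1", 1999)
relay(sent.append)
print(sent)
```

If the relay crashes between publishing and marking the row, the event is published again on the next run, so this sketch is still at-least-once and consumers must deduplicate.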

Kafka: logs, partitions, and consumer groups

Apache Kafka stores messages in an append-only log divided into partitions. Producers write to a topic; each partition is an ordered sequence. Ordering is guaranteed within a partition, not across the whole topic — so messages that must stay in order (all events for user_id=42) should share a partition key.
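
A simplified sketch of key-based partitioning follows. Kafka's Java client hashes keys with murmur2 by default; CRC32 is used here only as a stable stand-in from the standard library, and `partition_for` is a hypothetical helper, not a Kafka API.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [("user-42", "login"), ("user-7", "login"), ("user-42", "purchase")]
for key, event in events:
    # Both user-42 events land on the same partition, so their order is preserved.
    print(key, event, "-> partition", partition_for(key, 12))
```

Because the partition is the hash modulo the partition count, changing the count can move an existing key to a different partition.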

Consumer groups divide partitions among group members. If a topic has twelve partitions and four consumers in one group, each consumer reads three partitions. Add a fifth consumer and the group rebalances so the work is spread five ways; only consumers beyond the partition count (a thirteenth here) sit idle until you add partitions. Plan partition count for peak parallelism upfront: increasing partitions later changes the key-to-partition mapping, which can break key-based ordering for existing keys.
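
The arithmetic of group assignment can be sketched as below. This is a simplified round-robin model, not any of Kafka's actual assignor implementations, and `assign_partitions` is a hypothetical name. With twelve partitions and four consumers each member gets three; a member ends up with nothing only when the group has more members than the topic has partitions.

```python
def assign_partitions(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Spread partitions over consumers round-robin; surplus consumers stay empty."""
    if not consumers:
        return {}
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


four = assign_partitions(12, ["c1", "c2", "c3", "c4"])
print(four)  # each consumer holds three partitions

thirteen = assign_partitions(12, [f"c{i}" for i in range(1, 14)])
idle = [c for c, parts in thirteen.items() if not parts]
print(idle)  # only the thirteenth consumer has nothing to read
```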

Kafka retains messages for a configurable period (days or weeks), so consumers can rewind and replay — invaluable for bug fixes and new downstream services, dangerous if consumers are not idempotent. Offsets track read position; committing offsets before finishing work risks loss on crash; committing after risks duplicates on crash. Pick your poison and code accordingly.
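
The commit-ordering trade-off can be simulated without a broker. In this sketch `run` is a hypothetical function that replays a list of messages, injects one crash at a chosen offset, and resumes from the last committed offset, which is roughly what a restarted consumer does.

```python
def run(messages, commit_first, crash_at):
    """Replay a log with one simulated crash; return what the handler processed."""
    committed = 0    # offset stored with the broker
    handled = []     # side effects that actually happened
    crashed = False
    while committed < len(messages):
        offset = committed  # after a restart, resume from the committed offset
        try:
            if commit_first:
                committed = offset + 1
                if offset == crash_at and not crashed:
                    crashed = True
                    raise RuntimeError("crash after commit, before processing")
                handled.append(messages[offset])
            else:
                handled.append(messages[offset])
                if offset == crash_at and not crashed:
                    crashed = True
                    raise RuntimeError("crash after processing, before commit")
                committed = offset + 1
        except RuntimeError:
            continue  # the consumer restarts
    return handled


log = ["m0", "m1", "m2"]
print(run(log, commit_first=True, crash_at=1))   # ['m0', 'm2'] -> m1 is lost
print(run(log, commit_first=False, crash_at=1))  # ['m0', 'm1', 'm1', 'm2'] -> m1 is duplicated
```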

Kafka strengths and trade-offs

  • High throughput and horizontal scale for event streams and log pipelines
  • Durable replay and multiple independent consumer groups on one topic
  • Operational overhead: ZooKeeper or KRaft, broker tuning, partition rebalancing
  • Overkill for simple task queues with low volume and few consumers

RabbitMQ: exchanges, routing, and flexible topologies

RabbitMQ routes messages through exchanges to queues. Producers publish to an exchange; bindings (with optional routing keys) determine which queues receive copies. This indirection supports work queues, pub/sub, and complex routing in one broker.

  • Direct exchange — route by exact routing key; good for task types
  • Fanout exchange — broadcast to every bound queue; classic pub/sub
  • Topic exchange — pattern matching on keys like orders.*.paid
  • Headers exchange — route on message header attributes
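
To make topic-exchange patterns concrete, here is a small matcher written from the AMQP rules as generally documented: routing keys are dot-separated words, `*` matches exactly one word, and `#` matches zero or more words. `topic_matches` is a hypothetical helper for illustration, not part of RabbitMQ or any client library.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """AMQP-style topic match: '*' is exactly one word, '#' is zero or more words."""

    def match(p, k):
        if not p:
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' either absorbs no words, or absorbs one word and stays active.
            return match(rest, k) or (bool(k) and match(p, k[1:]))
        if not k:
            return False
        return (head == "*" or head == k[0]) and match(rest, k[1:])

    return match(pattern.split("."), routing_key.split("."))


print(topic_matches("orders.*.paid", "orders.eu.paid"))       # True
print(topic_matches("orders.*.paid", "orders.eu.west.paid"))  # False
print(topic_matches("orders.#", "orders.eu.west.paid"))       # True
```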

Consumers acknowledge (ack) messages after successful processing. Negative ack with requeue sends the message back — useful for transient failures, dangerous if the message itself is poison (malformed payload). Set a delivery limit or use a dead-letter exchange (DLX) to move failed messages aside after N attempts instead of infinite retry loops.

RabbitMQ fits teams that want flexible routing, moderate throughput, and a mature AMQP ecosystem without operating a distributed log. Memory and disk alarms matter: when the broker hits resource limits it blocks publishers — monitor queue depth and consumer lag through your observability stack.

Amazon SQS: managed queues with visibility timeouts

SQS is fully managed: no brokers to patch, pay per request, integrate natively with Lambda, ECS, and SNS. Two queue types matter:

  • Standard queues — nearly unlimited throughput, best-effort ordering, at-least-once delivery
  • FIFO queues — strict ordering within a message group, exactly-once processing via deduplication IDs (within a five-minute window)

When a consumer receives a message, it becomes invisible to other consumers for the visibility timeout. If the worker crashes before deleting the message, it reappears for another attempt. Set the timeout longer than your p99 processing time but short enough that stuck jobs retry reasonably fast.
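
The visibility-timeout mechanism can be modelled with a toy in-memory queue and a manual clock. `VisibilityQueue` is a hypothetical class for illustration and does not mirror the SQS API; a real consumer would call the service through an AWS SDK.

```python
class VisibilityQueue:
    """Toy in-memory queue with an SQS-like visibility timeout and a manual clock."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.now = 0.0
        self._messages = {}  # message id -> (body, time at which it is visible again)

    def send(self, message_id, body):
        self._messages[message_id] = (body, 0.0)

    def receive(self):
        for message_id, (body, visible_at) in self._messages.items():
            if visible_at <= self.now:
                # Hide the message from other consumers until the timeout expires.
                self._messages[message_id] = (body, self.now + self.visibility_timeout)
                return message_id, body
        return None

    def delete(self, message_id):
        self._messages.pop(message_id, None)

    def advance(self, seconds):
        self.now += seconds


q = VisibilityQueue(visibility_timeout=30)
q.send("m1", "resize image 17")
first = q.receive()   # worker A takes the message, then crashes without deleting it
hidden = q.receive()  # worker B sees nothing while the message is in flight
q.advance(31)
retry = q.receive()   # the timeout has elapsed, so the message is visible again
q.delete("m1")        # worker B finishes and deletes it
print(first, hidden, retry)
```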

Pair SQS with a dead-letter queue (DLQ) after a maxReceiveCount threshold. DLQs are where you inspect poison messages, fix bugs, and replay manually — never let failed messages disappear silently. For fan-out, SNS can publish to multiple SQS queues or Lambda functions in one shot.

Dead-letter queues and poison messages

A poison message fails every processing attempt — bad JSON, a reference to a deleted record, a bug in the handler. Without a DLQ, it blocks the queue (in FIFO), burns retry budget, or loops forever. Route failures to a DLQ after a fixed number of attempts; alert on DLQ depth; build a replay tool that fixes the payload or code and re-injects messages.
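
A minimal sketch of the retry-then-dead-letter flow, using an in-memory list in place of a broker. `drain`, `parse_order`, and the sample payloads are hypothetical; real brokers track the attempt count for you (for example through a receive count or delivery limit).

```python
import json
from collections import deque


def drain(queue, handler, max_attempts=3):
    """Process messages; after max_attempts failures, move a message to the DLQ."""
    dead_letters = []
    attempts = {}
    pending = deque(queue)
    while pending:
        message = pending.popleft()
        try:
            handler(message)
        except Exception as exc:
            attempts[message] = attempts.get(message, 0) + 1
            if attempts[message] >= max_attempts:
                # Keep the raw payload and the error for later inspection and replay.
                dead_letters.append((message, repr(exc)))
            else:
                pending.append(message)  # requeue for another attempt
    return dead_letters


processed = []


def parse_order(raw):
    processed.append(json.loads(raw)["id"])


dlq = drain(['{"id": 1}', '{not json', '{"id": 2}'], parse_order)
print(processed)  # [1, 2]
print(len(dlq))   # 1
```

The malformed payload is tried three times, then set aside, while the healthy messages around it are processed normally.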

DLQs are also an audit trail. When a payment webhook handler rejects an event, the DLQ preserves the raw payload for forensic review — similar discipline applies to on-chain settlement pipelines where duplicate or out-of-order events must not double-pay winners.

Designing idempotent consumers

Because at-least-once is the realistic default, every consumer should tolerate duplicates. Proven patterns:

  1. Natural-key upsert: INSERT ... ON CONFLICT DO NOTHING on event_id
  2. Idempotency key table — check-and-set before side effects; same pattern as REST idempotency keys
  3. State machine guards — only transition PENDING → PAID if current state allows it
  4. Outbox + single writer — one service owns the row; consumers are read-only projections
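
Patterns 1 and 3 can be combined in one transaction, sketched here with SQLite from the Python standard library. The table names, `handle_payment_settled`, and the event IDs are hypothetical, and the `ON CONFLICT DO NOTHING` form assumes a SQLite build that supports upsert syntax (3.24 or newer).

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
    CREATE TABLE payments (payment_id TEXT PRIMARY KEY, state TEXT NOT NULL);
    INSERT INTO payments VALUES ('pay-1', 'PENDING');
""")


def handle_payment_settled(event_id, payment_id):
    """Return True if this delivery changed the payment state, False otherwise."""
    with db:  # the dedup record and the state change commit together
        cur = db.execute(
            "INSERT INTO processed_events (event_id) VALUES (?) ON CONFLICT DO NOTHING",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # this event_id was already processed
        cur = db.execute(
            # State machine guard: only PENDING rows may move to PAID.
            "UPDATE payments SET state = 'PAID' WHERE payment_id = ? AND state = 'PENDING'",
            (payment_id,),
        )
        return cur.rowcount == 1


print(handle_payment_settled("evt-1", "pay-1"))  # True: first delivery
print(handle_payment_settled("evt-1", "pay-1"))  # False: duplicate event_id
print(handle_payment_settled("evt-2", "pay-1"))  # False: payment is already PAID
```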

Make handlers fast and delegate: acknowledge quickly after persisting intent, then do slow work in a second stage if needed. Processing that outlasts the visibility timeout risks duplicate delivery while the first worker is still running. Log correlation IDs on every message so retries show up as linked traces, not mystery duplicates.

Choosing a broker: a practical decision matrix

| Need | Often fits |
| --- | --- |
| High-volume event streaming, replay, analytics pipelines | Kafka (or managed Confluent / MSK) |
| Flexible routing, moderate volume, on-prem or self-hosted | RabbitMQ |
| Serverless AWS, simple task queues, minimal ops | SQS (+ SNS for fan-out) |
| Strict per-entity ordering, moderate AWS volume | SQS FIFO with message group IDs |
| Low volume, already on Redis | Redis Streams or Bull/BullMQ (know persistence limits) |

Start with the simplest option that meets throughput and ordering needs. Migrating from SQS to Kafka later is painful, but running Kafka for fifty messages per minute is wasteful too. Prototype with managed services, measure consumer lag and error rates, and upgrade when routing or replay requirements outgrow the first choice.

Operational checklist

  1. Monitor queue depth and age of oldest message — lag is the primary health signal
  2. Alert on DLQ growth before customers notice missing emails or stuck orders
  3. Size visibility timeouts and consumer concurrency to your p99 handler duration
  4. Version message schemas — additive fields, explicit schema_version, reject or quarantine unknown versions
  5. Load-test backpressure — what happens when consumers are down for ten minutes?
  6. Document ordering requirements per topic — global order is expensive; partition-key order is cheap
  7. Pair with rate limits on downstream APIs so queue drain does not trigger 429 storms
