Why Kafka DLQs often become a graveyard (and why you should care)
In event-driven systems, failures aren’t edge cases. They’re guaranteed. Downstream hydration APIs time out, consumers crash mid-flight, schemas drift, payloads arrive malformed… and your business still expects the pipeline to finish.
The default move is a Kafka “DLQ topic”. Common pattern, sure—but in many teams it turns into a black box. Messages pile up, nobody can answer basic questions (“what failed yesterday and why?”), and replaying a targeted subset requires extra tooling and custom consumers.
A late-2025 Wayfair-style case study describes exactly this pain: Kafka is great at moving events, but DLQ topics are not easy to inspect. Their pragmatic fix: use PostgreSQL as the DLQ (CloudSQL on GCP), storing failed events in a table with failure context and a status field—making failure observable and actionable (source: diljitpr.net).
DLQ, the useful definition (not the slide-deck version)
A Dead Letter Queue isolates messages that can’t be processed successfully after one or more attempts. The goal isn’t “no errors”. The goal is:
- Don’t block the stream (keep consuming the good stuff).
- Don’t lose data (store payload + context).
- Diagnose quickly (query, audit, dashboard).
- Replay safely (targeted, controlled, measurable).
A 2025 study on EDA maturity reports that 94% of mature event-driven implementations use DLQs, and 82% support reprocessing after fixing root causes. Under normal operations, DLQs are around 0.01%–0.05% of event volume, but can jump to 0.1%–0.3% during systemic incidents. (Source: ResearchGate, Event-Driven Architecture: The Backbone of Real-Time Enterprise Integration, 2025)
Translation: if your DLQ isn’t usable, you’ll burn time, money, and trust.
Why PostgreSQL makes a strong DLQ (when you want practical wins)
The idea is straightforward: instead of publishing failed messages to a DLQ topic, you persist them in Postgres.
1) Instant visibility
With Postgres, your DLQ is SQL-native:
- “Top failure reasons in the last 24h”
- “All failures for a specific customer_id”
- “Only `OrderCreated` events affected”
- “Failures after deploy X”
You move from opaque stream storage to queryable operational data.
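Concretely, the questions above become one-liners. A sketch, assuming the `dlq_events` schema shown below and a `customer_id` field inside the JSONB payload (the field name is illustrative):

```sql
-- Top failure reasons in the last 24 hours
SELECT error_code, count(*) AS failures
FROM dlq_events
WHERE created_at >= now() - interval '24 hours'
GROUP BY error_code
ORDER BY failures DESC;

-- All failures for a specific customer, straight from the JSONB payload
SELECT id, event_type, error_message, created_at
FROM dlq_events
WHERE payload->>'customer_id' = '42';
```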
2) Durability + auditing
Postgres gives you ACID guarantees, constraints, schema management, and easy audit trails. For reporting pipelines or business-critical workflows, that matters.
3) Less infrastructure, less overhead
If Postgres is already in your stack (it usually is), you add a table and a couple indexes—not another cluster, vendor, or operational surface area.
4) Safe parallel reprocessing with FOR UPDATE SKIP LOCKED
The pattern highlighted in the Wayfair case works well: multiple workers can fetch pending failures without stepping on each other:
```sql
SELECT id, payload
FROM dlq_events
WHERE status = 'PENDING'
ORDER BY created_at
FOR UPDATE SKIP LOCKED
LIMIT 100;
```

`SKIP LOCKED` prevents double-processing without external coordination.
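A common refinement (a sketch, not from the case study) is to claim and mark the batch in a single statement, so a worker crash between the SELECT and the UPDATE can't strand rows in limbo:

```sql
UPDATE dlq_events
SET status = 'PROCESSING',
    attempts = attempts + 1,
    updated_at = now()
WHERE id IN (
    SELECT id
    FROM dlq_events
    WHERE status = 'PENDING'
      AND (next_retry_at IS NULL OR next_retry_at <= now())
    ORDER BY created_at
    FOR UPDATE SKIP LOCKED
    LIMIT 100
)
RETURNING id, payload;
```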
A DLQ table design that won’t sabotage you
A pragmatic schema to start with:
```sql
CREATE TYPE dlq_status AS ENUM ('PENDING', 'PROCESSING', 'SUCCEEDED', 'FAILED', 'GAVE_UP');

CREATE TABLE dlq_events (
    id             BIGSERIAL PRIMARY KEY,
    event_key      TEXT,                 -- idempotency key if available
    event_type     TEXT NOT NULL,
    source         TEXT,                 -- consumer/service name
    payload        JSONB NOT NULL,
    error_code     TEXT,
    error_message  TEXT,
    error_stack    TEXT,
    status         dlq_status NOT NULL DEFAULT 'PENDING',
    attempts       INT NOT NULL DEFAULT 0,
    next_retry_at  TIMESTAMPTZ,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX dlq_events_status_retry_idx
    ON dlq_events (status, next_retry_at, created_at);

CREATE INDEX dlq_events_type_idx
    ON dlq_events (event_type, created_at);

CREATE INDEX dlq_events_payload_gin
    ON dlq_events USING GIN (payload);
```

Fields that actually matter
- `event_key`: helps you replay without duplicates (idempotent writes).
- `attempts` + `next_retry_at`: enables backoff and avoids hot-loop retries.
- JSONB `payload` + GIN: lets you filter on business fields without ETL.
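On the replay side, a unique constraint on the idempotency key is what makes re-delivery harmless. A sketch, with a hypothetical `orders` target table (the table and columns are illustrative, not from the case study):

```sql
-- Hypothetical target table; the unique constraint on event_key is the point.
ALTER TABLE orders ADD CONSTRAINT orders_event_key_uniq UNIQUE (event_key);

INSERT INTO orders (event_key, customer_id, total)
VALUES ($1, $2, $3)
ON CONFLICT (event_key) DO NOTHING;  -- replaying the same event is a no-op
```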
A production workflow that works
1) Failure detected → insert into DLQ
Store raw payload + context + root cause.
- downstream timeout → `error_code = 'DOWNSTREAM_TIMEOUT'`
- schema/validation issue → `error_code = 'VALIDATION_ERROR'`
2) Reprocessor (cron/worker) → lock + process
Flow:
- `SELECT … FOR UPDATE SKIP LOCKED` on `PENDING` rows where `next_retry_at <= now()`
- `UPDATE status = 'PROCESSING', attempts = attempts + 1`
- attempt processing
- success → `SUCCEEDED`
- failure → back to `PENDING` with `next_retry_at = now() + backoff`
- after N attempts → `GAVE_UP` + alert
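The retry decision in that flow is a few lines of worker code. A Python sketch (`MAX_ATTEMPTS` and the one-minute base delay are arbitrary choices, not from the source):

```python
from datetime import datetime, timedelta, timezone

MAX_ATTEMPTS = 5                   # after this, give up and alert
BASE_DELAY = timedelta(minutes=1)  # first retry delay

def next_state(attempts: int, succeeded: bool, now: datetime):
    """Return (status, next_retry_at) after a processing attempt.

    `attempts` is the counter *after* the claiming UPDATE incremented it.
    """
    if succeeded:
        return "SUCCEEDED", None
    if attempts >= MAX_ATTEMPTS:
        return "GAVE_UP", None     # terminal: page a human
    # Exponential backoff: 1, 2, 4, 8 ... minutes
    return "PENDING", now + BASE_DELAY * (2 ** (attempts - 1))
```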
3) Observability: alert on time stuck, not just counts
The useful KPIs:
- max age of `PENDING`
- DLQ ingress rate by event type
- replay success rate
If a message stays pending for 2 hours and your SLA is 10 minutes, that’s an incident.
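Both KPIs fall out of plain SQL against the table above (a sketch; wire the results to whatever alerting you already have):

```sql
-- Max age of PENDING rows: alert when this exceeds your SLA
SELECT coalesce(max(now() - created_at), interval '0') AS max_pending_age
FROM dlq_events
WHERE status = 'PENDING';

-- DLQ ingress rate by event type over the last hour
SELECT event_type, count(*) AS ingress
FROM dlq_events
WHERE created_at >= now() - interval '1 hour'
GROUP BY event_type
ORDER BY ingress DESC;
```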
When Postgres DLQ is a great fit (and when it’s not)
Great fit if:
- you have moderate-to-high throughput, but not “hundreds of thousands of events/sec”
- you need diagnostics + targeted replay without custom Kafka DLQ tooling
- Postgres is already your durable store (like the Wayfair setup)
- you want to reduce infra complexity
Not a great fit if:
- your DLQ will retain millions of rows for long periods (table bloat, index pain, VACUUM pressure)
- you need global multi-region active-active semantics for the DLQ
- you actually need a full-blown queueing platform
Community feedback matches this: `FOR UPDATE SKIP LOCKED` is solid for small-to-medium workloads, but if the table grows too large you’ll need partitioning and regular cleanup (a recurring point in Reddit discussions of the pattern).
Anti-headache patterns: partitioning, retention, and bloat hygiene
1) Partition by time (or status)
If you keep DLQ data for 30 days, partition weekly/monthly:
- faster queries
- easy purge (in Postgres, `DROP TABLE` or `DETACH` on the old partition instead of a mass `DELETE`)
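With declarative partitioning, this looks roughly like the sketch below (table names illustrative; note that in Postgres the primary key of a partitioned table must include the partition key, so `created_at` joins the PK):

```sql
-- Range-partition the DLQ by month on created_at.
CREATE TABLE dlq_events_p (
    LIKE dlq_events INCLUDING DEFAULTS,
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE dlq_events_p_2026_01 PARTITION OF dlq_events_p
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');

-- Purging a whole month is instant and VACUUM-free:
DROP TABLE dlq_events_p_2026_01;
```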
2) Aggressive retention + archive
Keep what’s actionable in Postgres. Archive the rest to S3/GCS (daily exports) and move on.
3) Index for real queries
Don’t add 12 indexes “just in case”. Measure:
- `(status, next_retry_at, created_at)` for reprocessing
- `(event_type, created_at)` for analysis
- JSONB GIN only if you truly need it
Going further: PGMQ, Queen, and the hybrid approach
A clear late-2025/early-2026 trend is hybrid messaging:
- Kafka/SQS/RabbitMQ for high-throughput distribution
- Postgres for durability, auditing, and operational failure management
Projects like Queen (an open-source Postgres-based queue) push the idea further with consumer groups, replay, exactly-once semantics, and a built-in DLQ. And in the Postgres ecosystem, PGMQ (popularized via Supabase) shows many teams prefer “one less tool” when the requirement is pragmatic.
Don’t be dogmatic: keep Kafka for what it does best, and use Postgres for what it does best.
Actionable checklist (you can ship this this week)
- Create a DLQ table with `status`, `attempts`, `next_retry_at`, `payload JSONB`.
- Build a reprocessor worker using `FOR UPDATE SKIP LOCKED`.
- Add idempotency on the write side (`event_key` / unique constraint if possible).
- Alert on max pending age + ingress rate.
- Plan retention from day one (partition + purge).
Do this and your DLQ stops being a graveyard. It becomes a quality-control lever for your distributed system.
Want to automate your operations with AI? Book a 15-min call to discuss.
