tech · January 26, 2026

PostgreSQL as a Dead Letter Queue: A DLQ You Can Query

Kafka DLQs often become black boxes. Persisting failed events in PostgreSQL gives you visibility, auditing, and targeted replays—with a simple, robust design.

Why Kafka DLQs often become a graveyard (and why you should care)

In event-driven systems, failures aren’t edge cases. They’re guaranteed. Downstream hydration APIs time out, consumers crash mid-flight, schemas drift, payloads arrive malformed… and your business still expects the pipeline to finish.

The default move is a Kafka “DLQ topic”. Common pattern, sure—but in many teams it turns into a black box. Messages pile up, nobody can answer basic questions (“what failed yesterday and why?”), and replaying a targeted subset requires extra tooling and custom consumers.

A late-2025 Wayfair-style case study describes exactly this pain: Kafka is great at moving events, but DLQ topics are hard to inspect. Their pragmatic fix: use PostgreSQL as the DLQ (Cloud SQL on GCP), storing failed events in a table with failure context and a status field, which makes failures observable and actionable (source: diljitpr.net).

DLQ, the useful definition (not the slide-deck version)

A Dead Letter Queue isolates messages that can’t be processed successfully after one or more attempts. The goal isn’t “no errors”. The goal is:

  1. Don’t block the stream (keep consuming the good stuff).
  2. Don’t lose data (store payload + context).
  3. Diagnose quickly (query, audit, dashboard).
  4. Replay safely (targeted, controlled, measurable).

A 2025 study on EDA maturity reports that 94% of mature event-driven implementations use DLQs, and 82% support reprocessing after fixing root causes. Under normal operations, DLQs are around 0.01%–0.05% of event volume, but can jump to 0.1%–0.3% during systemic incidents. (Source: ResearchGate, Event-Driven Architecture: The Backbone of Real-Time Enterprise Integration, 2025)

Translation: if your DLQ isn’t usable, you’ll burn time, money, and trust.

Why PostgreSQL makes a strong DLQ (when you want practical wins)

The idea is straightforward: instead of publishing failed messages to a DLQ topic, you persist them in Postgres.

1) Instant visibility

With Postgres, your DLQ is SQL-native:

  • “Top failure reasons in the last 24h”
  • “All failures for a specific customer_id”
  • “Only OrderCreated events affected”
  • “Failures after deploy X”

You move from opaque stream storage to queryable operational data.
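Each of those questions is a plain query. For instance, the first one, assuming the dlq_events schema introduced below (with error_code carrying the failure reason):

```sql
-- Top failure reasons over the last 24 hours
SELECT error_code, count(*) AS failures
FROM dlq_events
WHERE created_at >= now() - interval '24 hours'
GROUP BY error_code
ORDER BY failures DESC;
```

The same shape works for the other questions: swap the filter for event_type, a payload field, or a deploy timestamp.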

2) Durability + auditing

Postgres gives you ACID guarantees, constraints, schema management, and easy audit trails. For reporting pipelines or business-critical workflows, that matters.

3) Less infrastructure, less overhead

If Postgres is already in your stack (it usually is), you add a table and a couple of indexes, not another cluster, vendor, or operational surface area.

4) Safe parallel reprocessing with FOR UPDATE SKIP LOCKED

The pattern highlighted in the Wayfair case works well: multiple workers can fetch pending failures without stepping on each other:

SQL
SELECT id, payload
FROM dlq_events
WHERE status = 'PENDING'
ORDER BY created_at
FOR UPDATE SKIP LOCKED
LIMIT 100;

SKIP LOCKED tells each worker to skip rows already locked by another transaction, so you get parallel reprocessing without double-processing and without external coordination.

A DLQ table design that won’t sabotage you

A pragmatic schema to start with:

SQL
CREATE TYPE dlq_status AS ENUM ('PENDING', 'PROCESSING', 'SUCCEEDED', 'FAILED', 'GAVE_UP');

CREATE TABLE dlq_events (
  id            BIGSERIAL PRIMARY KEY,
  event_key     TEXT,                 -- idempotency key if available
  event_type    TEXT NOT NULL,
  source        TEXT,                 -- consumer/service name
  payload       JSONB NOT NULL,
  error_code    TEXT,
  error_message TEXT,
  error_stack   TEXT,
  status        dlq_status NOT NULL DEFAULT 'PENDING',
  attempts      INT NOT NULL DEFAULT 0,
  next_retry_at TIMESTAMPTZ,
  created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX dlq_events_status_retry_idx
  ON dlq_events (status, next_retry_at, created_at);

CREATE INDEX dlq_events_type_idx
  ON dlq_events (event_type, created_at);

CREATE INDEX dlq_events_payload_gin
  ON dlq_events USING GIN (payload);

Fields that actually matter

  • event_key: helps you replay without duplicates (idempotent writes).
  • attempts + next_retry_at: enables backoff and avoids hot-loop retries.
  • JSONB payload + GIN: lets you filter on business fields without ETL.
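For example, with the GIN index in place, filtering on a business field inside the payload is a containment query (customer_id is an assumed payload key, not part of the schema above):

```sql
-- All failures for one customer, filtered on a business field inside
-- the JSONB payload; the @> containment check can use the GIN index
SELECT id, event_type, error_code, created_at
FROM dlq_events
WHERE payload @> '{"customer_id": "42"}'
ORDER BY created_at DESC;
```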

A production workflow that works

1) Failure detected → insert into DLQ

Store raw payload + context + root cause.

  • downstream timeout → error_code = DOWNSTREAM_TIMEOUT
  • schema/validation issue → error_code = VALIDATION_ERROR
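A sketch of such an insert for the timeout case; the key, source name, and payload values are illustrative, not a fixed contract:

```sql
-- On failure, capture the raw payload plus context in one insert
INSERT INTO dlq_events
  (event_key, event_type, source, payload,
   error_code, error_message, next_retry_at)
VALUES
  ('order-123',                              -- idempotency key, if available
   'OrderCreated',
   'order-consumer',                         -- which consumer failed
   '{"order_id": 123, "customer_id": "42"}', -- raw payload, untouched
   'DOWNSTREAM_TIMEOUT',
   'inventory-api timed out after 5s',
   now() + interval '1 minute');             -- first retry eligibility
```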

2) Reprocessor (cron/worker) → lock + process

Flow:

  1. SELECT ... FOR UPDATE SKIP LOCKED on PENDING rows where next_retry_at IS NULL or next_retry_at <= now()
  2. UPDATE status='PROCESSING', attempts=attempts+1
  3. attempt processing
  4. success → SUCCEEDED
  5. failure → back to PENDING with next_retry_at = now() + backoff
  6. after N attempts → GAVE_UP + alert
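Steps 1–2 and 5–6 can be sketched as two statements against the dlq_events table; the batch size, the 5-attempt cap, and the 1-minute exponential-backoff base are assumptions to tune:

```sql
-- Steps 1-2: claim a batch atomically. UPDATE ... RETURNING combines
-- the lock and the status flip in a single statement.
UPDATE dlq_events
SET status = 'PROCESSING',
    attempts = attempts + 1,
    updated_at = now()
WHERE id IN (
  SELECT id
  FROM dlq_events
  WHERE status = 'PENDING'
    AND (next_retry_at IS NULL OR next_retry_at <= now())
  ORDER BY created_at
  FOR UPDATE SKIP LOCKED
  LIMIT 100                 -- assumed batch size
)
RETURNING id, payload;

-- Steps 5-6: on failure, either schedule a retry with exponential
-- backoff or give up after 5 attempts (both numbers are assumptions).
UPDATE dlq_events
SET status = (CASE WHEN attempts >= 5 THEN 'GAVE_UP'
                   ELSE 'PENDING' END)::dlq_status,
    next_retry_at = now() + interval '1 minute' * power(2, attempts),
    updated_at = now()
WHERE id = 42;              -- placeholder id of the event that just failed
```

Doing the claim as a single UPDATE means a crashed worker leaves rows in PROCESSING, so you will also want a sweep that returns stale PROCESSING rows to PENDING after a timeout.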

3) Observability: alert on time stuck, not just counts

The useful KPIs:

  • max age of PENDING
  • DLQ ingress rate by event type
  • replay success rate

If a message stays pending for 2 hours and your SLA is 10 minutes, that’s an incident.
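The first two KPIs are plain SQL against the dlq_events table:

```sql
-- Max age of a PENDING event; page someone if it exceeds your SLA
SELECT max(now() - created_at) AS max_pending_age
FROM dlq_events
WHERE status = 'PENDING';

-- DLQ ingress rate by event type over the last hour
SELECT event_type, count(*) AS failures_last_hour
FROM dlq_events
WHERE created_at >= now() - interval '1 hour'
GROUP BY event_type
ORDER BY failures_last_hour DESC;
```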

When Postgres DLQ is a great fit (and when it’s not)

Great fit if:

  • you have moderate-to-high throughput, but not hundreds of thousands of events per second
  • you need diagnostics + targeted replay without custom Kafka DLQ tooling
  • Postgres is already your durable store (like the Wayfair setup)
  • you want to reduce infra complexity

Not a great fit if:

  • your DLQ will retain millions of rows for long periods (table bloat, index pain, VACUUM pressure)
  • you need global multi-region active-active semantics for the DLQ
  • you actually need a full-blown queueing platform

Community feedback matches this: FOR UPDATE SKIP LOCKED is solid for small-to-medium workloads, but once the table grows large you will need partitioning and regular cleanup (a point echoed repeatedly in community discussions, e.g. on Reddit).

Anti-headache patterns: partitioning, retention, and bloat hygiene

1) Partition by time (or status)

If you keep DLQ data for 30 days, partition weekly/monthly:

  • faster queries
  • easy purge (drop or detach old partitions)
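A minimal sketch of the declarative-partitioning variant (columns abridged; dlq_events_p and the partition names are hypothetical):

```sql
-- With declarative partitioning, the partition key must be part of
-- the primary key, hence the composite (id, created_at)
CREATE TABLE dlq_events_p (
  id         BIGSERIAL,
  payload    JSONB NOT NULL,
  status     TEXT NOT NULL DEFAULT 'PENDING',
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE dlq_events_p_2026_01 PARTITION OF dlq_events_p
  FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');

-- Purging a month is a metadata operation, not a slow DELETE:
DROP TABLE dlq_events_p_2026_01;
```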

2) Aggressive retention + archive

Keep what’s actionable in Postgres. Archive the rest to S3/GCS (daily exports) and move on.

3) Index for real queries

Don’t add 12 indexes “just in case”. Measure:

  • status + next_retry_at + created_at for reprocessing
  • event_type + created_at for analysis
  • JSONB GIN only if you truly need it

Going further: PGMQ, Queen, and the hybrid approach

A clear late-2025/early-2026 trend is hybrid messaging:

  • Kafka/SQS/RabbitMQ for high-throughput distribution
  • Postgres for durability, auditing, and operational failure management

Projects like Queen (an open-source Postgres-based queue) push the idea further with consumer groups, replay, exactly-once semantics, and a built-in DLQ. And in the Postgres ecosystem, PGMQ (popularized via Supabase) shows many teams prefer “one less tool” when the requirement is pragmatic.

Don’t be dogmatic: keep Kafka for what it does best, and use Postgres for what it does best.

Actionable checklist (you can ship this this week)

  1. Create a DLQ table with status, attempts, next_retry_at, payload JSONB.
  2. Build a reprocessor worker using FOR UPDATE SKIP LOCKED.
  3. Add idempotency on the write side (event_key / unique constraint if possible).
  4. Alert on max pending age + ingress rate.
  5. Plan retention from day one (partition + purge).
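Item 3 can be enforced in the database itself. One option is a partial unique index on event_key plus ON CONFLICT DO NOTHING; the index name and the "active statuses" predicate are assumptions:

```sql
-- Reject duplicate enqueues while an event is still being handled
CREATE UNIQUE INDEX dlq_events_event_key_uniq
  ON dlq_events (event_key)
  WHERE event_key IS NOT NULL
    AND status IN ('PENDING', 'PROCESSING');

-- Writers then insert with ON CONFLICT DO NOTHING:
INSERT INTO dlq_events (event_key, event_type, payload, error_code)
VALUES ('order-123', 'OrderCreated', '{"order_id": 123}', 'DOWNSTREAM_TIMEOUT')
ON CONFLICT DO NOTHING;
```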

Do this and your DLQ stops being a graveyard. It becomes a quality-control lever for your distributed system.

Want to automate your operations with AI? Book a 15-min call to discuss.

PostgreSQL DLQ · Dead Letter Queue · event-driven architecture · Kafka consumer retries · FOR UPDATE SKIP LOCKED
