Why Kafka DLQs often become a graveyard (and why you should care)
In event-driven systems, failures aren’t edge cases. They’re guaranteed. Downstream hydration APIs time out, consumers crash mid-flight, schemas drift, payloads arrive malformed… and your business still expects the pipeline to finish.
The default move is a Kafka “DLQ topic”. Common pattern, sure—but in many teams it turns into a black box. Messages pile up, nobody can answer basic questions (“what failed yesterday and why?”), and replaying a targeted subset requires extra tooling and custom consumers.
A late-2025 Wayfair-style case study describes exactly this pain: Kafka is great at moving events, but DLQ topics are not easy to inspect. Their pragmatic fix: use PostgreSQL as the DLQ (CloudSQL on GCP), storing failed events in a table with failure context and a status field—making failure observable and actionable (source: diljitpr.net).
DLQ, the useful definition (not the slide-deck version)
A Dead Letter Queue isolates messages that can’t be processed successfully after one or more attempts. The goal isn’t “no errors”. The goal is:
- Don’t block the stream (keep consuming the good stuff).
- Don’t lose data (store payload + context).
- Diagnose quickly (query, audit, dashboard).
- Replay safely (targeted, controlled, measurable).
A 2025 study on EDA maturity reports that 94% of mature event-driven implementations use DLQs, and 82% support reprocessing after fixing root causes. Under normal operations, DLQs are around 0.01%–0.05% of event volume, but can jump to 0.1%–0.3% during systemic incidents. (Source: ResearchGate, Event-Driven Architecture: The Backbone of Real-Time Enterprise Integration, 2025)
Translation: if your DLQ isn’t usable, you’ll burn time, money, and trust.
Why PostgreSQL makes a strong DLQ (when you want practical wins)
The idea is straightforward: instead of publishing failed messages to a DLQ topic, you persist them in Postgres.
1) Instant visibility
With Postgres, your DLQ is SQL-native:
- “Top failure reasons in the last 24h”
- “All failures for a specific customer_id”
- “Only `OrderCreated` events affected”
- “Failures after deploy X”
You move from opaque stream storage to queryable operational data.
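Concretely, the questions above become one-liners. A sketch, assuming the `dlq_events` schema shown below and a `customer_id` field inside the JSONB payload (the field name is illustrative):

```sql
-- Top failure reasons in the last 24 hours
SELECT error_code, count(*) AS failures
FROM dlq_events
WHERE created_at >= now() - interval '24 hours'
GROUP BY error_code
ORDER BY failures DESC;

-- All failures for a specific customer, straight from the JSONB payload
SELECT id, event_type, error_message, created_at
FROM dlq_events
WHERE payload->>'customer_id' = '42';
```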
2) Durability + auditing
Postgres gives you ACID guarantees, constraints, schema management, and easy audit trails. For reporting pipelines or business-critical workflows, that matters.
3) Less infrastructure, less overhead
If Postgres is already in your stack (it usually is), you add a table and a couple indexes—not another cluster, vendor, or operational surface area.
4) Safe parallel reprocessing with FOR UPDATE SKIP LOCKED
The pattern highlighted in the Wayfair case works well: multiple workers can fetch pending failures without stepping on each other:
```sql
SELECT id, payload
FROM dlq_events
WHERE status = 'PENDING'
ORDER BY created_at
FOR UPDATE SKIP LOCKED
LIMIT 100;
```

`SKIP LOCKED` prevents double-processing without external coordination.
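A common refinement (a sketch, not from the case study) is to claim and mark the batch in a single statement, so a worker crash between the SELECT and the UPDATE can't strand rows in limbo:

```sql
UPDATE dlq_events
SET status = 'PROCESSING',
    attempts = attempts + 1,
    updated_at = now()
WHERE id IN (
    SELECT id
    FROM dlq_events
    WHERE status = 'PENDING'
      AND (next_retry_at IS NULL OR next_retry_at <= now())
    ORDER BY created_at
    FOR UPDATE SKIP LOCKED
    LIMIT 100
)
RETURNING id, payload;
```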
A DLQ table design that won’t sabotage you
A pragmatic schema to start with:
```sql
CREATE TYPE dlq_status AS ENUM ('PENDING', 'PROCESSING', 'SUCCEEDED', 'FAILED', 'GAVE_UP');

CREATE TABLE dlq_events (
    id             BIGSERIAL PRIMARY KEY,
    event_key      TEXT,                 -- idempotency key if available
    event_type     TEXT NOT NULL,
    source         TEXT,                 -- consumer/service name
    payload        JSONB NOT NULL,
    error_code     TEXT,
    error_message  TEXT,
    error_stack    TEXT,
    status         dlq_status NOT NULL DEFAULT 'PENDING',
    attempts       INT NOT NULL DEFAULT 0,
    next_retry_at  TIMESTAMPTZ,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX dlq_events_status_retry_idx
    ON dlq_events (status, next_retry_at, created_at);

CREATE INDEX dlq_events_type_idx
    ON dlq_events (event_type, created_at);

CREATE INDEX dlq_events_payload_gin
    ON dlq_events USING GIN (payload);
```

Fields that actually matter
- `event_key`: helps you replay without duplicates (idempotent writes).
- `attempts` + `next_retry_at`: enables backoff and avoids hot-loop retries.
- JSONB `payload` + GIN: lets you filter on business fields without ETL.
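On the replay side, a unique constraint on the idempotency key is what makes re-delivery harmless. A sketch, with a hypothetical `orders` target table (the table and columns are illustrative, not from the case study):

```sql
-- Hypothetical target table; the unique constraint on event_key is the point.
ALTER TABLE orders ADD CONSTRAINT orders_event_key_uniq UNIQUE (event_key);

INSERT INTO orders (event_key, customer_id, total)
VALUES ($1, $2, $3)
ON CONFLICT (event_key) DO NOTHING;  -- replaying the same event is a no-op
```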
A production workflow that works
1) Failure detected → insert into DLQ
Store raw payload + context + root cause.
- downstream timeout → `error_code = 'DOWNSTREAM_TIMEOUT'`
- schema/validation issue → `error_code = 'VALIDATION_ERROR'`
2) Reprocessor (cron/worker) → lock + process
Flow:
- `SELECT … FOR UPDATE SKIP LOCKED` on `PENDING` rows where `next_retry_at <= now()`
- `UPDATE status = 'PROCESSING', attempts = attempts + 1`
- attempt processing
- success → `SUCCEEDED`
- failure → back to `PENDING` with `next_retry_at = now() + backoff`
- after N attempts → `GAVE_UP` + alert
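The retry decision in that flow is a few lines of worker code. A Python sketch (`MAX_ATTEMPTS` and the one-minute base delay are arbitrary choices, not from the source):

```python
from datetime import datetime, timedelta, timezone

MAX_ATTEMPTS = 5                   # after this, give up and alert
BASE_DELAY = timedelta(minutes=1)  # first retry delay

def next_state(attempts: int, succeeded: bool, now: datetime):
    """Return (status, next_retry_at) after a processing attempt.

    `attempts` is the counter *after* the claiming UPDATE incremented it.
    """
    if succeeded:
        return "SUCCEEDED", None
    if attempts >= MAX_ATTEMPTS:
        return "GAVE_UP", None     # terminal: page a human
    # Exponential backoff: 1, 2, 4, 8 ... minutes
    return "PENDING", now + BASE_DELAY * (2 ** (attempts - 1))
```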
3) Observability: alert on time stuck, not just counts
The useful KPIs:
- max age of `PENDING`
- DLQ ingress rate by event type
- replay success rate
If a message stays pending for 2 hours and your SLA is 10 minutes, that’s an incident.
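Both KPIs fall out of plain SQL against the table above (a sketch; wire the results to whatever alerting you already have):

```sql
-- Max age of PENDING rows: alert when this exceeds your SLA
SELECT coalesce(max(now() - created_at), interval '0') AS max_pending_age
FROM dlq_events
WHERE status = 'PENDING';

-- DLQ ingress rate by event type over the last hour
SELECT event_type, count(*) AS ingress
FROM dlq_events
WHERE created_at >= now() - interval '1 hour'
GROUP BY event_type
ORDER BY ingress DESC;
```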
When Postgres DLQ is a great fit (and when it’s not)
Great fit if:
- you have moderate-to-high throughput, but not “hundreds of thousands of events/sec”
- you need diagnostics + targeted replay without custom Kafka DLQ tooling
- Postgres is already your durable store (like the Wayfair setup)
- you want to reduce infra complexity
Not a great fit if:
- your DLQ will retain millions of rows for long periods (table bloat, index pain, VACUUM pressure)
- you need global multi-region active-active semantics for the DLQ
- you actually need a full-blown queueing platform
Community feedback matches this: `FOR UPDATE SKIP LOCKED` is solid for small-to-medium workloads, but if the table grows too large you’ll need partitioning and regular cleanup (a recurring point in Reddit discussions of the pattern).
Anti-headache patterns: partitioning, retention, and bloat hygiene
1) Partition by time (or status)
If you keep DLQ data for 30 days, partition weekly/monthly:
- faster queries
- easy purge (in Postgres, `DROP TABLE` or `DETACH` on the old partition instead of a mass `DELETE`)
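With declarative partitioning, this looks roughly like the sketch below (table names illustrative; note that in Postgres the primary key of a partitioned table must include the partition key, so `created_at` joins the PK):

```sql
-- Range-partition the DLQ by month on created_at.
CREATE TABLE dlq_events_p (
    LIKE dlq_events INCLUDING DEFAULTS,
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE dlq_events_p_2026_01 PARTITION OF dlq_events_p
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');

-- Purging a whole month is instant and VACUUM-free:
DROP TABLE dlq_events_p_2026_01;
```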
2) Aggressive retention + archive
Keep what’s actionable in Postgres. Archive the rest to S3/GCS (daily exports) and move on.
3) Index for real queries
Don’t add 12 indexes “just in case”. Measure:
- `(status, next_retry_at, created_at)` for reprocessing
- `(event_type, created_at)` for analysis
- JSONB GIN only if you truly need it
Going further: PGMQ, Queen, and the hybrid approach
A clear late-2025/early-2026 trend is hybrid messaging:
- Kafka/SQS/RabbitMQ for high-throughput distribution
- Postgres for durability, auditing, and operational failure management
Projects like Queen (an open-source Postgres-based queue) push the idea further with consumer groups, replay, exactly-once semantics, and a built-in DLQ. And in the Postgres ecosystem, PGMQ (popularized via Supabase) shows many teams prefer “one less tool” when the requirement is pragmatic.
Don’t be dogmatic: keep Kafka for what it does best, and use Postgres for what it does best.
Actionable checklist (you can ship this this week)
- Create a DLQ table with `status`, `attempts`, `next_retry_at`, `payload JSONB`.
- Build a reprocessor worker using `FOR UPDATE SKIP LOCKED`.
- Add idempotency on the write side (`event_key` / unique constraint if possible).
- Alert on max pending age + ingress rate.
- Plan retention from day one (partition + purge).
Do this and your DLQ stops being a graveyard. It becomes a quality-control lever for your distributed system.
Want to automate your operations with AI? Book a 15-min call to discuss.
