· 7 min read ·

The Outbox Pattern in Go and Postgres: Three Ways to Get It Right

Source: lobsters

Distributed systems have a consistency problem that no amount of retry logic can fix cleanly. You write a record to Postgres, then publish an event to your message broker. If the broker call fails after the database commit, the event is gone and your downstream services never know the order was placed. If somehow the publish succeeds but the database write fails, you’ve emitted a phantom event. Neither scenario is acceptable in production, and the naive fix, wrapping both calls in application-level retry logic, does not eliminate the gap, it only makes the failure window smaller.

This is the dual-write problem. The Transactional Outbox pattern, popularized by Chris Richardson, solves it by exploiting the one guarantee you already have: your database’s ACID transaction. This recent walkthrough covers a Go and Postgres implementation, and it’s a good starting point. What I want to do here is go further, looking at three distinct implementation strategies, the PostgreSQL internals that make each one work, and when the operational trade-offs tip you toward one approach over another.

Why the Outbox Works

The core insight is simple. Instead of publishing an event directly to Kafka or NATS or RabbitMQ, you insert the event into an outbox table in the same database transaction as your business write. A separate relay process reads from that table and publishes to the broker. Because the outbox insert and the business insert share a transaction, they either both succeed or both fail. The relay handles publication with its own retry logic.

This guarantees at-least-once delivery, not exactly-once. Consumers must be idempotent. That trade-off is accepted practice in modern distributed systems and is explicitly part of the pattern’s design.

A minimal outbox table in Postgres looks like this:

CREATE TABLE outbox (
    id         UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    topic      TEXT        NOT NULL,
    payload    JSONB       NOT NULL,
    sent_at    TIMESTAMPTZ
);

CREATE INDEX idx_outbox_unsent ON outbox (created_at)
    WHERE sent_at IS NULL;

That last line matters more than it looks. The partial index WHERE sent_at IS NULL only indexes unsent rows. As your system processes millions of messages over time, the index stays small and the relay query stays fast. Without it, a full index grows unbounded and your relay query eventually degrades to a near-sequential scan.

The business transaction side is straightforward:

func CreateOrder(ctx context.Context, db *sql.DB, order Order) error {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    _, err = tx.ExecContext(ctx,
        `INSERT INTO orders (id, customer_id, total, status)
         VALUES ($1, $2, $3, 'pending')`,
        order.ID, order.CustomerID, order.Total,
    )
    if err != nil {
        return err
    }

    payload, _ := json.Marshal(OrderCreatedEvent{
        OrderID:    order.ID,
        CustomerID: order.CustomerID,
        Total:      order.Total,
    })
    _, err = tx.ExecContext(ctx,
        `INSERT INTO outbox (topic, payload) VALUES ($1, $2)`,
        "orders.created", payload,
    )
    if err != nil {
        return err
    }

    return tx.Commit()
}

Both writes commit atomically. Now the question is how to run the relay.

Approach 1: Polling with FOR UPDATE SKIP LOCKED

The simplest relay is a goroutine that wakes on a timer, queries the outbox, and publishes each batch. The key Postgres primitive here is FOR UPDATE SKIP LOCKED, available since version 9.5. It allows multiple relay workers to grab different rows from the table simultaneously without blocking each other.

func (r *Relay) processBatch(ctx context.Context) error {
    rows, err := r.db.QueryContext(ctx, `
        SELECT id, topic, payload
        FROM outbox
        WHERE sent_at IS NULL
        ORDER BY created_at
        LIMIT 100
        FOR UPDATE SKIP LOCKED
    `)
    if err != nil {
        return err
    }
    defer rows.Close()

    var ids []uuid.UUID
    for rows.Next() {
        var msg OutboxMessage
        if err := rows.Scan(&msg.ID, &msg.Topic, &msg.Payload); err != nil {
            return err
        }
        if err := r.publisher.Publish(ctx, msg); err != nil {
            return err
        }
        ids = append(ids, msg.ID)
    }

    if len(ids) == 0 {
        return nil
    }

    _, err = r.db.ExecContext(ctx,
        `UPDATE outbox SET sent_at = NOW() WHERE id = ANY($1)`,
        pq.Array(ids),
    )
    return err
}

Without SKIP LOCKED, two workers contending on the same rows would block, defeating the purpose of parallelism. With it, each worker locks a disjoint set of rows and proceeds immediately. The trade-off is polling latency: at a 500ms interval, your worst-case event delay is 500ms. At 1s, it’s 1s. For most transactional workloads, this is acceptable.

Update churn is a real concern at scale. Every processed row gets an UPDATE to set sent_at, generating dead tuples that autovacuum must reclaim. For high-volume outbox tables, tune autovacuum aggressively:

ALTER TABLE outbox SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_analyze_scale_factor = 0.01
);

Alternatively, delete rows instead of marking them sent, and keep a separate archive table for audit purposes.

Approach 2: LISTEN/NOTIFY for Lower Latency

Postgres’s LISTEN/NOTIFY mechanism lets the relay react immediately when a new outbox row is committed, rather than waiting for the next poll cycle. A trigger fires after each insert and sends a lightweight notification over the replication connection.

CREATE OR REPLACE FUNCTION notify_outbox()
RETURNS TRIGGER LANGUAGE plpgsql AS $$
BEGIN
    PERFORM pg_notify('outbox_chan', NEW.id::text);
    RETURN NEW;
END;
$$;

CREATE TRIGGER outbox_notify
    AFTER INSERT ON outbox
    FOR EACH ROW EXECUTE FUNCTION notify_outbox();
func (r *NotifyRelay) Run(ctx context.Context) error {
    conn, err := pgx.Connect(ctx, r.dsn)
    if err != nil {
        return err
    }
    defer conn.Close(ctx)

    if _, err := conn.Exec(ctx, "LISTEN outbox_chan"); err != nil {
        return err
    }

    for {
        _, err := conn.WaitForNotification(ctx)
        if err != nil {
            if ctx.Err() != nil {
                return ctx.Err()
            }
            return err
        }
        if err := r.processBatch(ctx); err != nil {
            log.Printf("relay error: %v", err)
        }
    }
}

This drives latency from hundreds of milliseconds down to single-digit milliseconds in most cases. The notification fires only after the transaction commits, so there is no risk of processing a row from a rolled-back transaction.

There is one critical catch. Postgres stores pending notifications in shared memory (pg_notification_queue), and if the queue fills, notifications are silently dropped. The relay also misses any notifications that fire while it is disconnected. Both scenarios mean the relay must still run a periodic fallback poll, perhaps every 30 seconds, to catch anything it missed. The LISTEN path handles the normal case; the poll handles edge cases.

Keep notification payloads small. The Postgres limit is 8000 bytes. Sending just the row UUID and letting the relay fetch the full payload is the right approach.

Approach 3: CDC via the Write-Ahead Log

Change Data Capture reads Postgres’s Write-Ahead Log directly rather than querying the outbox table. Instead of polling rows, your relay subscribes to the logical replication stream and receives every committed INSERT to the outbox table in sub-millisecond time.

This requires wal_level = logical in your Postgres configuration and a replication slot:

SELECT pg_create_logical_replication_slot('outbox_slot', 'pgoutput');
CREATE PUBLICATION outbox_pub FOR TABLE outbox;

The pglogrepl library (from the pgx author) handles the replication protocol on the Go side. It connects using a raw replication connection via pgconn, then decodes the logical stream:

conn, _ := pgconn.Connect(ctx, dsn+"?replication=database")
pglogrepl.StartReplication(ctx, conn, "outbox_slot", startLSN,
    pglogrepl.StartReplicationOptions{
        PluginArgs: []string{
            "proto_version '2'",
            "publication_names 'outbox_pub'",
        },
    })

The relay then reads XLogData messages from the stream. Each message carries the full row data for every committed insert, with no polling, no index scans, and no lock contention.

CDC throughput is significantly higher than polling because WAL reading is sequential I/O at the storage level. The operational cost is managing the replication slot. If your relay falls behind or goes down, Postgres must retain WAL segments until the slot catches up. This can fill your disk. Postgres 13 added max_slot_wal_keep_size to cap this, at the cost of potentially invalidating a lagging slot. Monitor slot lag in production:

SELECT slot_name,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
       ) AS lag
FROM pg_replication_slots;

CDC also typically requires the relay to have REPLICATION privilege or superuser access. On managed Postgres services where you lack that, polling or LISTEN/NOTIFY are your practical options.

Ecosystem Libraries

Watermill from ThreeDotsLabs is the most complete Go library for this pattern. Its watermill-sql package provides a PostgreSQL-backed publisher and subscriber that handle the outbox table, the relay loop, and forwarding to Kafka, AMQP, NATS, or HTTP. It uses FOR UPDATE SKIP LOCKED internally, configurable batch sizes, and a SchemaAdapter interface if you want to control the table structure.

River takes a different angle: it’s a full Postgres job queue that uses pgx/v5 natively and LISTEN/NOTIFY for immediate job pickup. You can use it as the relay worker by inserting outbox events as River jobs inside your business transaction. River then handles retries, priority, concurrency, and scheduling.

Choosing an Approach

For most systems, polling with FOR UPDATE SKIP LOCKED at a 500ms to 1s interval is the right starting point. It is simple, requires no Postgres configuration changes, and scales horizontally by adding relay workers. The latency cost is real but tolerable for transactional event flows.

Add LISTEN/NOTIFY when you need event propagation in under 100ms and your team can manage the added complexity of a persistent dedicated connection and a fallback poll.

Reach for CDC and pglogrepl when you have high message volumes (tens of thousands per minute), existing logical replication infrastructure, and the operational maturity to monitor replication slot lag continuously.

In all three cases, the consumer must be idempotent. Store processed event IDs in a table with a primary key and use ON CONFLICT DO NOTHING on insert, or structure your business logic so that applying the same event twice is a no-op by design. The outbox gives you reliable delivery; idempotency gives you correctness on top of it.

Was this interesting?