
The Relay Is the Hard Part: Implementing the Outbox Pattern in Go and Postgres

Source: lobsters

Distributed systems eventually force a specific conversation: what happens when you need to write to your database and publish a message to a broker, and you cannot guarantee both succeed? The naive answer is to do both sequentially and hope. The honest answer is that this fails in production, and the fix has a name: the transactional outbox pattern.

This video walkthrough covers the mechanics of the pattern in Go and Postgres, and it is a solid introduction. But the table schema is the part everyone figures out quickly. The part that generates real engineering decisions is the relay, the process that reads from the outbox and forwards messages to the broker. That is where the trade-offs live, and that is what deserves more attention.

Why the Dual-Write Problem Does Not Have an Easy Fix

The core issue is that Postgres and your message broker (Kafka, NATS, RabbitMQ, whatever) are separate systems. There is no transaction that spans both. If you commit to Postgres and then the broker publish fails, you have lost the event. If the broker publish succeeds but the Postgres commit rolls back, you have a phantom event. Neither outcome is acceptable for anything that matters.
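To make the first failure mode concrete, here is a minimal sketch of the naive dual write. The memStore and flakyBroker types are hypothetical in-memory stand-ins, not real Postgres or broker clients; they exist only to show where the event goes missing.

```go
package main

import "errors"

// memStore is a hypothetical stand-in for Postgres: every save "commits".
type memStore struct{ orders []string }

func (s *memStore) SaveOrder(id string) error {
	s.orders = append(s.orders, id)
	return nil
}

// flakyBroker is a hypothetical stand-in for a message broker that can be
// switched into a failing state.
type flakyBroker struct {
	fail      bool
	published []string
}

func (b *flakyBroker) Publish(topic, key string) error {
	if b.fail {
		return errors.New("broker unavailable")
	}
	b.published = append(b.published, topic+":"+key)
	return nil
}

// naiveCreateOrder is the dual write: commit first, publish second.
func naiveCreateOrder(s *memStore, b *flakyBroker, id string) error {
	if err := s.SaveOrder(id); err != nil {
		return err
	}
	// If this call fails (or the process dies right here), the order is
	// already committed and nothing will ever publish its event.
	return b.Publish("order.events", id)
}
```

Calling naiveCreateOrder with a failing broker returns an error, yet the order is already in the store. Swapping the two calls only trades the lost event for a phantom one.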

XA transactions exist to solve exactly this, but in practice they are slow, poorly supported, and operationally painful. Nobody running Go services on Postgres is reaching for XA.

The outbox pattern sidesteps this by writing the event into Postgres itself, in the same transaction as your business data:

func (s *OrderService) CreateOrder(ctx context.Context, req CreateOrderRequest) error {
    return pgx.BeginTxFunc(ctx, s.db, pgx.TxOptions{}, func(tx pgx.Tx) error {
        order := buildOrder(req)
        if _, err := tx.Exec(ctx,
            "INSERT INTO orders (id, user_id, total) VALUES ($1, $2, $3)",
            order.ID, order.UserID, order.Total,
        ); err != nil {
            return err
        }

        payload, err := json.Marshal(OrderCreatedEvent{
            OrderID: order.ID,
            UserID:  order.UserID,
            Total:   order.Total,
        })
        if err != nil {
            return fmt.Errorf("marshal OrderCreated event: %w", err)
        }
        // Same transaction as the orders insert: both rows commit or neither does.
        _, err = tx.Exec(ctx,
            `INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload)
             VALUES (gen_random_uuid(), $1, $2, $3, $4)`,
            "order", order.ID, "OrderCreated", payload,
        )
        return err
    })
}

Either both rows commit or neither does. The event cannot be lost at write time. The remaining problem is getting it from the outbox table to the broker reliably, and that is where designs diverge.

Polling: Simpler Than It Looks

The straightforward approach is a background loop that queries pending rows and publishes them. The part most introductions skip is FOR UPDATE SKIP LOCKED, the Postgres feature that makes this safe to run with multiple relay instances:

func (r *Relay) publishBatch(ctx context.Context) error {
    tx, err := r.db.Begin(ctx)
    if err != nil {
        return err
    }
    defer tx.Rollback(ctx)

    rows, err := tx.Query(ctx, `
        SELECT id, aggregate_type, aggregate_id, event_type, payload
        FROM outbox
        WHERE published_at IS NULL
        ORDER BY created_at
        LIMIT $1
        FOR UPDATE SKIP LOCKED
    `, r.batchSize)
    if err != nil {
        return err
    }
    defer rows.Close()

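    var ids []uuid.UUID
    for rows.Next() {
        var msg OutboxMessage
        if err := rows.Scan(&msg.ID, &msg.AggregateType, &msg.AggregateID, &msg.EventType, &msg.Payload); err != nil {
            return err
        }
        topic := fmt.Sprintf("%s.events", msg.AggregateType)
        // Returning here rolls back the transaction, so the whole batch stays pending and is retried.
        if err := r.broker.Publish(ctx, topic, msg.AggregateID, msg.Payload); err != nil {
            return err
        }
        ids = append(ids, msg.ID)
    }
    // rows.Next also returns false on a read error, so check before marking anything published.
    if err := rows.Err(); err != nil {
        return err
    }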

    if len(ids) == 0 {
        return tx.Rollback(ctx)
    }

    _, err = tx.Exec(ctx,
        "UPDATE outbox SET published_at = now() WHERE id = ANY($1)",
        ids,
    )
    if err != nil {
        return err
    }
    return tx.Commit(ctx)
}

SKIP LOCKED tells Postgres to skip rows that another transaction has locked rather than blocking. Multiple relay instances each get non-overlapping batches, so you can scale the relay horizontally without a coordination layer. The cost is ordering: concurrent relays can publish rows out of created_at order, so run a single relay if consumers depend on strict ordering. SKIP LOCKED also appears in several Postgres queueing patterns beyond the outbox.

The schema needs a partial index or the relay will table-scan as the outbox grows:

CREATE INDEX idx_outbox_pending
    ON outbox (created_at)
    WHERE published_at IS NULL;

Polling latency is bounded by your poll interval, typically 100ms to a few seconds. For most workloads this is fine. The relay is also doing constant work even when the outbox is empty, which is wasteful but predictable.

WAL-Based CDC: Lower Latency, Higher Complexity

Postgres writes every change to its Write-Ahead Log before touching data pages. Logical replication decodes that log into row-level events. Rather than polling, a CDC relay consumes the WAL stream and publishes immediately after commit, with latency typically under 100ms.

The Go library pglogrepl (part of the pgx ecosystem) lets you consume the WAL directly without running Debezium or any JVM process:

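-- One-time setup (changing wal_level only takes effect after a server restart)
ALTER SYSTEM SET wal_level = logical;
SELECT pg_create_logical_replication_slot('outbox_slot', 'pgoutput');
CREATE PUBLICATION outbox_pub FOR TABLE outbox;

import (
    "context"

    "github.com/jackc/pglogrepl"
    "github.com/jackc/pgx/v5/pgconn"
)

// startOutboxStream opens a replication connection and starts streaming from outbox_slot.
func startOutboxStream(ctx context.Context) (*pgconn.PgConn, error) {
    // replication=database switches the connection into logical replication mode.
    conn, err := pgconn.Connect(ctx,
        "postgres://user:pass@localhost/db?replication=database")
    if err != nil {
        return nil, err
    }
    err = pglogrepl.StartReplication(ctx, conn, "outbox_slot", 0,
        pglogrepl.StartReplicationOptions{
            PluginArgs: []string{
                "proto_version '1'",
                "publication_names 'outbox_pub'",
            },
        })
    if err != nil {
        conn.Close(ctx)
        return nil, err
    }
    return conn, nil
}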

The trade-off is operational: replication slots retain WAL segments until they are consumed. A stalled relay means WAL accumulates on disk, and if the volume fills up Postgres can no longer write WAL and the server goes down until space is freed. Postgres 13 introduced max_slot_wal_keep_size, which invalidates a slot rather than filling the disk, but an invalidated slot means the relay loses its position and needs recovery logic. Neither outcome is fun at 3am.
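Because a stalled slot is the main risk, it helps to alert on how far the slot has fallen behind. The sketch below assumes you have already fetched pg_current_wal_lsn() and the slot's confirmed_flush_lsn from pg_replication_slots as text; parseLSN and slotLagBytes are hypothetical helper names. Postgres can also compute the difference server-side with pg_wal_lsn_diff.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLSN converts the textual LSN form Postgres prints ("16/B374D848")
// into a byte position: the high 32 bits, a slash, then the low 32 bits,
// both in hexadecimal.
func parseLSN(s string) (uint64, error) {
	hi, lo, ok := strings.Cut(s, "/")
	if !ok {
		return 0, fmt.Errorf("invalid LSN %q: missing '/'", s)
	}
	h, err := strconv.ParseUint(hi, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	l, err := strconv.ParseUint(lo, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	return h<<32 | l, nil
}

// slotLagBytes reports how many bytes of WAL lie between the slot's
// confirmed position and the server's current write position.
func slotLagBytes(currentLSN, confirmedFlushLSN string) (uint64, error) {
	cur, err := parseLSN(currentLSN)
	if err != nil {
		return 0, err
	}
	flushed, err := parseLSN(confirmedFlushLSN)
	if err != nil {
		return 0, err
	}
	if flushed > cur {
		return 0, nil // the two values were sampled at slightly different moments
	}
	return cur - flushed, nil
}
```

An alert threshold comfortably below max_slot_wal_keep_size gives you time to restart the relay before the slot is invalidated.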

This is why WAL-based approaches suit teams with dedicated platform infrastructure. The latency improvement over polling is real, but the operational surface area is meaningfully larger.

The Hybrid Nobody Talks About: LISTEN/NOTIFY + Polling

There is a middle path that gets less attention: use Postgres’s built-in LISTEN/NOTIFY mechanism to wake the relay immediately after an insert, while keeping a polling fallback for reliability. A trigger fires a notification on each outbox insert, the relay processes it immediately, and the background poll catches anything missed during relay downtime.

CREATE OR REPLACE FUNCTION notify_outbox_insert()
RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('outbox_insert', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER outbox_insert_trigger
    AFTER INSERT ON outbox
    FOR EACH ROW EXECUTE FUNCTION notify_outbox_insert();

func (r *Relay) Run(ctx context.Context) error {
    // Dedicated connection for LISTEN
    listenConn, err := pgx.Connect(ctx, r.dsn)
    if err != nil {
        return err
    }
    defer listenConn.Close(ctx)

    if _, err := listenConn.Exec(ctx, "LISTEN outbox_insert"); err != nil {
        return err
    }

    // Slow fallback poll
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()

    for {
        select {
        case <-ticker.C:
            // Safety net: catches rows whose notification was missed
            if err := r.publishBatch(ctx); err != nil {
                log.Printf("outbox relay: fallback poll failed: %v", err)
            }
        case <-ctx.Done():
            return ctx.Err()
        default:
            // Check for notification without blocking indefinitely
            notifyCtx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
            _, err := listenConn.WaitForNotification(notifyCtx)
            timedOut := notifyCtx.Err() != nil
            cancel()
            switch {
            case err == nil:
                // Immediate relay
                if err := r.publishBatch(ctx); err != nil {
                    log.Printf("outbox relay: notified publish failed: %v", err)
                }
            case !timedOut:
                // Not a timeout: the LISTEN connection is broken, so let the caller reconnect
                return err
            }
        }
    }
}

This can bring delivery latency down to roughly 5 to 20ms after commit, with no replication slot, no extra infrastructure, and the polling safety net as a backstop. NOTIFY messages are not durable (they are lost if no listener is connected), which is exactly why the poll fallback matters. The relay picks up any undelivered rows on its next poll after it reconnects.

For most Go services that do not already operate a Kafka Connect cluster, this hybrid is underrated. It is less complex than WAL-based CDC, more responsive than pure polling, and the only infrastructure dependency is Postgres itself.

Using River for the Whole Thing

If your consumers are within the same service cluster, River eliminates the broker entirely. It is a Postgres-native job queue built on pgx v5, and its InsertTx and InsertManyTx methods participate in your existing transaction, making it a direct outbox implementation:

// In your business transaction
if _, err := riverClient.InsertManyTx(ctx, tx, []river.InsertManyParams{
    {Args: OrderCreatedArgs{OrderID: order.ID, UserID: order.UserID}},
    {Args: SendConfirmationEmailArgs{To: req.Email}},
}); err != nil {
    return err
}
// The jobs only exist if the outer transaction commits

River handles retries, backoff, worker concurrency, and setting aside jobs that exhaust their attempts. If you do not actually need a message broker in the architecture, this is probably the cleanest option available for Go services on Postgres today.

The Consumer Side Is Also Part of the Contract

The outbox pattern delivers at-least-once semantics. The relay can crash after publishing but before marking the row, sending the same message twice. Consumers need to be idempotent, and the standard tool for that is an inbox table on the receiving side:

CREATE TABLE inbox (
    event_id     UUID PRIMARY KEY,
    topic        TEXT NOT NULL,
    processed_at TIMESTAMPTZ DEFAULT now()
);

// processEvent skips events already recorded in the inbox. fn must do its
// writes through tx, so a rollback undoes both the inbox row and fn's work.
func processEvent(ctx context.Context, tx pgx.Tx, eventID uuid.UUID, topic string, fn func() error) error {
    // Claim the event first; the primary key turns a duplicate insert into a no-op.
    tag, err := tx.Exec(ctx,
        "INSERT INTO inbox (event_id, topic) VALUES ($1, $2) ON CONFLICT (event_id) DO NOTHING",
        eventID, topic,
    )
    if err != nil {
        return err
    }
    if tag.RowsAffected() == 0 {
        return nil // already processed
    }
    return fn()
}

The event_id should come from the outbox row’s id, threaded through the broker message headers so consumers can look it up.
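One way to thread the id through is a small envelope that the relay fills in and the consumer reads back. This is a sketch under assumptions: Message is a hypothetical broker-agnostic type, and the "event_id" header name is a convention chosen here, not something a particular broker requires. Real clients expose headers through their own APIs.

```go
package main

import "errors"

// eventIDHeader is the header key assumed by this sketch.
const eventIDHeader = "event_id"

// Message is a hypothetical broker-agnostic envelope.
type Message struct {
	Topic   string
	Key     string
	Headers map[string]string
	Payload []byte
}

// newMessage builds the envelope on the relay side, carrying the outbox
// row's id in a header so the payload itself stays unchanged.
func newMessage(outboxID, aggregateType, aggregateID string, payload []byte) Message {
	return Message{
		Topic:   aggregateType + ".events",
		Key:     aggregateID,
		Headers: map[string]string{eventIDHeader: outboxID},
		Payload: payload,
	}
}

// eventID extracts the id on the consumer side. A message without one
// cannot be deduplicated, so it is reported as an error.
func eventID(m Message) (string, error) {
	id, ok := m.Headers[eventIDHeader]
	if !ok || id == "" {
		return "", errors.New("message has no event_id header")
	}
	return id, nil
}
```

The consumer would parse the returned string as a UUID and pass it to its inbox check before doing any work.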

Cleaning Up

Processed outbox rows accumulate. A background delete on published_at works for moderate volume, but large tables benefit from range partitioning by created_at with monthly partitions, which lets you drop entire partitions instead of running slow deletes against a hot table.

CREATE TABLE outbox (
    id              UUID,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    -- ... other columns
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE outbox_2026_03
    PARTITION OF outbox
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');

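DROP TABLE outbox_2026_01 removes the whole partition as a quick metadata operation, no matter how many rows it holds, where a DELETE would have to touch every row. Drop a partition only once every row in it has been published. That is the cleanup story for high-volume outboxes.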

Picking an Approach

For most Go services starting with the outbox pattern: polling with FOR UPDATE SKIP LOCKED and a partial index is the right default. Add LISTEN/NOTIFY if latency becomes a concern. Reach for WAL-based CDC or Debezium when you already operate Kafka Connect and need sub-100ms delivery at scale. Use River when there is no external broker in the picture.

The table design is the same in all cases. The relay strategy is the actual architectural decision.
