· 6 min read ·

When the Broker Becomes the Writer: What Iceberg-Native Kafka Storage Actually Requires

Source: lobsters

For most teams running Kafka at any meaningful scale, there are really two data systems in production. The first is Kafka itself: a cluster of brokers managing topic-partitions, writing binary log segments to local disk or tiered object storage, serving consumers via the Kafka protocol. The second is whatever lakehouse they built downstream of it: an Iceberg or Delta Lake table on S3, fed by a Flink job or Kafka Connect pipeline, queryable by Spark and Trino. Both systems hold the same data. Both have their own retention policies, their own operational surface area, and their own failure modes. The lag between data landing in Kafka and becoming visible to an analytics query is measured in minutes on a good day.

Ursa proposes something more fundamental than yet another connector: make Iceberg the broker’s primary storage format. Not a downstream destination, but the thing the broker writes directly. The Kafka topic and the Iceberg table become the same object.

What Kafka Actually Writes to Disk

Kafka’s storage model is a binary append log. Each topic-partition is a directory containing .log files (the message data), .index files (a sparse offset-to-physical-position map sampled every ~4KB), and .timeindex files (timestamp-to-offset maps). A new log segment rolls when the active segment hits a size threshold (default 1 GB via log.segment.bytes) or a time window. The format is a sequence of RecordBatch structures: a fixed header carries the base offset, last offset delta, max timestamp, producer ID, producer epoch, and compression type, followed by the compressed batch payload.

This format is excellent for what Kafka does. Sequential writes are fast. The index structure supports efficient offset seeks. Compression per batch works well because adjacent messages tend to share structure. But the format is entirely opaque to anything outside the Kafka ecosystem. Spark cannot read a Kafka .log file. Trino cannot push predicates into it. DuckDB cannot time-travel through it.

KIP-405 tiered storage, which shipped in Kafka 3.6, took a reasonable but limited step: offload older log segments to object storage via a pluggable RemoteStorageManager interface, keeping only recent segments on local disk. This solved cost and retention at scale. What it did not do is change the format. The files uploaded to S3 are the same binary .log, .index, and .timeindex files. They are not queryable by any analytics engine. You still need the ETL pipeline. You still have two systems.

The ETL Tax

The standard path from Kafka to Iceberg today involves running either a Flink job with the iceberg-flink module or a Kafka Connect pipeline using an Iceberg sink connector. Both consume from Kafka and write Parquet files plus Iceberg metadata to object storage. Both introduce lag, typically a few minutes between message production and snapshot commit. Both add operational weight: another job to monitor, another set of consumer group offsets to track, another source of schema drift when the producer changes a field.

There is also a less obvious cost: duplicate storage. During retention overlap, the data lives in both Kafka and Iceberg. Teams end up with two retention policies that need to stay in sync, and two copies of the same bytes paying separate storage bills.

Iceberg-Native: What It Actually Means

Apache Iceberg structures data as a three-layer hierarchy. At the top, a catalog points to a metadata JSON file. The metadata file tracks schema history, partition spec history, and a list of snapshots. Each snapshot references a manifest list, which is an Avro file enumerating manifest files. Each manifest file lists individual data files, which are typically Parquet, along with per-file column statistics: null counts, min and max values per column. These statistics are what enable partition pruning and file skipping during query planning.

In an Iceberg-native Kafka storage engine, the broker writes Parquet data files and Iceberg metadata directly, rather than Kafka binary log segments. The topic is the table. A consumer using the standard Kafka protocol still works: the broker reads from Parquet and serializes records back into the Kafka wire format on the read path. But a Spark job or Trino query can also scan the same table directly, with no pipeline in between.

This is the architectural shift Ursa represents. The value is real: zero-copy analytics, time travel on your event stream, schema enforcement at the broker rather than hoping downstream consumers agree on field types, and predicate pushdown when reading historical data.

The Problems This Creates

Making Iceberg the primary storage layer is not a straightforward substitution. Several deep mismatches between Kafka’s design and Iceberg’s need to be resolved.

The small file problem. Kafka is designed for high-throughput small message writes. Parquet is designed for large columnar batches. A Parquet row group is typically 128 MB to 1 GB; writing a file per Kafka batch produces millions of tiny Parquet files, each with its own metadata overhead and poor compression ratios. The broker must buffer messages in memory or on local storage before flushing a row group, which introduces a latency and durability trade-off. This is not unsolvable, AutoMQ’s architecture shows that buffering before object storage writes can be managed, but it requires explicit design rather than a direct log segment replacement.

Consumer offset compatibility. Kafka offsets are sequential integers assigned per partition. Iceberg has no concept of offsets; it has snapshot IDs and sequence numbers, but these do not map cleanly to per-record positions. An Iceberg-native broker must maintain an offset-to-Parquet-position index, which essentially reconstructs one of the things Kafka’s .index file currently provides. The implementation surface here is non-trivial, especially across partition reassignments and leader failovers.

Replication semantics. Kafka’s replication protocol works at the RecordBatch level: the leader appends to its local log, followers fetch and replicate the same byte sequence, and the high-water mark advances once a quorum has acknowledged. With Iceberg-native storage, the unit of replication changes. If the leader writes a Parquet file to shared object storage and followers reference the same file, you get object storage as the replication layer, which is how AutoMQ approaches it. But this changes durability and consistency semantics in ways that interact with Kafka’s existing guarantees around acknowledged offsets.

Compacted topics. Kafka’s log compaction keeps only the most recent record per key, running as a background LogCleaner thread. Iceberg v2 supports equality delete files, which express row-level deletes without rewriting data, but the semantics differ from Kafka compaction. A mapping between compaction behavior and Iceberg’s position-delete or equality-delete mechanisms must be defined explicitly.

How This Differs from AutoMQ

AutoMQ takes Kafka’s storage to object storage while preserving the binary log format. Brokers become stateless, all data lands on S3, partition reassignment becomes instant because there is no data to move. It solves the cost and elasticity problem well. It does not solve the analytics problem at all. You still need a separate pipeline to get data into Iceberg.

Ursa’s approach is orthogonal. The target is not just cheaper storage, but eliminating the boundary between the streaming system and the table format entirely. These are different problems with different trade-offs; Ursa necessarily accepts more write-path complexity in exchange for a unified data representation.

What Becomes Possible

If the broker writes Iceberg natively, several things that currently require separate systems fall out for free. Time travel: query a topic as of any historical snapshot. Schema evolution: add a column to the Iceberg table and the broker enforces it at ingest rather than discovering the mismatch downstream. Direct SQL access: SELECT * FROM kafka_topic WHERE event_type = 'checkout' AND ts > now() - interval '7 days' with full Parquet column statistics-based file skipping. No connector to operate, no consumer group lag to monitor, no duplicate retention policies to keep synchronized.

S3 Express One Zone, AWS’s single-AZ low-latency object storage variant, brings write latency into the single-digit millisecond range for small objects, which makes object-storage-primary architectures considerably more viable for near-real-time workloads than they were two years ago.

The trade-offs are real: write latency is higher than local-disk Kafka, the Iceberg catalog becomes a critical dependency, and small-message, high-frequency producers push hard against Parquet’s design assumptions. But for workloads where the event stream and the analytics table are the same logical entity, the current two-system architecture has always been a workaround rather than a solution. Ursa is a serious attempt to treat it as one.

Was this interesting?