The Kafka-Iceberg Storage Problem Ursa Is Trying to Solve

The familiar operational pattern for teams building analytical pipelines on top of event streams looks like this: produce to Kafka, run a Flink or Spark Structured Streaming job that reads from Kafka and writes to an Iceberg table, then query the Iceberg table for analytics. The Kafka layer handles real-time ingestion, backpressure, and ordering guarantees. The Iceberg layer handles queryability, schema evolution, time travel, and low-cost long-term retention.

This arrangement works, and for many teams it has been the default for years. It also stores the same data twice: once as Kafka log segments on broker disks, and again as Parquet files in the Iceberg table. It requires a continuous streaming job whose sole purpose is format conversion. It introduces a freshness lag between when events land in Kafka and when they appear in the lakehouse. For high-volume, long-retention topics, the storage cost multiplier is substantial.

Ursa, a new storage engine for Kafka described on topicpartition.io, frames this duplication as the central problem. The proposal is a lakehouse-first storage engine: write Kafka data directly to Iceberg-native format, collapsing the two tiers into one. The motivation is clear; the implementation is where the difficulty lives.

Why KIP-405 Did Not Close This Gap

KIP-405, tiered storage shipped in Kafka 3.6, allows brokers to offload older log segments to object storage. The local tier holds recent data, bounded by a configurable retention window. Segments that age past the threshold are uploaded to S3 or GCS and fetched back through the broker when consumers request cold data.

This solves the storage cost problem for long-retention topics without requiring broker-attached disks sized for the full retention period. It does not change the storage format. Remote segments are Kafka .log files: the binary record batch format Kafka has used since its earliest versions. They live on S3 but remain opaque to any tool outside the Kafka broker. Pointing Trino at a 90-day-old Kafka topic backed by KIP-405 storage means fetching through the broker rather than running a Parquet scan directly against the object store. The ETL job and the data duplication persist.

There is also a subtler problem: because the broker proxies reads from remote storage, cold consumer fetches consume broker CPU and network bandwidth even for workloads that have no operational need to pass through the broker. The remote tier saves on storage cost while adding a new bottleneck on the read path for analytical workloads.

Object-Storage-First Kafka Got the Medium Right

AutoMQ and WarpStream took a more radical position: use object storage as the primary storage medium, not just the cold tier. AutoMQ retains only a small write-ahead log on local disk for low-latency durability acknowledgment, flushes stream data to S3 as its primary store, and makes brokers mostly stateless. Partition reassignment becomes a metadata pointer swap rather than a multi-hour data copy. WarpStream, acquired by Confluent in 2024, eliminates broker-attached storage entirely; stateless agents write directly to S3, removing the inter-availability-zone ISR replication traffic that drives significant cost in high-throughput Kafka deployments on AWS.

Both systems achieve meaningful operational improvements. Neither writes Iceberg. The data on S3 remains in a Kafka-internal or proprietary format, and the ETL pipeline to the lakehouse still exists. Solving the storage medium problem left the storage format problem intact.

The Offset-Snapshot Mismatch

The core engineering challenge for any Iceberg-first Kafka is reconciling two incompatible progress models.

Kafka’s consumer protocol tracks progress using (topic, partition, offset) triples. An offset is a monotonically increasing integer assigned to each record within a partition. Consumer groups commit offsets, resume from them, seek to specific positions, and measure lag by comparing producer and consumer offsets. This model underlies the entire Kafka ecosystem: connectors, stream processors, consumer group management tooling, and observability infrastructure.

Apache Iceberg tracks state using snapshots. A snapshot is an immutable point-in-time view of a table, expressed as a pointer to a manifest list, which lists manifest files, which list Parquet data files. There are no per-record offsets anywhere in this hierarchy. A snapshot says: these files exist at this moment in time. It does not say: record number 4,821,044 is in row 7 of this Parquet file.

Kafka model:
  partition 0: [offset 0, offset 1, ..., offset 4821044]
  consumer group committed at offset 4000000, can resume from 4000001

Iceberg model:
  snapshot 8273659182
    manifest-list.avro
      manifest-001.avro -> data-001.parquet  (rows 0-131071)
      manifest-002.avro -> data-002.parquet  (rows 0-131071)
  no per-record offset; the snapshot is the atomic unit of progress

A storage engine serving both models must maintain a mapping from Kafka partition offsets to Iceberg file positions. This mapping must survive compaction runs, broker restarts, and Iceberg snapshot expiration. A Kafka consumer seeking to offset 4,000,001 must be translated to a specific byte range within a specific Parquet file, efficiently, on every fetch.

Parquet is not designed for this access pattern. Row groups are typically 128 MB to 256 MB, sized for analytical scan throughput. Per-record random access requires an auxiliary offset index, a lightweight structure mapping partition offsets to row group boundaries, stored alongside the Parquet files and kept consistent with the Iceberg snapshots that reference those files. If the index diverges from the actual file contents, consumers receive incorrect data or fail with seek errors.

Three Architectural Approaches

There are meaningfully different designs that all qualify as Iceberg-first, and they make different tradeoffs.

Write Parquet, derive offsets from a sidecar index. The broker writes Parquet files to object storage, commits Iceberg snapshots, and maintains an offset-to-row-range mapping as an auxiliary artifact alongside the data files. Kafka consumers use standard offset semantics; the storage layer resolves offset seeks via the sidecar index. This preserves full protocol compatibility at the cost of index consistency overhead and an additional artifact that must be kept in sync with the Iceberg metadata.

Snapshot ID as the consumer’s unit of progress. Map Iceberg snapshot IDs to logical consumer positions and track progress at snapshot granularity rather than per-record. This simplifies the storage engine considerably, but breaks consumers that rely on fine-grained offset semantics: lag monitoring tools, per-message replay, and any consumer that commits offsets more frequently than snapshot boundaries. That covers most of the Kafka ecosystem.

Embed offsets as a Parquet column. Assign offsets at write time and store them as a regular column in the Parquet files. Seeking to an offset becomes an Iceberg scan with a predicate on that column, using Parquet column statistics (lower and upper bounds stored per row group in the Iceberg manifest file entries) to prune irrelevant row groups. Offset semantics emerge from Iceberg’s predicate pushdown machinery, without a separate sidecar index to keep in sync. The tradeoffs are slightly larger files, additional scan planning overhead for seeks into large historical datasets, and the requirement that offsets be assigned before writing Parquet, which moves offset assignment from the partition log to the storage engine itself.

The third approach has the most architectural coherence for a system where the Iceberg table is the authoritative state. It is also the least obvious, and it requires that the storage engine own offset assignment rather than deriving it from Kafka’s internal log machinery.

Compaction and Consumer Progress

High-frequency Iceberg commits, which are necessary to keep consumer lag in the seconds rather than minutes, produce many small Parquet files. A stream writing 100,000 records per second across 100 partitions with 1-second commit intervals produces 100 Parquet files per second, 360,000 per hour. Iceberg’s scan planning overhead scales with manifest and file count. S3 GET overhead per file makes full-table scans increasingly expensive as file counts grow. Compaction is not optional in a streaming Iceberg system.

The integration of compaction with Kafka consumer progress is where the design gets genuinely complex. When a compaction job rewrites 100 small Parquet files into 1 large file, the offset index or embedded offset column must be updated atomically with the new Iceberg snapshot that reflects the new file layout. Consumers reading from any of the old files during the compaction window have two options: complete their reads against the pre-compaction files, which requires the storage engine to retain those files until all in-flight reads complete, or fail and retry from their last committed offset.

Retaining pre-compaction files until consumers advance is structurally similar to how Iceberg handles positional delete file merges: old delete files are kept until a compaction snapshot makes them unnecessary, and the engine serves correct results across the transition window. Adapting this approach to streaming consumer progress requires tracking which consumers hold references to which pre-compaction file versions, which is a non-trivial liveness tracking problem at high consumer counts. The retry approach is simpler to implement but places the burden on consumers to have reliable committed offsets and retry logic, which is a reasonable assumption for Kafka consumers but shifts operational responsibility outward.

What the Design Has to Get Right

The elimination of the Kafka-to-Iceberg ETL pipeline is a legitimate goal. For teams running high-volume streams with long analytical retention, dual storage is expensive, freshness latency matters, and the streaming job bridging the two tiers adds failure surface and operational complexity. Collapsing stream storage and table storage into a single Iceberg-native layer removes all three costs simultaneously, and lets tools like Trino or DuckDB scan the same files that Kafka producers just wrote, without any intermediate copy.

The questions worth tracking as Ursa matures: how it handles the offset-to-row-range mapping (sidecar index, snapshot mapping, or embedded column), what compaction contract it offers to in-flight consumers, and what the write latency profile looks like compared to disk-based Kafka and object-storage-first systems like AutoMQ and WarpStream. The S3 PUT latency floor, roughly 50 to 200 milliseconds, is workable for most streaming use cases, as both of those systems have demonstrated in production. The additional constraint for an Iceberg-first design is that files must be large enough for efficient analytical queries, which is in tension with the commit frequency that low-latency streaming consumers expect.

That tension between streaming freshness and analytical file efficiency is the engineering problem at the center of the design. How it is resolved determines the system’s practical applicability across the range of use cases that Kafka currently serves.