Kafka Without the ETL Step: What an Iceberg-First Storage Engine Actually Changes
Source: lobsters
Kafka stores data in a format that no analytics engine outside the Kafka ecosystem can read natively. Every message lands in a .log segment file: a compact binary layout encoding offsets, timestamps, headers, keys, and values in a proprietary format optimized for sequential disk I/O and replication. That format is very good at what it does. It is not, however, a Parquet file, and Spark, Trino, DuckDB, and every other query engine will not open it without help.
This is the root of a problem that has driven years of connector development. The Kafka Connect S3 Sink, Flink’s Iceberg sink connector, Hudi’s HoodieStreamer, Confluent’s Tableflow: every one of these tools exists to translate Kafka’s log into something a query engine can read. They differ in latency, operational complexity, and exactly-once guarantees, but they share the same structural constraint. The data lives in two places, and the copy in the lake is always behind the copy in Kafka.
Ursa approaches this differently. Rather than building a better translation layer, it replaces Kafka’s storage engine so that the broker writes Apache Iceberg data files natively. The Kafka topic and the Iceberg table become the same artifact, not two synchronized copies.
Why Connectors Hit a Ceiling
The current best practice for low-latency Iceberg ingestion from Kafka is a Flink cluster running the Iceberg sink, committing data files on each checkpoint. The checkpoint interval is the fundamental latency floor: you cannot make data available in the Iceberg table faster than you can complete a Flink checkpoint, and completing a checkpoint under load takes time. A 60-second checkpoint interval at 100K records/sec with 1 KB messages produces roughly 100 MB Parquet files, which is a reasonable target size. Drop to 10-second checkpoints and you get 10 MB files; at 5 seconds, you are at 5 MB, well into the small-file regime where Iceberg’s metadata layer starts to pay a real cost.
Small files are a structural problem in Iceberg. The spec organizes data files into manifests (Avro files listing data file paths and column-level statistics), and manifests into a manifest list per snapshot. Query planning must scan the manifest list to identify relevant files, then scan manifests to apply file-level pruning. When a table has millions of small files, that manifest scan dominates query planning time even before a single data byte is read.
The operational response is compaction: Iceberg’s RewriteDataFilesAction merges small files into larger target files, rewriting manifests in the process. This works, but it introduces a second process to operate, one that must run continuously enough to prevent manifest bloat, and that competes for I/O with the streaming writer. At 1 TB/day ingestion, a compaction job running against the same S3 bucket is not a background concern.
Kafka’s own tiered storage (KIP-405, production-ready since Kafka 3.7) takes a different path. It keeps the log segment format intact and offloads sealed segments to object storage. Consumers fetch from remote segments transparently; the broker proxies reads from S3 when local retention has expired. This is operationally clean and eliminates most of the broker disk cost, but the segments on S3 are still Kafka binary format. No analytics engine reads them. Confluent’s Tableflow re-exposes tiered storage segments as Iceberg by maintaining a separate metadata layer over the uploaded segments, but it is a managed proprietary feature tied to Confluent Cloud.
What Iceberg-First Means at the Storage Level
The Apache Iceberg table format is not a storage engine; it is a specification for organizing files on object storage with a rich metadata layer on top. The hierarchy runs from data files (Parquet, ORC, or Avro) referenced by manifest files (Avro files containing file paths and per-column statistics), through a manifest list per snapshot, up to table metadata JSON and a catalog entry. Every write creates an immutable snapshot. Readers always operate on a consistent snapshot. This gives Iceberg ACID semantics on object storage without locking.
For Ursa to make Iceberg the native broker format, each Kafka partition maps to an Iceberg table. Producing a batch of messages means writing a Parquet file, updating the manifest, and committing a new Iceberg snapshot. The topic is registered in an Iceberg catalog at write time, so any Iceberg-compatible engine (Trino, Spark, DuckDB via the iceberg-rust reader) can query it without any ETL step.
The Kafka offset is the crux of the implementation challenge. Iceberg was designed for analytics: full table scans, column pruning via row group statistics, partition pruning via hidden partition transforms. It was not designed for point lookup by a monotonically increasing integer key, which is precisely what a Kafka consumer does when it fetches from offset N. With data sorted by offset (which it will be, since Kafka appends in order) and Parquet row group min/max statistics on the offset column, a consumer can skip all row groups where max_offset < N. This is efficient for tail reads, where most row groups are recent and only the last few are relevant. It degrades for historical reads against a table with many small row groups, which brings the small-file problem back through a different door.
The commit latency and file size tradeoff is the same equation as the Flink connector case, but now it is internal to the broker rather than managed by an external job. The difference is that the broker has context the connector does not: it knows the partition throughput, the consumer lag, and the checkpoint state. A storage engine that adapts its commit interval to current write rate can do something a stateless Connect worker cannot.
The Replication Problem
Kafka’s durability model depends on the ISR (In-Sync Replicas) protocol. Before acknowledging a produce request with acks=all, the leader waits for all in-sync followers to confirm they have written the data to their local log. This synchronous replication across broker disks is what makes Kafka durable against broker failure without external storage dependency.
With Iceberg on S3, the durability model shifts. S3 provides eleven nines of object durability through its own replication, so data files written to S3 are durable without Kafka-level replication. But the ISR protocol was designed around local disk writes, and adapting it to coordinate object storage commits requires rethinking what “acknowledged” means. Does a produce request complete when the Parquet write is flushed to S3? Does it complete when the Iceberg snapshot is committed and visible in the catalog? The latency of those operations is measured in tens to hundreds of milliseconds, compared to single-digit millisecond local disk writes on NVMe.
This is not a reason to abandon the approach. Apache Paimon, the closest open-source analog to Ursa, handles write durability through a combination of object storage durability and a write-ahead log for in-flight data. Paimon graduated as an Apache top-level project in 2024 and provides streaming lakehouse storage for Flink with LSM-tree structures on S3, supporting both append-only log tables and changelog tables with upsert semantics. Paimon’s design demonstrates that the replication problem is solvable; the solution just looks different from Kafka’s broker-to-broker ISR.
The partition reassignment case, by contrast, becomes much simpler with object storage backing. Moving a Kafka partition from one broker to another with local disk storage requires copying gigabytes of log data, a process that generates inter-broker traffic and takes time proportional to partition size. With Iceberg on S3, partition reassignment is a metadata operation: update the catalog entry, update the partition ownership in the cluster metadata. The data files stay where they are.
The Architecture That Collapses
The conventional streaming data architecture has three tiers. Kafka holds hot data for hours to days, bounded by broker disk capacity. A streaming ETL pipeline (Flink, Spark Structured Streaming, or Kafka Connect) reads from Kafka and writes to a data lake on object storage. Analytics engines query the lake. Operational consumers read from Kafka. These two read paths operate on different copies of the data, with a lag between them that is governed by the ETL pipeline’s commit frequency.
The Lambda architecture introduced batch and speed layers to handle this split, and the streaming lakehouse pattern emerged as an attempt to collapse them by making the lake queryable with low latency. Systems like Paimon, Apache Hudi with MergeOnRead tables, and Iceberg with Flink sinks all move in this direction, reducing the lag to minutes. But they still require a separate pipeline to populate the lake from Kafka.
If Kafka’s storage layer is Iceberg, the pipeline disappears. Topics are Iceberg tables in a catalog from the moment the first message is written. A Trino query and a Kafka consumer read from the same Parquet files with the same freshness, bounded only by the broker’s commit interval. Schema management consolidates: instead of maintaining a Confluent Schema Registry definition alongside a separate lake catalog schema and keeping them in sync, there is one Iceberg schema. Iceberg’s schema evolution spec handles column additions, renames, and type widening safely through field IDs rather than field names.
The cost model changes too. Tiered storage reduces broker disk cost by offloading old segments to S3, but those segments are still read through the broker for any consumer that seeks into them. With Iceberg on S3, analytics queries that read historical data bypass the broker entirely and read from object storage directly. At scale, the difference in egress and compute cost between proxied broker reads and direct S3 reads for analytics workloads is significant.
What Remains Open
Ursa’s direction is coherent, and the problems it targets are real. The implementation quality will depend on a handful of specific decisions that are not resolved by the architectural sketch alone: the offset-indexed read performance under high consumer concurrency, the internal compaction strategy under producers writing at millions of records per second, and the consistency semantics for object storage commits under the produce acknowledgment path.
Paimon’s published benchmarks show 200 to 400 MB/s ingestion throughput per Flink TaskManager with LSM-tree storage on object storage, and 2 to 5x better read performance versus Hudi MergeOnRead after compaction. These numbers are encouraging as a ceiling for what Iceberg-backed Kafka storage can achieve, but Paimon’s access patterns are Flink table queries, not Kafka consumer offset fetches. The consumer workload is the harder problem, and the benchmarks that matter for Ursa are the ones that measure it.
The Kafka ecosystem has a history of projects that looked compelling in design and difficult in production: tiered storage took years from initial KIP to general availability, KRaft took even longer. A new storage engine that changes the fundamental durability model of a broker is not a small undertaking. But the direction, collapsing the streaming lake into the broker’s own storage layer by treating Iceberg as the native format rather than the export target, is the right one. Every layer of ETL between Kafka and the lake is complexity that exists only because the two formats are different. Ursa’s premise is that they do not have to be.