Simulating Catalog and Table Conflicts


tl;dr

The root and leaves of table formats - catalog services and replacements for Parquet files - are getting a lot of attention in industry and academia. However, the inner nodes of table formats like Apache Iceberg limit write throughput to 1-2 commits per second, even when running well-behaved workloads under optimistic conditions.

  1. Sustained commit rates above 1-2 commits/sec are unattainable without making some transactions uncommittable.

  2. Storage I/O is the primary bottleneck for single-table workloads, even with an unrealistically fast catalog. Catalog compare-and-set (CAS) latency up to 120ms adds only modest overhead for single-table workloads.

  3. IO cascades extend tail latency. Conflicts requiring work proportional to the number of snapshots committed not only increase tail latency, they also create IO convoys that space out commit attempts.

Why bother?

Table formats are designed for read-dominant workloads with low update rates: non-blocking reads, atomic updates, and cheap, scalable metadata at rest. Conflict is rare by construction. If table update rates are steady, it’s usually an ingest workload with a dedicated writer and occasional conflicts caused by maintenance transactions. In multi-writer workloads, write rates are expected to be low, and an exponential backoff will probe around spikes in load.

Table format commit protocols are like an airport without a control tower. Absent coordination, either aircraft land infrequently (read-dominant workload) or each aircraft takes off knowing when it’s supposed to land (dedicated or externally coordinated writers). The table format community seems to have reached a consensus that airports must have control towers (i.e., catalog services) to sustain higher throughput. So: when do we need a control tower?

Given what we learned about the latency of conditional operations in object stores, when does the cost of resolving conflicts limit throughput for a file-based catalog? This post will focus on single-table workloads. We’ll explore multi-table catalogs and architectural alternatives in future posts.

Background: Table Formats and the Catalog

Table formats like Apache Iceberg, Apache Hudi, Delta Lake, and Lance specify conventions used by participants to ensure both that readers access consistent snapshots, and that updates to table data are ordered. Unlike traditional database systems, writers autonomously self-certify the consistency of their transaction against the current state of the table before atomically installing a new snapshot. There is no coordination across transactions outside of storage; in many settings, the set of running transactions is neither recorded nor discoverable.

Write transactions follow a coarse-grained, optimistic 3-phase lifecycle: versioned action, validation, and commit. Versioned actions track the snapshot version for reads and sequester writes outside the tree of objects reachable from the visible table. Commit is a race: transactions attempt to atomically change the current version to reflect their updates. Successful commits update the shared/visible table state; failed commits must retry after repairing their prepared transaction.
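The commit race above can be sketched as a compare-and-swap loop. This is a toy sketch, not Iceberg's API: `Catalog`, `commit`, and the `repair` callback are illustrative names for the roles the protocol defines.

```python
import threading

class Catalog:
    """Toy catalog: one atomically swapped root pointer per table."""
    def __init__(self):
        self._lock = threading.Lock()
        self._roots = {}  # table name -> root metadata location

    def root(self, table):
        return self._roots.get(table)

    def cas(self, table, expected, new):
        """Install `new` only if the root is still `expected`."""
        with self._lock:
            if self._roots.get(table) != expected:
                return False
            self._roots[table] = new
            return True

def commit(catalog, table, read_root, build_root, repair, retries=10):
    """Optimistic commit: sequester writes, then race to swap the root.

    On a lost race, re-read the current root, validate/merge metadata via
    `repair`, and try again until the retry budget is exhausted.
    """
    expected, new_root = read_root, build_root(read_root)
    for _ in range(retries):
        if catalog.cas(table, expected, new_root):
            return new_root
        expected = catalog.root(table)   # someone else won; observe their commit
        new_root = repair(expected)      # merge our changes onto the new state
    raise RuntimeError("retry budget exhausted")
```

Note that all coordination lives in the single `cas` call; everything else a writer does is invisible to other transactions until the swap succeeds.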

Iceberg read and write paths. Reads: (1) get table root from catalog, (2) read root object, (3) follow snapshot pointers, (4) read data files. Writes: (1) record read snapshot, (2) write new data and metadata, (3) create new root, (4) CAS at catalog; on failure, (5) validate intervening commits, (6) write merged metadata, (7) retry CAS.
Read path and write path (not illustrated) in Apache Iceberg

Also unlike traditional database pages, data referenced in a snapshot contains only committed data. Readers may filter tombstoned data, but a snapshot never references uncommitted data written by an active transaction.

Catalog

The Iceberg Catalog - the saddest table of all time - manages a tiny amount of state: the locations of the root object for every table[1]. Like the root of a CoW B-tree, an update to a table shadows all affected nodes, including the root.

The catalog is a natural bottleneck for concurrent transactions… right?

Yes and no. Conflicts on this pointer will cause losing transactions to verify and repair against table data committed since their read snapshot. If the data files and manifest files remain valid as written, then the transaction only needs to merge metadata references. Concretely, it must read the current table metadata (a JSON blob at the root of the table; ~1MiB, but potentially larger) and a subset of the manifest list(s) committed since the read snapshot (tens of them, each a multiple of ~100KiB). The transaction merges and writes a new manifest list and table metadata, then retries the commit at the catalog.
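A back-of-envelope estimate of the serialized cost of preparing one retry, assuming median latencies of ~27ms per GET and ~60ms per PUT (the distributions used later in this post) and up to 4 reads in flight. The function and its parameters are ours, for illustration:

```python
import math

def retry_prepare_cost_ms(manifest_lists, get_ms=27, put_ms=60, parallelism=4):
    """Rough cost to repair a lost commit: read the current table metadata
    plus the manifest lists committed since the read snapshot (up to
    `parallelism` GETs in flight), then write a merged manifest list and
    a new table metadata object."""
    reads = 1 + manifest_lists                        # table metadata + lists
    read_ms = math.ceil(reads / parallelism) * get_ms
    write_ms = 2 * put_ms                             # manifest list + metadata
    return read_ms + write_ms
```

Even with zero intervening manifest lists, a retry costs two serialized writes plus a read, which is why the repair path, not the pointer swap, sets the pace.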

Conflicts detected at the catalog limit throughput, but the latency of the root is not necessarily the bottleneck.

Workload Mix: FastAppend (FA) and ValidatedOverwrite (VO) Transactions

Our workload will mix “light” and “heavy” transactions from Iceberg.

A FastAppend transaction is our “light” transaction type that appends new data files and metadata (blind writes). Conflicts with other transactions are resolved by reading the table manifest and the latest manifest list, merging and writing both to create a new snapshot, then retrying the commit at the catalog.

A ValidatedOverwrite transaction is our “heavy” transaction type: an overwrite with validation enabled. It rewrites and replaces data files and metadata, so it needs to examine all the manifest lists committed between its read snapshot and the current snapshot to check for conflicts. It can then write the merged manifest list and table metadata and retry the commit at the catalog.

In what follows, we assume that all conflicts are spurious and require no reads beyond the manifest list, and that failures are caused only by exhausting the retry budget at the catalog.

Simulating Commit Throughput

Using a discrete event simulator (Endive[2]), we can submit a transaction workload against a virtual catalog and object store. The simulator models the transaction lifecycle, including the cost of preparing retries after conflicts, and the latency of storage operations.

1a We start with an optimistic baseline: what happens to transaction throughput and latency if we use an implausibly fast catalog (1ms) and only FastAppend (light) transactions in a single table? We sweep mean inter-arrival times from 20ms (50 commits/sec) up to 5000ms (0.2 commits/sec).

1b Then we add ValidatedOverwrite (heavy) transactions into the mix. The rate a single table can sustain is lower than whatever commit rate causes these transactions to fail. By analogy, the rate at which small planes land safely is irrelevant if your airport is ringed by wrecked cargo planes that circled until they ran out of fuel. Using the same 1ms catalog, we sweep the ratio of FA (light) and VO (heavy) transactions to see how the workload mix affects success rates and latency.

2a,2b With this baseline for a single table and an instant catalog, we add another dimension: catalog latency. We sweep the compare-and-set (CAS) latency from 1ms up to 120ms. We run this sweep for both FA-only and 90/10 FA/VO workloads to measure success rate and commit latency sensitivity to catalog latency, still for a single table.

Experiment Summary

| Exp | Description | Fixed | Swept | Configs |
|-----|-------------|-------|-------|---------|
| 1a | FA baseline, instant catalog | 1 table, 1 group, FA=100%, instant catalog (1ms), S3, conflicts=0% | inter_arrival_scale | 10 |
| 1b | FA/VO operation mix | 1 table, 1 group, instant catalog (1ms), S3, conflicts=0% | fast_append_ratio, inter_arrival_scale | 80 |
| 2a | Catalog CAS latency (FA) | 1 table, 1 group, FA=100%, S3, conflicts=0% | catalog_latency_ms, inter_arrival_scale | 70 |
| 2b | Catalog CAS latency (mix) | 1 table, 1 group, FA=90%/VO=10%, S3, conflicts=0% | catalog_latency_ms, inter_arrival_scale | 70 |
| Parameter | Values | Description |
|-----------|--------|-------------|
| inter_arrival_scale | [20, 50, 100, 200, 300, 400, 500, 1000, 2000, 5000] ms | Scale parameter for the exponential distribution of transaction inter-arrival times. Lower values correspond to higher transaction rates. |
| fast_append_ratio | [1.0, 0.9, 0.8, 0.7, 0.5, 0.3, 0.1, 0.0] | Ratio of FastAppend (light) to ValidatedOverwrite (heavy) transactions in the workload mix. 1.0 means all transactions are FastAppend; 0.0 means all are ValidatedOverwrite. |
| catalog_latency_ms | [1, 5, 10, 20, 50, 80, 120] ms | Latency of the catalog's compare-and-set (CAS) operation in milliseconds. This models the time it takes for a transaction to attempt a commit and receive a response from the catalog. |
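The first two parameters map directly onto workload generation. Here is a sketch of how one run's arrivals and FA/VO mix could be drawn, assuming exponential inter-arrivals and an independent Bernoulli draw per transaction; the function name is ours, not Endive's:

```python
import random

def generate_workload(n, inter_arrival_scale_ms, fast_append_ratio, seed=0):
    """Yield (arrival_time_ms, txn_type) pairs for one simulated run."""
    rng = random.Random(seed)
    t = 0.0
    for _ in range(n):
        # Exponential inter-arrivals with the given scale (mean) in ms.
        t += rng.expovariate(1.0 / inter_arrival_scale_ms)
        # Independent coin flip decides light vs heavy transaction.
        kind = "FA" if rng.random() < fast_append_ratio else "VO"
        yield t, kind
```

A scale of 500ms yields roughly 2 transactions/sec on average, with the usual Poisson bursts that make steady-rate workloads harsher than a fixed schedule.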

In all experiments, the manifest list and table metadata sizes are fixed (10KiB and 100KiB, respectively). These experiments do not use the conditional operations measured earlier; they use unconditional GET and PUT operations for S3:

Latency distributions for S3 Standard

Distribution Parameters

GET (unconditional read)

Modeled as Lognormal(mu=ln(median), sigma), floored at min_latency_ms.

| Operation | median (ms) | sigma | min_latency (ms) |
|-----------|-------------|-------|------------------|
| GET | 27 | 0.62 | 10 |

GET operations don’t include sizes because latency is dominated by fixed overheads at these sizes. The current simulator uses the size-based formula for PUT, but in these simulations latencies are drawn from the above lognormal distribution.

PUT (unconditional write)

Modeled as Lognormal(mu=ln(base + rate * size_MiB), sigma), floored at min_latency_ms.

| Operation | base (ms) | rate (ms/MiB) | sigma | min_latency (ms) |
|-----------|-----------|---------------|-------|------------------|
| PUT | 60 | 20 | 0.29 | 10 |

Percentiles

| Operation | p5 | p10 | p25 | p50 | p75 | p90 | p95 | p99 |
|-----------|----|-----|-----|-----|-----|-----|-----|-----|
| GET | 10 | 12 | 18 | 27 | 41 | 60 | 75 | 114 |
| PUT | 37 | 42 | 50 | 60 | 73 | 87 | 97 | 118 |
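Both distributions can be sampled directly. A minimal sketch mirroring the formulas above, with parameter defaults copied from the tables (function names are ours):

```python
import math
import random

def get_latency_ms(rng, median=27.0, sigma=0.62, min_ms=10.0):
    """GET: Lognormal(mu=ln(median), sigma), floored at min_latency."""
    return max(rng.lognormvariate(math.log(median), sigma), min_ms)

def put_latency_ms(rng, size_mib, base=60.0, rate=20.0, sigma=0.29, min_ms=10.0):
    """PUT: Lognormal(mu=ln(base + rate * size_MiB), sigma), floored."""
    return max(rng.lognormvariate(math.log(base + rate * size_mib), sigma), min_ms)
```

Sampled medians should land near the table's p50 values: ~27ms for GET and ~60ms for a small PUT, since the median of a lognormal is exp(mu).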

All experiments use 5 seeds, retry=10, txn runtime mean=180s. Up to 4 I/O operations can run in parallel[3]. Each run simulates 1 hour, with the first and last 15 minutes excluded as warmup/cooldown. Config counts exclude seeds.

Transactions retry (10x) immediately rather than backing off, since the workload submits transactions at a steady rate. By default, Iceberg transactions retry 4 times with an exponential backoff that starts at 100ms, doubling up to 1 minute between attempts (max 30 minutes). This would be strictly worse than the “immediate retry” strategy for this workload.
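For reference, that default doubling schedule works out to only a handful of short waits. A sketch of the schedule described above (Iceberg's actual retry utility is more involved):

```python
def backoff_schedule_ms(attempts=4, start_ms=100, cap_ms=60_000):
    """Waits before each retry: start at 100ms, double each time, cap at 1 min."""
    return [min(start_ms * 2 ** i, cap_ms) for i in range(attempts)]

print(backoff_schedule_ms())  # [100, 200, 400, 800]
```

With a steady arrival process there is no quiet period for the backoff to find, so waiting only cedes the commit race to freshly arrived transactions.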

Note: Each set of heatmaps often includes more than the two displayed by default. Click on an image to view the gallery and flip through them. The full set of plots is also here.

1a. Single table, FastAppend (FA) workload

In this workload, we have a single table and only FastAppend transactions. The catalog is “instant” with a fixed latency of 1ms. If we’re assuming clients follow the protocol as specified, then the only bottleneck should be the cost of preparing retries.

Success rate vs throughput for single-table FastAppend, 1ms catalog. 100% success up to 2.7 c/s, then 99% at 4 c/s, 74% at 6 c/s, 42% at 7 c/s, 19% at 7.8 c/s. P50, P95, P99 commit latency vs throughput for single-table FastAppend. P50 rises from 320ms to 960ms. P95 and P99 climb steeply, converging near 1750ms at saturation. Annotations show success rates declining: 99%, 74%, 42%, 19%.
Single-table transactions with only trivial conflicts

| Throughput (c/s) | Success Rate (%) | P50 Latency (s) | P95 Latency (s) | P99 Latency (s) | Mean Retries |
|------------------|------------------|-----------------|-----------------|-----------------|--------------|
| 0.2 ± 0.0 | 100.0 ± 0.0 | 0.32 ± 0.00 | 0.43 ± 0.01 | 0.53 ± 0.03 | 1.0 |
| 0.4 ± 0.0 | 100.0 ± 0.0 | 0.32 ± 0.00 | 0.47 ± 0.01 | 0.59 ± 0.03 | 1.1 |
| 0.8 ± 0.0 | 100.0 ± 0.0 | 0.33 ± 0.00 | 0.52 ± 0.01 | 0.67 ± 0.04 | 1.2 |
| 1.6 ± 0.0 | 100.0 ± 0.0 | 0.34 ± 0.00 | 0.64 ± 0.01 | 0.85 ± 0.03 | 1.4 |
| 2.1 ± 0.0 | 100.0 ± 0.0 | 0.35 ± 0.00 | 0.70 ± 0.01 | 0.96 ± 0.03 | 1.5 |
| 2.7 ± 0.0 | 99.9 ± 0.0 | 0.37 ± 0.00 | 0.86 ± 0.02 | 1.26 ± 0.04 | 1.8 |
| 4.0 ± 0.0 | 98.7 ± 0.1 | 0.47 ± 0.00 | 1.25 ± 0.02 | 1.67 ± 0.03 | 2.5 |
| 6.1 ± 0.0 | 74.0 ± 0.3 | 0.78 ± 0.01 | 1.72 ± 0.01 | 1.89 ± 0.00 | 4.5 |
| 6.9 ± 0.0 | 42.1 ± 0.1 | 0.90 ± 0.00 | 1.74 ± 0.01 | 1.89 ± 0.01 | 5.0 |
| 7.8 ± 0.0 | 19.0 ± 0.1 | 0.96 ± 0.01 | 1.74 ± 0.00 | 1.88 ± 0.01 | 5.3 |

Single-table, FastAppend (FA) workload

Around 2-3 commits/sec, transactions start to time out after 10 retries. For FastAppend transactions, the cost of preparing a retry is around 300ms, so when the arrival rate exceeds 3-4 commits/sec, we start to see failures and higher latency from repeated retries.
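A rough capacity check makes that ceiling plausible. If only one transaction wins each commit race and each loser spends ~300ms preparing its next attempt, the table admits roughly one commit per prepare interval. This is our simplification, not the simulator's logic:

```python
prepare_ms = 300                        # approximate FA retry-preparation cost
max_rate = 1000 / prepare_ms            # one winner per prepare interval
print(f"~{max_rate:.1f} commits/sec")   # ~3.3 commits/sec
```

Arrivals above that rate can only queue up as retries until the budget runs out.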

Another way of looking at it: 20 TPS per table (a 50ms mean inter-arrival time) is completely unattainable. Only 42% of transactions succeed, even in this rosy setting.

Takeaway: the most optimistic single-table transaction rate is low (3-4 commits/sec), dominated by the cost of reading, merging, and writing metadata in the object store.

1b. Single table, mixed workload (FastAppend (FA), ValidatedOverwrite (VO))

In this experiment, a single arrival process is shared by a mix of FastAppend and ValidatedOverwrite transactions. The ratio of FA to VO transactions (fast_append_ratio) on the y-axis is swept from 1.0 (all FA) down to 0.0 (all VO). The same arrival rates are swept as in the previous experiment, with the same "instant" catalog. The 1.0 row is the same configuration as in experiment 1a, but with different seeds.

Heatmap of FA success rate by inter-arrival time (x, 20-5000ms) and FA ratio (y, 0.1-1.0). At 20ms, FA success ranges from 20% (FA=1.0) to 75% (FA=0.1) as fewer FA transactions compete. At 200ms+, all ratios reach 99-100%. Heatmap of VO success rate by inter-arrival time and FA ratio. VO is far more sensitive: near 0% at 20ms for high FA ratios. At 100ms, ranges from 20% (FA=0.9) to 62% (FA=0.0). Requires 300ms+ inter-arrival for 98%+ success.
1b. Single-table, FastAppend (FA)/ValidatedOverwrite (VO) Success rates

Unsurprisingly, FastAppend transactions are more likely to commit as the mix includes more ValidatedOverwrite transactions; retrying a FA transaction is cheaper. However, with a mix of 90% FastAppend and 10% ValidatedOverwrite transactions, the maximum sustainable arrival rate drops to around 2 commits/sec.

We’ll continue to measure FA-only as a baseline, but I’d argue that VO throughput is the “real” sustainable commit rate per table, following the protocol as specified. Higher rates require coordination outside the table format protocol; only participating writers (i.e., the catalog operator’s writers) get reliable commit latencies.

Heatmap of FA mean commit latency by inter-arrival time and FA ratio. Dominated by arrival rate: ~980ms at 20ms inter-arrival, ~327ms at 5000ms. FA ratio has little effect. Hatched cells where success is below 95% or 80%. Heatmap of VO mean commit latency by inter-arrival time and FA ratio. Reaches tens of seconds at high contention but is largely insensitive to FA ratio at a given arrival rate. Nearly all high-load cells are hatched (low success).
1b. Single-table, FastAppend (FA)/ValidatedOverwrite (VO) Latency

The latency heatmaps show that VO transactions are remarkably insensitive to the workload mix; there is almost no difference in p50/p95/p99 latencies for VO transactions for a given arrival rate. Until VO transactions start to fail, even a workload composed of 100% VO transactions looks similar to the 90/10 split. Weird.

To understand why, recall that VO transactions read all the manifest lists committed since the read snapshot to prepare a retry. The first commit attempt needs to read about three minutes of manifest lists (180s mean txn time) to prepare. This takes long enough that it will almost certainly fail. Its next attempt will read fewer manifest lists - only those committed while it was preparing - but it's probably still too many to succeed on that attempt. When the transaction finally commits, its retries have effectively spaced out commit attempts, with (roughly) exponentially reduced retry work per attempt.

VO transaction retries form IO convoys that, in effect, order VO commit attempts from the workload. This makes sense, but it surprised me; I wouldn’t have anticipated the manifest list processing could be this expensive. Even with an immediate retry, p99 commit latencies approach 2.5 minutes at a 300ms arrival rate.
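A back-of-envelope model of the convoy (our simplification: it counts only manifest-list GETs, at ~27ms each with 4 reads in flight, and ignores all other I/O):

```python
import math

def vo_retry_backlog(commit_rate_cs, txn_runtime_s=180.0, get_s=0.027, parallelism=4):
    """Manifest lists a VO transaction must read on each successive attempt."""
    backlog = commit_rate_cs * txn_runtime_s     # snapshots since read snapshot
    reads_per_attempt = []
    while backlog >= 1:
        reads_per_attempt.append(round(backlog))
        # Time to read the backlog, `parallelism` GETs at a time.
        prepare_s = math.ceil(backlog / parallelism) * get_s
        # New commits land while we were preparing; they are the next backlog.
        backlog = commit_rate_cs * prepare_s
    return reads_per_attempt
```

At 2 commits/sec the first attempt must read hundreds of manifest lists, and the backlog then collapses roughly geometrically, which is exactly the spacing-out of commit attempts visible in the latency plots.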

Takeaway: arrival rates above 2 commits/sec will make some transactions practically uncommittable, even if we assume all conflicts are trivial to repair and require no reads beyond the manifest list. Transactions that need to do work proportional to the size of the conflict - no matter how minor - will form IO convoys in settings with steady commit rates.

2a. Single table, varied catalog latency

Next we establish a new baseline, varying catalog latency for a single table with only FastAppend transactions. Our goal is to understand how sensitive success rates and latencies are to catalog latency for a single table. The bottom row of the success heatmap (i.e., 1ms catalog latency) is the same configuration as in experiment 1a, but with different seeds.

Heatmap of FA success rate by catalog CAS latency (y, 1-120ms) and inter-arrival time (x). At 200ms inter-arrival, success drops modestly from 99% (1ms CAS) to 76% (120ms CAS). At 500ms+, all CAS latencies achieve 99-100%. At 20ms, all collapse to 10-20%. Heatmap of FA mean commit latency by catalog CAS latency and inter-arrival time. At 5000ms inter-arrival, latency grows from 327ms (1ms CAS) to ~700ms (120ms CAS). At 20ms, all near 1s regardless of CAS. Hatched at high load.
Exp 2a: Single table, FastAppend only, varied catalog latency. Success rate (left) and mean latency (right) heatmaps showing sensitivity to catalog CAS latency (1–120ms) and inter-arrival time.

Results match intuition: the rate at which transactions start to fail is only slightly affected by catalog latency, since the bottleneck is the cost of preparing retries. FA-only workloads start to fail at 2-3 instead of 3-4 commits/sec at the higher catalog latencies.

At a given arrival rate, commit latency increases with catalog latency, but retries don't cause notable differences below 80ms catalog latency.

Takeaway: for a single table, the cost of preparing retries dominates the bottleneck, so catalog latency has limited impact on success rates and commit latency.

2b. 90/10 FA/VO workload, varied catalog latency

Next we add ValidatedOverwrite transactions back into the mix, with a 90/10 FA/VO workload.

Heatmap of FA success rate in 90/10 FA/VO mix by catalog CAS latency and inter-arrival time. Similar to 2a but slightly worse: at 200ms inter-arrival, drops from 99% (1ms CAS) to 78% (120ms CAS). At 500ms+, all reach 99-100%. Heatmap of VO success rate in 90/10 mix by catalog CAS latency and inter-arrival time. Near 0% at 20-50ms for all CAS latencies. At 200ms, drops from 84% (1ms CAS) to 41% (120ms CAS). Requires 500ms+ for 97%+ at low CAS, 1000ms+ at 120ms CAS.
Exp 2b: Single table, 90/10 FA/VO mix, varied catalog latency. Success rate heatmaps for FastAppend (left) and ValidatedOverwrite (right) across catalog latencies and inter-arrival times.

Unsurprisingly, FA transaction success rates are almost identical to 2a (FA-only) at all catalog latencies and inter-arrival times. At higher catalog latencies, VO success rates degrade more rapidly between 1 and 2 commits/sec. At 120ms catalog latency, success rates drop earlier and more sharply, starting at 1 commit/sec and plummeting thereafter.

Heatmap of FA mean commit latency in 90/10 mix by catalog CAS latency and inter-arrival time. Similar to 2a. At 5000ms inter-arrival, ranges from 328ms (1ms CAS) to 707ms (120ms CAS). Hatched at high load. Heatmap of VO mean commit latency in 90/10 mix by catalog CAS latency and inter-arrival time. Reaches tens of seconds at high arrival rates across all CAS latencies. At 500ms inter-arrival, ranges from seconds (1ms CAS) to tens of seconds (120ms CAS). Nearly all cells hatched.
Exp 2b: Single table, 90/10 FA/VO mix, varied catalog latency. Mean latency heatmaps for FastAppend (left) and ValidatedOverwrite (right) across catalog latencies and inter-arrival times.

Mean VO latencies are modestly higher as the catalog latency increases, but continue to be mostly determined by the arrival rate.

Takeaway: catalog latency has a more pronounced effect on VO success rates in a mixed (90/10) workload, though the effect remains minor below 80ms. At rates where VO transactions are viable (inter-arrival times above 1s), catalog latency has a minor impact on commit latency.

Conclusions

Single-table commit throughput is limited by the cost of preparing retries, not by the catalog. With an unrealistically fast (1ms) catalog, FastAppend-only workloads top out at 3-4 commits/sec; adding even 10% ValidatedOverwrite transactions drops the sustainable rate to around 2 commits/sec. Below these rates, catalog CAS latency adds modest per-commit overhead but does not change the success rate. Above them, no catalog - however fast - can help.

The most interesting finding is that ValidatedOverwrite transactions form IO convoys under load. Because each retry must read manifest lists proportional to the number of snapshots committed since the read snapshot, the work to prepare a retry grows with contention. Retries effectively serialize VO commit attempts, with p99 latencies reaching minutes even at moderate arrival rates. This is not a catalog bottleneck or an object store bottleneck - it is the cost of the table format's own metadata protocol.

These are optimistic results. All conflicts are trivial (no real data to re-read), retries are immediate (no backoff), and the catalog is implausibly fast. Real workloads include compactions, GDPR deletions, and non-trivial overwrites that would only widen the gap. The protocol - as specified - limits single-table commit rates to low single digits per second.

In the next post, we use the CAS latency distributions measured across S3, S3 Express, Azure, and GCS to model multi-table catalogs stored in object storage. Distributing transactions across tables moves the bottleneck from per-table metadata I/O to catalog contention - a bottleneck that conditional operations can address.

  1. Also some string property maps and namespaces, but it’s barely pushing kilobytes. 

  2. The state of the repository reflects what it is: an active hobby workshop. Scraps of notes, half-baked code, and uncurated history. If it makes assumptions that need correction, I want them to be easy to find. It's not intended as a polished "release" of anything; it's a sandbox, but it's generating results that make sense to me. Running these simulations takes hours, so email me if you want the data from a full run (~6GB) of these experiments and I'll upload them. 

  3. The default in Iceberg is the number of logical processors.