Simulating Multi-Table Contention in Catalog Formats

tl;dr

Update: Corrected simulations 2026-06-09

[Part 1] [Part 2]

Table formats like Apache Iceberg were designed before conditional operations were widely available in object stores. These operations are sufficient to support Iceberg’s linearizable table update protocol, but how would they perform? Simulating multi-table commit contention at the catalog suggests:

  1. Partitions help throughput and VO tail latency. Distributing a uniform workload across 20 tables roughly doubles aggregate throughput for slower providers (1.5–2.5x) and compresses VO mean latency by 5–8x. With Zipfian skew, the most popular table converges to single-table performance, but other tables are mostly unaffected by the hot table’s contention.

  2. Provider choice is a larger lever than table count. S3 Express One Zone (S3x) sustains 14.6 c/s on a single table, which is more per-table throughput than spreading a workload across 50 tables on S3 Standard, even with an “instant” catalog. The entire commit pipeline (CAS + manifest I/O) compresses with faster storage; adding tables only helps with catalog contention.

  3. Longer-tailed distributions compound under contention. Each attempt requires multiple reads/writes in the object store. Variability extends the hazard window and makes workloads less stable overall. For example, S3 and Azure Premium have similar median CAS latency (61, 64ms), but Azure Standard’s longer tails result in more failures as it approaches saturation.

  4. GCS is not viable for catalog-as-file workloads. This follows from raw GCS CAS latency. Commit success degrades above ~0.4 commits/sec, whether the workload is pure FastAppend or a 90/10 mix. Adding tables barely helps: the per-table I/O cost is high enough that 50 tables only lifts the usable rate to ~0.7 c/s.

Usable throughput (>95% success) vs table count for all providers at FA/VO ratios of 100/0, 90/10, and 50/50. S3x sustains 14.9 c/s through 10 tables and jumps to ~36 c/s FA-only at 20+ tables (14.9 c/s for mixed workloads). S3, Azure Premium, and Azure Standard plateau at 3.7 c/s by 5 tables; S3 FA-only breaks out to 7.2 c/s at 50 tables. GCP climbs from 0.4 c/s (1 table) to 0.7 c/s at 5+ tables and flattens. Adding VO transactions drops single-table throughput for all providers but is recovered with 2+ tables for S3x and 5+ tables for others.
Simulated provider performance distributed over 1 to 50 tables (uniform). These rates are 20-50% of measured CAS saturation for these providers, due to commit protocol overhead.

The commit protocol bottleneck is well-known among table format developers; lifting it whole or in part into a dedicated service is a popular solution. Now that we’ve measured and characterized the protocol, we can explore those tradeoffs in a later post.

Commit Contention in Catalog Files

Previously we simulated single-table commit rates. Now we add another dimension: multiple tables in the catalog. These are not multi-table transactions, but rather independent table updates that physically conflict at the catalog. For example, if transactions T1 and T2 update tables A and B respectively, T1 successfully updating the catalog reference for A could cause T2 to fail its commit to B. However, repairing T2’s commit is cheaper than what we measured last time: T2 only needs to retry at the catalog, not rewrite the table metadata or its manifest list.

This models the “catalog as file” case where the entire catalog is conditionally replaced on every commit. Note that as the number of tables increases, the inter-arrival time is distributed across all the tables in the catalog; be careful not to read it as the arrival rate for a single table, which we measured before.
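
The T1/T2 example can be made concrete with a small sketch. This is not the simulator: `CatalogFile`, `compare_and_set`, and `commit` are hypothetical names for an in-memory stand-in, assuming the catalog is one versioned object that is conditionally replaced as a whole.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogFile:
    """Hypothetical in-memory stand-in for a single catalog object."""
    version: int = 0
    pointers: dict = field(default_factory=dict)  # table name -> metadata location

    def read(self):
        # Return the version together with a copy of the table pointers.
        return self.version, dict(self.pointers)

    def compare_and_set(self, expected_version, new_pointers):
        # Conditional replace: succeeds only if nobody committed in between.
        if self.version != expected_version:
            return False
        self.pointers = new_pointers
        self.version += 1
        return True


def commit(catalog, table, new_metadata, snapshot):
    """Try to swing one table pointer, based on a previously read snapshot."""
    version, pointers = snapshot
    pointers = dict(pointers)
    pointers[table] = new_metadata
    return catalog.compare_and_set(version, pointers)


catalog = CatalogFile(pointers={"A": "a/v1.json", "B": "b/v1.json"})
t1_view = catalog.read()
t2_view = catalog.read()

t1_ok = commit(catalog, "A", "a/v2.json", t1_view)  # T1 wins the race
t2_ok = commit(catalog, "B", "b/v2.json", t2_view)  # fails: catalog version moved
# Table B itself did not change, so T2 only re-reads the catalog and re-applies the CAS.
t2_retry_ok = commit(catalog, "B", "b/v2.json", catalog.read())
```

T2's first attempt fails even though it touches a different table, and its retry needs no table metadata or manifest list rewrite.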

Experiment Summary

The workload mix is the same as in the single-table experiments, composed of “light” FastAppend (FA) and “heavy” validated overwrite (VO) transactions. The salient difference between FA and VO is the I/O necessary to retry a transaction: a FA transaction needs to re-read only the latest manifest list while a VO transaction needs to read the manifest lists of all new snapshots of that table.
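
As a rough illustration of that difference, the sketch below counts manifest-list reads before a retry. The function name and the `"FA"`/`"VO"` labels are illustrative, and it assumes a retry caused only by commits to other tables needs no manifest I/O.

```python
def manifest_lists_to_read(kind, new_snapshots):
    """Manifest-list reads needed before retrying a transaction.

    kind: "FA" (FastAppend) or "VO" (validated overwrite).
    new_snapshots: snapshots committed to this table since the
    transaction's read snapshot.
    """
    if new_snapshots < 0:
        raise ValueError("new_snapshots must be non-negative")
    if new_snapshots == 0:
        return 0  # nothing changed in this table; assumed catalog-only retry
    if kind == "FA":
        return 1  # only the latest manifest list
    if kind == "VO":
        return new_snapshots  # one manifest list per new snapshot
    raise ValueError(f"unknown transaction kind: {kind}")
```

With five intervening snapshots, an FA retry reads one manifest list and a VO retry reads five.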

The workload is steady, but optimistic: it assumes no transaction needs to read beyond the manifest list to investigate or repair a conflict before retrying. Real workloads include commit attempts that do more work between attempts, increasing the chance of conflict.

Tables are selected from either uniform or Zipfian distributions, as annotated.

| Exp | Description | Fixed | Swept | Configs |
|---|---|---|---|---|
| 4a | Multi-table contention (FA) | 1 group, FA=100%, S3, conflicts=0% | num_tables, catalog_latency_ms, inter_arrival_scale | 240 |
| 4b | Multi-table contention (mix) | 1 group, FA=90%/VO=10%, S3, conflicts=0% | num_tables, catalog_latency_ms, inter_arrival_scale | 240 |
| 4c | Multi-table, real providers | 1 group, conflicts=0%, backend=storage | provider, num_tables, fast_append_ratio, inter_arrival_scale | 900 |

| Parameter | Values | Description |
|---|---|---|
| inter_arrival_scale | [20, 50, 100, 200, 300, 400, 500, 1000, 2000, 5000] ms | Scale parameter for the exponential distribution of transaction inter-arrival times. Lower values correspond to higher transaction rates. |
| fast_append_ratio | [1.0, 0.9, 0.8, 0.7, 0.5, 0.3, 0.1, 0.0] | Ratio of FastAppend (light) transactions to ValidatedOverwrite (heavy) transactions in the workload mix. 1.0 means all transactions are FastAppend, while 0.0 means all transactions are ValidatedOverwrite. |
| catalog_latency_ms | [1, 10, 50, 120] | Latency of the catalog’s compare-and-set (CAS) operation in milliseconds. This models the time it takes for a transaction to attempt a commit and receive a response from the catalog. |
| num_tables | [1, 2, 5, 10, 20, 50] | Number of tables in the catalog. This models the contention at the catalog when multiple tables are being updated concurrently. |
| provider | [s3x, s3, azurex, azure, gcp] | Cloud storage provider used for the catalog. Each provider has different CAS latency distributions, which affect the commit success rates and latencies. |
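
To relate `inter_arrival_scale` to an offered rate, here is a minimal sketch. It assumes the scale is the mean of the exponential distribution and that tables are picked uniformly; `offered_rate` and `sample_arrivals` are illustrative helpers, not simulator functions.

```python
import random


def offered_rate(inter_arrival_scale_ms):
    """Catalog-wide offered rate in commits/sec for a mean gap in ms."""
    return 1000.0 / inter_arrival_scale_ms


def sample_arrivals(inter_arrival_scale_ms, num_tables, n, seed=0):
    """Return n (arrival_time_ms, table_index) pairs.

    Gaps are exponential with mean inter_arrival_scale_ms; the arrival
    stream is shared by the whole catalog, then split across tables.
    """
    rng = random.Random(seed)
    now_ms = 0.0
    arrivals = []
    for _ in range(n):
        # expovariate takes the rate (1 / mean), not the mean itself.
        now_ms += rng.expovariate(1.0 / inter_arrival_scale_ms)
        arrivals.append((now_ms, rng.randrange(num_tables)))
    return arrivals


catalog_rate = offered_rate(100)               # 10 c/s across the catalog
per_table_rate = offered_rate(20) / 20         # 50 c/s spread over 20 tables
```

A 100ms scale offers 10 c/s to the whole catalog; a 20ms scale over 20 tables offers 2.5 c/s to each table on average.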

In all experiments, the manifest list and table metadata sizes are fixed (10KiB and 100KiB, respectively). Manifest and metadata I/O uses unconditional GET and PUT operations, not the conditional operations measured earlier. We use the same S3 Standard latencies for experiments 4a/4b as we used in the single-table experiments. We use provider distributions for experiment 4c (i.e., unconditional reads/writes for metadata, conditional writes for the catalog).

Latency distributions for S3 Standard

Distribution Parameters

GET (unconditional read)

Modeled as Lognormal(mu=ln(median), sigma), floored at min_latency_ms.

| Operation | median (ms) | sigma | min_latency (ms) |
|---|---|---|---|
| GET | 27 | 0.62 | 10 |

GET operations don’t include sizes because latency is dominated by fixed overheads at these sizes.

PUT (unconditional write)

Modeled as Lognormal(mu=ln(base + rate * size_MiB), sigma), floored at min_latency_ms.

| Operation | base (ms) | rate (ms/MiB) | sigma | min_latency (ms) |
|---|---|---|---|---|
| PUT | 60 | 20 | 0.29 | 10 |

Percentiles

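| Operation | p5 | p10 | p25 | p50 | p75 | p90 | p95 | p99 |
|---|---|---|---|---|---|---|---|---|
| GET | 10 | 12 | 18 | 27 | 41 | 60 | 75 | 114 |
| PUT | 37 | 42 | 50 | 60 | 73 | 87 | 97 | 118 |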
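
The percentiles follow directly from the lognormal parameters. The sketch below recomputes them with Python's standard library; the function names are illustrative, and the floor is applied to the percentile, which matches flooring each sample because the floor is monotone.

```python
import math
from statistics import NormalDist


def lognormal_percentile(median_ms, sigma, min_latency_ms, p):
    """p-th quantile (0 < p < 1) of Lognormal(mu=ln(median), sigma), floored."""
    z = NormalDist().inv_cdf(p)  # standard normal quantile
    return max(min_latency_ms, median_ms * math.exp(sigma * z))


def put_median_ms(base_ms, rate_ms_per_mib, size_bytes):
    """Median PUT latency for an object of the given size."""
    return base_ms + rate_ms_per_mib * size_bytes / (1024 * 1024)


# S3 Standard GET: median 27 ms, sigma 0.62, floor 10 ms
get_p05 = lognormal_percentile(27, 0.62, 10, 0.05)   # hits the 10 ms floor
get_p99 = lognormal_percentile(27, 0.62, 10, 0.99)   # about 114 ms

# S3 Standard PUT at the base latency (object size near zero)
put_p99 = lognormal_percentile(60, 0.29, 10, 0.99)   # about 118 ms

# A 100KiB table metadata write shifts the median by about 2 ms
tm_write_median = put_median_ms(60, 20, 100 * 1024)
```

At the object sizes used here (10KiB and 100KiB), the size term adds at most about 2 ms to the PUT median.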

Multi-table scaling (4a, 4b)

Before measuring real providers, we sweep the number of tables (1-50) and catalog CAS latency (1-120ms) to establish how much multi-table scaling can buy. Experiments 4a (FA-only) and 4b (90/10 FA/VO) use simulated S3 Standard latencies for manifest I/O, with synthetic CAS latencies. Full heatmaps are in Appendix A.

Uniform distribution: tables move the bottleneck to the catalog

With a fast catalog (1-10ms CAS), distributing FA transactions uniformly across more tables lifts success rates at high load, though the 20ms×50-tables corner is no longer a free lunch: 50 tables at 20ms inter-arrival reach 88% success at 1ms CAS, 70% at 10ms CAS. At 100ms+ inter-arrival (≤10 c/s offered), 10+ tables hit 100% success at CAS latencies up to 10ms.

Exp 4a: Heatmap of FA success rate by number of tables (1-50) and inter-arrival scale, 10ms CAS. More tables dramatically improve success: 50 tables achieve 69.5% at 20ms inter-arrival vs 13.9% with 1 table. 10+ tables reach 100% at 100ms+ inter-arrival. Exp 4a: Heatmap of mean commit latency by table count and inter-arrival scale, 10ms CAS. Latency drops with more tables: 50 tables at 20ms is 701ms vs 1443ms for 1 table. Baseline converges to 338-455ms at 5000ms inter-arrival. Hatched cells indicate low success rates.
Exp 4a: FA success rate and latency by table count and inter-arrival time (10ms, 1ms CAS). More tables shift the bottleneck from per-table metadata I/O to the catalog.

At CAS latencies closer to real providers (50-120ms), the frontier tightens. At 50ms CAS, the knee (>95% success rate) sits near ~10 c/s with 20+ tables. At 120ms CAS, the knee drops to ~5 c/s: the catalog round-trip dominates retry cost at every table count.

Exp 4a: Heatmap of FA success rate by table count and inter-arrival scale, 50ms CAS. Worse than 10ms: 1 table at 20ms is 11.5%, 50 tables at 20ms is only 35.1%. Even 20-50 tables need 200ms+ inter-arrival for 99%+ success. Exp 4a: Heatmap of mean commit latency by table count and inter-arrival scale, 50ms CAS. Higher baseline than 10ms: 1 table at 5000ms is 578ms vs 455ms. At 20ms, 1 table reaches 1935ms and 50 tables 1218ms. Nearly all low-arrival cells are hatched.
Exp 4a: FA success rate and latency by table count (50ms, 120ms CAS). At realistic CAS latencies, table count provides less relief.

Adding VO: table partitioning reduces per-table retry cost

Adding 10% VO transactions barely changes FA success rates. VO success improves dramatically with table count. Each VO retry reads a manifest list for each snapshot committed to that table since the read snapshot; with more tables, each table sees fewer commits, reducing the per-table retry cost. At 10ms CAS with 50 tables, VO approaches FA success rates above 100ms inter-arrival; at 50ms CAS, VO and FA success rates converge above 10 tables for moderate loads. At 120ms CAS, the catalog limits both FA and VO equally.
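
A back-of-the-envelope sketch shows why partitioning shrinks VO retry cost. It assumes uniform table selection; the 2-second window below is a hypothetical value, not a simulator parameter.

```python
def expected_same_table_commits(commit_rate_per_s, window_s, num_tables):
    """Expected commits landing on one table during a window.

    Under uniform selection each table receives 1/num_tables of the
    catalog-wide commit stream, so a VO retry has proportionally fewer
    manifest lists to read.
    """
    return commit_rate_per_s * window_s / num_tables


one_table = expected_same_table_commits(10, 2.0, 1)     # 20 snapshots to validate
fifty_tables = expected_same_table_commits(10, 2.0, 50)  # 0.4 on average
```

At the same catalog-wide rate, a VO transaction on one of 50 tables usually finds no new snapshot to validate against.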

Exp 4b: Heatmap of FA success rate (90/10 FA/VO mix) by table count and inter-arrival scale, 10ms CAS. Similar to exp4a FA-only: 50 tables at 20ms is 69.4%, 1 table at 20ms is 13.8%. VO presence barely affects FA success. Exp 4b: Heatmap of VO success rate (90/10 FA/VO mix) by table count and inter-arrival scale, 10ms CAS. VO benefits from table partitioning: 1 table at 20ms is near-zero, 50 tables at 20ms tracks FA closely. 10 tables at 100ms reaches 99%+; more tables nearly eliminate the VO disadvantage. VO converges to FA success rates with enough tables.
Exp 4b: FA and VO success rates and VO latency (90/10 mix, 10ms and 120ms CAS). FA success is nearly identical to 4a; VO success improves dramatically with table count.

Under a Zipfian (α = 1.5) distribution, the probability of selecting the kth-ranked table is proportional to 1/k^1.5. With 10 tables, the rank-1 table absorbs ~50% of writes, rank-2 gets ~18%, rank-3 ~10%, and the distribution falls off steeply; with 50 tables the rank-1 table still takes ~43%. The effective table count tops out at ~4.5 even with 50 physical tables.
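
These shares can be checked with a few lines of Python. One assumption to flag: the post does not define "effective table count", so the sketch uses the inverse Simpson index, 1 / Σ p_k², which happens to land near 4.5 for 50 tables.

```python
def zipf_shares(num_tables, alpha=1.5):
    """Selection probability per rank, proportional to 1 / k**alpha."""
    weights = [1.0 / (k ** alpha) for k in range(1, num_tables + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def effective_table_count(shares):
    """Inverse Simpson index (assumed definition): equals N when uniform."""
    return 1.0 / sum(p * p for p in shares)


shares_10 = zipf_shares(10)
shares_50 = zipf_shares(50)

rank1_at_10 = shares_10[0]            # about 0.50
rank2_at_10 = shares_10[1]            # about 0.18
rank1_at_50 = shares_50[0]            # about 0.43
effective_50 = effective_table_count(shares_50)  # about 4.5
```

Under a uniform distribution the same index returns the physical table count, so it is a convenient way to compare skewed and uniform catalogs.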

Exp 4a: Zipfian FA success rate by table count and inter-arrival, 50ms CAS. Much worse than uniform: 50 tables at 20ms is only 32.3% (vs 35.1% uniform). 10 tables at 100ms is 78.9%. Adding tables beyond 10 barely helps- Zipf 50 tables approximates uniform 5 tables. Exp 4a: Conflict type breakdown by table rank at 50 tables, ias=100ms, Zipf 50ms CAS. Rank-1 table dominates with ~44% of writes and mostly same-table (tblptn) conflicts. Cold tables (rank 10+) have more catalog conflicts than table conflicts.
Exp 4a: Zipfian table selection FA-only (50ms CAS). The rank-1 table dominates, collapsing the benefit of additional tables.

The rank-1 table behaves approximately like a single table at half the global arrival rate, with a small penalty from catalog conflicts. At low load, rank-1 success rates and latencies converge to the single-table baseline; at high load, catalog conflicts from other tables’ writes consume part of the retry budget, degrading success rates below the single-table equivalent.

Under Zipf, 70% of retries are same-table conflicts (requiring manifest I/O), compared to ~2% under uniform distribution with 50 tables. Adding physical tables beyond 10 barely helps: Zipf with 50 tables performs like uniform with ~5 tables.¹ This is unsurprising, given that each additional table takes a diminishing fraction of the workload.

Exp 4b: Zipfian VO success rate (90/10 mix) by table count and inter-arrival, 50ms CAS. VO benefits from table partitioning but less than uniform: 50 tables at 100ms reaches ~75%, 10 tables at 100ms ~68%. 1 table at 100ms is ~47%. The hot table concentrates per-table conflicts, limiting the benefit of additional tables. Exp 4b: Conflict type breakdown by table rank at 50 tables, ias=100ms, Zipf 50ms CAS, 90/10 mix. Rank-1 table dominates with ~4.7 FA table/partition conflicts per transaction and ~2.4 catalog conflicts uniformly across ranks. Cold tables (rank 10+) have mostly catalog conflicts.
Exp 4b: Zipfian table selection 90/10 FA/VO (50ms CAS). The rank-1 table dominates, collapsing the benefit of additional tables.

Adding VO transactions back under the Zipfian distribution, we see a similar effect: the most popular table converges to single-table performance, VO transactions are more sensitive to catalog conflicts (particularly at high load), and sustainable single-table throughput with VO transactions is much lower. Catalog conflicts are evenly distributed across tables, but the most popular table also accumulates per-table conflicts, so VO transactions on it struggle to complete at high load.

Takeaway: Under a uniform distribution, partitioning is effective until the CAS latency becomes the bottleneck. When the distribution is skewed (Zipfian), retries from popular tables have minimal impact on other tables. These results also suggest that catalog and table conflicts should be handled separately by the retry policy. While these simulations assume a steady arrival rate following a distribution, many real workloads burst in a particular table. Exponential backoff for table conflicts and immediate retry for catalog conflicts would be more effective for that workload.

4c. Multiple tables, varied workload ratio, measured CAS distributions

Real provider CAS latencies (22-170ms) fall well above the 1-10ms sweet spot from 4a/4b, so most workloads will operate in the regime where CAS latency limits throughput. Now we substitute the CAS latencies measured for each provider and published unconditional read/write latencies. The full results are in Appendix B.

We’re assigning labels to these distributions matching observations from each provider, but this is still a model of the commit protocol. We’re interested less in hitting the moving target of real provider performance and more in learning from the model: when does CAS latency become the bottleneck? (When) does storage variability (modeled as a lognormal distribution) impact commit success rates?

The synthetic parameter sweeps in 4a/4b varied CAS latency and workload to measure commit success rates/latency holding the provider (S3) constant. Now we want to see how different provider profiles interact with workload mixes and table counts.

Latency distributions for storage providers

Distribution Parameters

GET (unconditional read)

Modeled as Lognormal(mu=ln(median), sigma), floored at min_latency_ms.

| Provider | median (ms) | sigma | min_latency (ms) |
|---|---|---|---|
| S3 Express | 2.5 | 0.57 | 1 |
| S3 Standard | 27 | 0.62 | 10 |
| Azure Premium | 35 | 0.08 | 20 |
| Azure Standard | 38 | 0.66 | 20 |
| GCS | 200 | 0.30 | 80 |

PUT (unconditional write)

Modeled as Lognormal(mu=ln(base + rate * size_MiB), sigma), floored at min_latency_ms.

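| Provider | base (ms) | rate (ms/MiB) | sigma | min_latency (ms) |
|---|---|---|---|---|
| S3 Express | 6.5 | 10 | 0.24 | 1 |
| S3 Standard | 60 | 20 | 0.29 | 10 |
| Azure Premium | 41 | 15 | 0.10 | 20 |
| Azure Standard | 45 | 25 | 0.50 | 20 |
| GCS | 200 | 17 | 0.30 | 80 |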

GET Percentiles

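| Provider | p5 | p10 | p25 | p50 | p75 | p90 | p95 | p99 |
|---|---|---|---|---|---|---|---|---|
| S3 Express | 1 | 1 | 2 | 2 | 4 | 5 | 6 | 9 |
| S3 Standard | 10 | 12 | 18 | 27 | 41 | 60 | 75 | 114 |
| Azure Premium | 31 | 32 | 33 | 35 | 37 | 39 | 40 | 42 |
| Azure Standard | 20 | 20 | 24 | 38 | 59 | 89 | 113 | 176 |
| GCS | 122 | 136 | 163 | 200 | 245 | 294 | 328 | 402 |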

PUT Percentiles

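| Provider | p5 | p10 | p25 | p50 | p75 | p90 | p95 | p99 |
|---|---|---|---|---|---|---|---|---|
| S3 Express | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 |
| S3 Standard | 37 | 42 | 50 | 60 | 73 | 87 | 97 | 118 |
| Azure Premium | 35 | 36 | 39 | 41 | 44 | 47 | 49 | 52 |
| Azure Standard | 20 | 24 | 32 | 45 | 63 | 86 | 103 | 145 |
| GCS | 122 | 136 | 164 | 200 | 245 | 294 | 328 | 402 |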

Provider summary

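| Provider | CAS median (ms) | CAS σ | Read base (ms) | Read σ | Write base (ms) | Write σ | Min latency (ms) |
|---|---|---|---|---|---|---|---|
| S3 Express | 22 | 0.22 | 2.5 | 0.57 | 6.5 | 0.24 | 1 |
| S3 | 61 | 0.14 | 27 | 0.62 | 60 | 0.29 | 10 |
| Azure Premium | 64 | 0.73 | 35 | 0.08 | 41 | 0.10 | 20 |
| Azure | 93 | 0.82 | 38 | 0.66 | 45 | 0.50 | 20 |
| GCP | 170 | 0.91 | 200 | 0.30 | 200 | 0.30 | 80 |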


Single-table Provider Performance

| Provider | 100/0 (c/s) | 100/0 (FA/VO lat) | 90/10 (c/s) | 90/10 (FA/VO lat) | 50/50 (c/s) | 50/50 (FA/VO lat) |
|---|---|---|---|---|---|---|
| s3x | 14.6 | 0.16s / - | 7.4 | 0.11s / 6.7s | 7.4 | 0.11s / 6.7s |
| s3 | 1.8 | 1.00s / - | 1.8 | 1.01s / 20.2s | 1.8 | 0.99s / 20.0s |
| azurex | 2.4 | 1.12s / - | 1.8 | 0.89s / 21.7s | 1.8 | 0.87s / 21.2s |
| azure | 1.8 | 1.50s / - | 1.5 | 1.26s / 22.6s | 1.5 | 1.25s / 23.8s |
| gcp | 0.4 | 4.56s / - | 0.4 | 4.60s / 32.9s | 0.4 | 4.62s / 30.1s |
Provider throughput and mean FA/VO latencies for a single-table where over 95% of VO transactions succeed

In the commit path, I/O latency for metadata is the dominant factor across providers. S3 Express One Zone is in its own class on this workload: in the table above it delivers roughly 4-8x the throughput of the next tier (S3 Standard, Azure Premium, Azure Standard) and 18-36x the throughput of GCS. Its low latency and low variance for both reads and writes compress the entire commit pipeline.

To accommodate 10% VO transactions, even S3x requires a 2x reduction in throughput to keep success rates above 95%.

Exp 4c: Single-table 90/10 FA/VO provider metrics. Four panels show FA success rate, VO success rate, FA mean latency, and VO mean latency vs inter-arrival scale. S3x sustains FA at 100% from ~100ms; S3, Azure Premium, and Azure Standard reach ~99% only near 500ms; GCP does not reach 99% until ~2000ms. VO success drops much earlier than FA for all providers.
Single-table provider performance in 90/10 FA/VO workloads

Multiple tables

With uniform table selection, partitioned workloads fall into three performance tiers:

s3x (14.9 c/s at 10 tables, up to 36 c/s at 20+ tables) » s3 / azurex / azure (3.7 c/s) » gcp (0.7 c/s simulated, ~0.7 actual)

For S3, Azure Premium, and Azure Standard the usable rate plateaus at 3.7 c/s from 5 tables through 20 tables. The per-table ceiling at S3 median latencies is ~5.7 c/s (five serial S3 ops per attempt), so the aggregate knee lines up with the catalog-CAS rate taking over only when per-table offered load drops far enough. That happens at 50 tables for S3 Standard FA-only, where the knee jumps to 7.2 c/s — but adding even 10% VO pulls it back down to 3.7 c/s. Azure Premium and Azure Standard stay at 3.7 c/s all the way through 50 tables.

This tier is narrower than it looks. Adding tables beyond ~5 buys little for S3/Azure on mixed workloads: the per-table metadata pipeline is the binding constraint, not the catalog round-trip.
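
The ~5.7 c/s per-table ceiling can be approximated from the S3 Standard medians. This sketch rests on an inference from the published numbers, not on the simulator's definition: it assumes the bound counts only the four serial metadata operations (TM_read, ML_read, ML_write, TM_write) and leaves the CAS to the catalog.

```python
MIB = 1024 * 1024


def put_median_ms(base_ms, rate_ms_per_mib, size_bytes):
    """Median PUT latency: base plus a per-MiB transfer term."""
    return base_ms + rate_ms_per_mib * size_bytes / MIB


# S3 Standard medians from the latency tables in this post
get_ms = 27.0                                   # table metadata or manifest list read
ml_write_ms = put_median_ms(60, 20, 10 * 1024)   # 10KiB manifest list
tm_write_ms = put_median_ms(60, 20, 100 * 1024)  # 100KiB table metadata
cas_ms = 61.0                                   # catalog CAS median

# Assumed per-table bound: serial metadata I/O only, CAS excluded.
metadata_ms = get_ms + get_ms + ml_write_ms + tm_write_ms
per_table_ceiling = 1000.0 / metadata_ms        # commits/sec

# Full attempt including the CAS, for comparison.
full_attempt_ms = metadata_ms + cas_ms
```

Under that assumption the metadata I/O takes about 176 ms per attempt, which gives roughly 5.7 c/s; the full five-operation attempt is about 237 ms at the medians.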

Outliers: S3 Express One Zone (S3x) and GCP

S3x benefits from partitioning immediately and keeps benefiting. Its single-table knee is 14.6 c/s FA-only (7.4 c/s mixed); with 2+ tables mixed workloads also sustain 14.9 c/s. Beyond 20 tables FA-only, cross-table CAS contention still dominates but each failure costs only a re-CAS (manifest I/O is skipped for disjoint-table conflicts), so the usable FA-only rate jumps to ~36 c/s at 20–50 tables.

The simulated GCP rate is 0.7 c/s at 50 tables, up from 0.4 c/s at 1 table — a much weaker curve than the other providers. GCS’s per-op latency is high enough (~200 ms GET/PUT, ~170 ms CAS median) that each attempt costs roughly 590 ms even with a fast catalog, so per-table throughput caps at ~1.7 c/s and the catalog CAS caps the aggregate at ~8.5 c/s. The real product is much lower: GCS measured CAS throughput saturates at 0.8–1.4 op/s before any commit-protocol overhead.

Main tier: S3 Standard, Azure Premium, Azure Standard

The other three stores are more interesting. S3 and Azure Premium have similar CAS medians (61 vs 64ms), but Azure Premium’s CAS sigma is 5x larger (0.73 vs 0.14). Azure Standard is worse on both axes: higher median (93ms) and higher sigma (0.82).

| Provider | CAS median | CAS σ | Read σ | Write σ |
|---|---|---|---|---|
| S3 | 61ms | 0.14 | 0.62 | 0.29 |
| Azure Premium | 64ms | 0.73 | 0.08 | 0.10 |
| Azure | 93ms | 0.82 | 0.66 | 0.50 |
Provider lognormal distribution parameters

In this model, both Azure Premium and Standard have higher CAS variance, but Premium has very tight read/write variance; its I/O is predictable even if the CAS is noisy. At 10 tables all three stores land at the same 3.7 c/s knee: they are all bound by the per-table metadata pipeline, not the catalog CAS. Azure Standard’s wider I/O variance shows up in FA mean latency (1.33s vs 0.82s for S3 and Azure Premium) and VO mean latency (6.8s vs 4.7-4.8s).

| Provider | 100/0 (c/s) | 100/0 (FA/VO lat) | 90/10 (c/s) | 90/10 (FA/VO lat) | 50/50 (c/s) | 50/50 (FA/VO lat) |
|---|---|---|---|---|---|---|
| s3x | 14.9 | 0.11s / - | 14.9 | 0.12s / 1.4s | 14.9 | 0.11s / 1.4s |
| s3 | 3.7 | 0.82s / - | 3.8 | 0.82s / 4.7s | 3.7 | 0.82s / 4.6s |
| azurex | 3.7 | 0.82s / - | 3.7 | 0.82s / 4.8s | 3.7 | 0.81s / 4.8s |
| azure | 3.7 | 1.33s / - | 3.7 | 1.33s / 6.8s | 3.7 | 1.32s / 6.8s |
| gcp | 0.7 | 3.99s / - | 0.7 | 4.01s / 8.6s | 0.7 | 3.99s / 8.8s |
Provider throughput and mean FA/VO latencies for 10 tables where over 95% of VO transactions succeed
Exp 4c: Ten-table 90/10 FA/VO provider metrics. Four panels show FA success rate, VO success rate, FA mean latency, and VO mean latency vs inter-arrival scale. VO success rates improve dramatically vs single-table: S3x reaches 100% VO at 50ms, S3 and Azure Premium reach ~93-99% VO at 100ms. VO mean latency drops significantly with table partitioning. GCP remains worst but also benefits.
Ten-table provider performance in 90/10 FA/VO workloads

Put another way: when Azure Standard retries take 5-10x the median, they’re almost certainly waste. In settings with high variance, commit protocols need to minimize how often a retry attempt samples from a fat-tailed distribution. Azure Standard’s higher variance shows up in FA mean latency (~1.33s vs S3/Azure Premium ~0.82s) and in VO mean latency rather than in the sustainable rate, which lands on the per-table bound at 3.7 c/s just like S3 and Azure Premium.

Takeaway: S3x is in a different class for catalog-as-file workloads. Distributing load across tables is effective, but store variance can drive failure rates up even at low arrival rates.

Conclusion

We tested two levers for improving commit throughput under contention: adding tables and choosing a faster storage provider. Both help, but the gap between them widened with the corrected per-attempt I/O cost.

Provider choice matters more than table count. S3 Express sustains 14.6 c/s (FA-only) on a single table — more per-table throughput than spreading the same load across 20 tables on S3 Standard (3.7 c/s aggregate, ~0.18 c/s per table). Fast storage compresses the entire commit pipeline (CAS + manifest I/O), while adding tables only relieves catalog contention. For providers in the 1–2 c/s single-table tier (S3, Azure Premium, Azure Standard), 5+ tables yield ~2x aggregate scaling to 3.7 c/s and stall there; only at 50 tables does S3 Standard’s FA-only knee climb to 7.2 c/s.

Table partitioning reduces VO retry cost. Each VO retry reads manifest lists proportional to snapshots committed to that table since the read snapshot. With more tables, each table sees fewer commits, and VO tail latency drops accordingly. For S3 Standard 90/10 mix, VO mean latency falls from 20.2s at 1 table to 4.7s at 10 tables and 2.6s at 20 tables. S3 Express drops from 6.7s to 0.8s over the same range. Under Zipfian skew, the hot table still converges to single-table performance, but catalog contention does not impact the less popular tables.

The protocol still matters at single-table scale. On a single table, VO mean latency reaches tens of seconds at moderate throughput regardless of provider (S3: 20.2s, Azure Premium: 21.7s, Azure Standard: 22.6s, GCP: 32.9s). S3 Express’s 6.7s single-table VO mean is the lowest of any provider, though still impractical for most production workloads.

Take the simulated provider experiments with a grain of salt: the labels we’re putting on the storage distributions are drawn from measurements of real systems, but these parameters do not completely describe reality. These are optimistic models of provider performance. It is unlikely that real workloads could sustain these rates without external coordination.

One practical takeaway: catalog and table conflicts should be handled separately by the retry policy. Catalog conflicts are cheap to retry (re-read the catalog, re-apply the CAS) while table conflicts require re-reading manifest lists. Immediate retry for catalog conflicts and exponential backoff for table conflicts would better match the cost structure.
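
A minimal sketch of such a split retry policy follows. The function name, conflict labels, base delay, and cap are illustrative assumptions rather than values from the simulator; jitter could be layered on top of the deterministic schedule shown here.

```python
def retry_delay_ms(conflict, attempt, base_ms=50.0, cap_ms=5000.0):
    """Delay before the next attempt, chosen by conflict type.

    conflict: "catalog" (another table's commit moved the catalog) or
              "table" (a commit landed on the same table).
    attempt:  1-based count of failed attempts so far.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    if conflict == "catalog":
        # Cheap to repair: re-read the catalog and re-apply the CAS right away.
        return 0.0
    if conflict == "table":
        # Costly to repair: manifest lists must be re-read, so back off
        # exponentially (50, 100, 200, ... ms) up to the cap.
        return min(cap_ms, base_ms * (2 ** (attempt - 1)))
    raise ValueError(f"unknown conflict type: {conflict}")
```

Separating the two cases keeps cold tables from waiting behind a hot table's backoff schedule, since their failures are almost always catalog conflicts.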

More broadly, these simulations may be sufficient to indict the commit protocol. Writing copy-on-write objects to storage on every commit attempt is a self-imposed obstacle to scaling commit throughput, more than writing to the same catalog object in every commit.

Correction 2026-06-09

The numbers in this post were regenerated on 2026-04-17 after finding some simulator bugs, listed below. The qualitative conclusions survive (the tier ordering and the “partitioning helps” story are intact), but the multi-table knees moved substantially, in both directions: the 5-20 table knee for S3/Azure drops by half, GCP scaling is much weaker than presented, and S3x at 20+ tables jumps to roughly 2.5x the published rate.

  • Per-attempt I/O cost corrected. Each commit attempt now issues five S3 round-trips (TM_read + ML_read + ML_write + TM_write + CAS) instead of three. The per-table ceiling at S3 medians tightens from ~7.7 c/s to 5.7 c/s. For S3/Azure at 5–20 tables the knee drops from ~7.4 to 3.7 c/s: the per-table bound bites before the catalog CAS does. Only at 50 tables does catalog CAS again become the bottleneck (for S3 FA-only).
  • Cross-table CAS free retry. Multi-table CAS failures caused by commits on different tables no longer charge manifest I/O, and only charge for the CAS conflict. This is why S3 Express at 20+ tables FA-only jumps from 14.9 to ~36 c/s: its CAS throughput was already the binding constraint, and removing the spurious manifest I/O lets more commits through.
  • Timing leaks removed. The CAS version check and catalog.read() both split-yield at half-RTT, eliminating a fast path where clients could commit against state propagating faster than the full CAS round trip.
  • VO convoy decomposed per-table. The IO convoy was using a global value and ignored the per-table configuration. Corrected, multi-table VO tails are lower.
  • GCP scaling is weaker than presented. With the tighter per-table bound, 50 tables only reach 0.7 c/s for GCP (was 3.6 c/s in the old model).
  • table_metadata_inlined config drift fixed. An orphaned template flag had silently switched intermediate re-runs to the 1/(3L) inlined bound; the post now uses non-inlined metadata, matching the original intent.

The tier ordering (S3x » S3/Azure Premium/Azure Standard » GCP) is unchanged, which makes sense given the measured CAS latencies. Adding tables improves scalability, but less than the original numbers implied, especially at 5–20 tables.

One presentational change: VO latency figures in this post are now reported as the mean, where the previous version cited VO P99. With the per-table convoy fix, the mean is a more stable summary of the distribution.

Full correction details (AI generated)

Single-table knees at >95% VO success:

| Provider | Published FA-only | Corrected FA-only | Published 90/10 | Corrected 90/10 |
|---|---|---|---|---|
| S3 Express | 14.6 c/s | 14.6 c/s | 7.5 c/s | 7.4 c/s |
| S3 Standard | 2.4 c/s | 1.8 c/s | 1.8 c/s | 1.8 c/s |
| Azure Premium | 2.5 c/s | 2.4 c/s | 1.9 c/s | 1.8 c/s |
| Azure Standard | 2.4 c/s | 1.8 c/s | 1.5 c/s | 1.5 c/s |
| GCP | 0.7 c/s | 0.4 c/s | 0.4 c/s | 0.4 c/s |

The biggest structural change is the multi-table FA-only knee for S3 Standard and Azure Premium at 5–20 tables:

| Provider | Tables | Published | Corrected | Δ |
|---|---|---|---|---|
| S3 Standard | 5–20 | 7.2–7.4 c/s | 3.7 c/s | −50% |
| S3 Standard | 50 | 7.4 c/s | 7.2 c/s | −3% |
| Azure Premium | 5–20 | 7.2–7.4 c/s | 3.7 c/s | −50% |
| Azure Premium | 50 | 7.4 c/s | 3.7 c/s | −50% |
| Azure Standard | 10–50 | 3.7 c/s | 3.7 c/s | |
| GCP | 50 | 3.6 c/s | 0.7 c/s | −81% |
| S3 Express | 50 | 14.9 c/s | 36.0 c/s | +142% (free-retry for cross-table CAS failures) |

For S3/Azure at 5–20 tables the published numbers were catalog-CAS-bound at ~7.4 c/s; with the non-inlined per-table ceiling now at 5.7 c/s, the per-table bound binds first and the knee flattens to 3.7 c/s. Only at 50 tables does catalog CAS again become the bottleneck (for S3 Standard FA-only; Azure Premium sits at the per-table bound through 50 tables).

S3 Express at 50 tables jumps to 36 c/s because its ~10 ms per-op latency keeps the per-table bound very high; with the ec383ff free-retry fix, cross-table CAS failures no longer charge manifest I/O, so the catalog handles more commits.

GCP’s multi-table scaling is weaker than presented: 0.4 c/s (1 table) to 0.7 c/s (50 tables) FA-only, versus the published 0.7 c/s to 3.6 c/s. The per-op latency is high enough that the per-table bound binds through 50 tables.


Appendix A: Full 4a/4b results

Full heatmaps for experiments 4a (FA-only) and 4b (90/10 FA/VO mix) across table counts, inter-arrival times, and catalog CAS latencies. Both uniform and Zipfian table selection distributions are included.

4a: FA-only, uniform (10ms, 1ms CAS)
Exp 4a: FA success rate and latency by table count and inter-arrival time (10ms, 1ms CAS).
4a: FA-only, uniform (50ms, 120ms CAS)
Exp 4a: FA success rate and latency by table count (50ms, 120ms CAS).
4a: FA-only, Zipfian (50ms CAS)
Exp 4a: FA success rate and latency with Zipfian table selection (50ms CAS).
4b: 90/10 FA/VO mix, uniform (10ms, 1ms CAS)
Exp 4b: FA and VO success rates (90/10 mix, 10ms and 1ms CAS).
4b: 90/10 FA/VO mix, uniform (50ms, 120ms CAS)
Exp 4b: Heatmap of FA success rate (90/10 mix) by table count and inter-arrival scale, 50ms CAS. 1 table at 20ms is 11.4%, 50 tables at 20ms is 35.1%. At 100ms, 50 tables reach 95.5%, 1 table is 47.0%. Exp 4b: Heatmap of VO success rate (90/10 mix) by table count and inter-arrival, 50ms CAS. VO improves substantially with table count: 50 tables at 100ms approaches ~95%, 10 tables at 50ms ~69%. 1 table at 100ms is ~47%, 50 tables at 20ms ~35%. FA and VO converge at high table counts.
Exp 4b: FA and VO success rates (90/10 mix, 50ms and 120ms CAS).
4b: 90/10 FA/VO mix, Zipfian (50ms CAS)
Exp 4b: Zipfian VO success rate (90/10 mix) by table count and inter-arrival, 50ms CAS. VO benefits from table partitioning but less than uniform: 50 tables at 100ms reaches ~75%, 10 tables at 100ms ~68%. 1 table at 100ms is ~47%. The hot table concentrates per-table conflicts, limiting the benefit of additional tables.
Exp 4b: Conflict type breakdown by table rank at 50 tables, ias=100ms, Zipf 50ms CAS, 90/10 mix. Rank-1 table dominates with ~4.7 FA table/partition conflicts per transaction and ~2.4 catalog conflicts uniformly across ranks. Cold tables (rank 10+) have mostly catalog conflicts.
Exp 4b: VO success rate and conflict type distribution with Zipfian table selection (50ms CAS).


Appendix B: Full 4c results

Galleries of success rate and latency heatmaps for all 5 storage providers, across all table counts, inter-arrival times, and workload mixes. Click an image to open its gallery and flip through the plots.

S3
Exp 4c: S3 FA=100% success rate. 1 table at 20ms is 11.6%, 50 tables is 37.1%. Reaches 100% by 500ms for 2+ tables. Very similar profile to Azure Standard.
Exp 4c: S3 FA=90% FastAppend success rate. 1 table at 20ms is 12.6%, 50 tables is 37.2%. At 100ms, 50 tables reach 96.1%. Similar to Azure at high table counts.
Exp 4c: S3 Standard success rates. Heatmaps for FA=100%, FA=90% (FA and VO), and FA=50% (FA and VO) across table counts and inter-arrival times.
Exp 4c: S3 FA=90% FastAppend mean latency. 1 table at 5000ms is ~713ms, 50 tables at 5000ms is ~484ms. At 20ms inter-arrival, latencies range ~1482-1899ms for 20-50 tables. Hatched cells cover the left portion.
Exp 4c: S3 FA=90% ValidatedOverwrite mean latency. VO latency higher than FA but benefits from table partitioning: at 5000ms, ranges from ~713ms (1 table) down to ~484ms (50 tables). Hatched cells in the left region indicate low success rates.
Exp 4c: S3 Standard commit latency. FA/VO mean latency heatmaps for FA=90% and FA=50% mixes. Hatched cells indicate low success rates.
S3 Express One Zone
Exp 4c: S3 Express FA=100% success rate. Dramatically better than all other providers. 1 table at 20ms is 66.1%, 50 tables is 96.9%. The only degradation is at 20ms; 50ms+ is 98%+ everywhere.
Exp 4c: S3 Express FA=90% FastAppend success rate. Nearly perfect: 1 table at 20ms is 65.4%, 50 tables at 20ms is 96.9%. At 50ms+, all configurations reach 97.7%+. Only the 20ms column shows any degradation.
Exp 4c: S3 Express success rates. Heatmaps for FA=100%, FA=90% (FA and VO), and FA=50% (FA and VO) across table counts and inter-arrival times.
Exp 4c: S3 Express FA=90% FastAppend mean latency. Very low: 1 table at 20ms is ~1487ms, 50 tables is ~227ms. At 5000ms, baseline is 79-104ms. Only 1-2 tables at 20ms show hatching.
Exp 4c: S3 Express FA=90% ValidatedOverwrite mean latency. VO latency benefits from table partitioning: 50 tables at 20ms is ~227ms, 1 table at 20ms is ~1487ms. At 5000ms, ranges 79-104ms. Low CAS latency helps VO when combined with multiple tables.
Exp 4c: S3 Express commit latency. FA/VO mean latency heatmaps for FA=90% and FA=50% mixes. Hatched cells indicate low success rates.
Azure Standard
Exp 4c: Azure FA=100% success rate by table count and inter-arrival. 1 table at 20ms is 11.1%, 50 tables at 20ms is 29.9%. Reaches 100% by 500ms for 5+ tables. Similar profile to S3 Standard.
Exp 4c: Azure FA=90% FastAppend success rate. 1 table at 20ms is 12.0%, 50 tables is 30.0%. Very similar to FA=100%; FA success is insensitive to 10% VO in the mix.
Exp 4c: Azure Standard success rates. Heatmaps for FA=100%, FA=90% (FA and VO), and FA=50% (FA and VO) across table counts and inter-arrival times.
Exp 4c: Azure FA=90% FastAppend mean latency. 1 table at 5000ms is ~1037ms, 50 tables at 5000ms is ~670ms. At 20ms, FA latency for 1-50 tables ranges ~2106-2961ms. Hatched cells cover the left half.
Exp 4c: Azure FA=90% ValidatedOverwrite mean latency. VO latency higher than FA: at 5000ms, it starts at ~1037ms (1 table) and falls as tables are added. At high load, 1 table latency reaches tens of seconds. Hatched cells cover the left region.
Exp 4c: Azure Standard commit latency. FA/VO mean latency heatmaps for FA=90% and FA=50% mixes. Hatched cells indicate low success rates.
Azure Premium
Exp 4c: Azure Premium FA=100% success rate. Better than Azure Standard: 1 table at 20ms is 11.2%, 50 tables is 31.6%. Reaches 100% by 500ms for 2+ tables. 1 table at 100ms is 49.0% vs Azure Standard's 44.1%.
Exp 4c: Azure Premium FA=90% FastAppend success rate. 1 table at 20ms is 12.4%, 50 tables is 31.7%. At 100ms, 50 tables is 93.7%. Noticeably better than Azure Standard FA=90% at high table counts.
Exp 4c: Azure Premium success rates. Heatmaps for FA=100%, FA=90% (FA and VO), and FA=50% (FA and VO) across table counts and inter-arrival times.
Exp 4c: Azure Premium FA=90% FastAppend mean latency. 1 table at 5000ms is ~728ms, 50 tables at 5000ms is ~471ms. At 20ms, 50 tables is ~1515ms. Hatched cells cover the left portion.
Exp 4c: Azure Premium FA=90% ValidatedOverwrite mean latency. VO latency higher than FA but benefits from table partitioning: at 5000ms, ranges from ~728ms (1 table) down to ~471ms (50 tables). Hatched cells cover the left region.
Exp 4c: Azure Premium commit latency. FA/VO mean latency heatmaps for FA=90% and FA=50% mixes. Hatched cells indicate low success rates.
Google Cloud Storage (GCS)
Exp 4c: GCS FA=100% success rate. Worst-performing provider. 1 table at 20ms is 2.5%, 50 tables is 7.7%. Does not reach 100% until 2000ms inter-arrival for 1-2 tables. Degradation extends much further right than other providers.
Exp 4c: GCS FA=90% FastAppend success rate. Much worse than all other providers. 1 table is 2.8% at 20ms, 23.6% at 200ms, and 50.7% at 500ms. 50 tables at 20ms is 7.7%.
Exp 4c: GCS success rates. Heatmaps for FA=100%, FA=90% (FA and VO), and FA=50% (FA and VO) across table counts and inter-arrival times.
Exp 4c: GCS FA=90% FastAppend mean latency. Very high: 1 table at 5000ms is ~4141ms, 50 tables at 5000ms is ~2352ms. GCS's high base CAS latency inflates all commit latencies. Hatched cells cover most of the left region.
Exp 4c: GCS FA=90% ValidatedOverwrite mean latency. Very high due to GCS's high base CAS latency. At 5000ms, ranges ~2352-4141ms across table counts. At high load, 50 tables at 20ms reaches ~6632ms. Most cells are hatched.
Exp 4c: GCS commit latency. FA/VO mean latency heatmaps for FA=90% and FA=50% mixes. Hatched cells indicate low success rates.
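The heatmaps above index load by inter-arrival time, while the summary figures are quoted in commits per second. A minimal sketch of the conversion follows, assuming the inter-arrival value is the mean gap between commit attempts; the simulator may define its "scale" differently. The function names are hypothetical, and the sample cells are the GCS FA=90% single-table success rates quoted in the caption above. The product of offered rate and success rate is only a rough count of commits that land per second. It is not the post's "usable throughput", which additionally requires a success rate above 95%.

```python
def offered_rate(inter_arrival_ms: float) -> float:
    """Offered load in commits/sec.

    Assumption: inter_arrival_ms is the mean time between commit attempts.
    """
    if inter_arrival_ms <= 0:
        raise ValueError("inter-arrival time must be positive")
    return 1000.0 / inter_arrival_ms


def successful_rate(inter_arrival_ms: float, success_pct: float) -> float:
    """Rough rate of commits that succeed: offered load times success fraction."""
    return offered_rate(inter_arrival_ms) * success_pct / 100.0


# GCS, FA=90% mix, 1 table: inter-arrival (ms) -> FA success rate (%),
# copied from the caption text above.
GCS_FA90_ONE_TABLE = {20: 2.8, 200: 23.6, 500: 50.7}

if __name__ == "__main__":
    for ias_ms, pct in GCS_FA90_ONE_TABLE.items():
        print(
            f"{ias_ms:>4} ms: offered {offered_rate(ias_ms):5.1f} c/s, "
            f"succeeding {successful_rate(ias_ms, pct):.2f} c/s"
        )
```

Read this way, the 20ms column of every heatmap corresponds to an offered load of 50 commits per second and the 5000ms column to 0.2 commits per second.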
  1. The full set of plots for these simulations is here