Conditional Operations in Object Stores

tl;dr

Since S3 introduced strong(er) consistency in 2020 and conditional writes in late 2024, table formats like Apache Iceberg can assume stronger guarantees from object storage. Stricter consistency guarantees were already supported by Azure and GCS1. How do conditional operations perform under load?

We benchmark the atomic/conditional primitives that object stores provide across AWS S3 and S3 Express One Zone, Azure Blob (Standard and Premium), and Google Cloud Storage.

| Object Store     | Compare-and-Set (op/s) | CAS Latency (ms) | Append (op/s) | Append Latency (ms)  |
|------------------|------------------------|------------------|---------------|----------------------|
| GCS              | 0.8-1.4                | 500-700          | n/a           | n/a                  |
| Azure (Standard) | 9.2-10.2               | 100-600          | 9.2-10.2      | 90-100               |
| Azure (Premium)  | 15.3-16.6              | 60-350           | 11.1-11.4     | 70-74                |
| S3               | 14.7-15.3              | 58-68            | n/a           | n/a                  |
| S3X              | 68.6-75.3              | 14-26            | 71.0-94.1     | 14-24 (9800-10400)*  |

*S3 Express One Zone (S3X) lazily composes appends on read; see details below.

See the graphs for detailed throughput and latency analysis.

Conditional operations are useful building blocks for coordination through storage, but object stores are not designed to be consensus systems2. Storage-only coordination is limited to low throughput workloads without an intermediary service to batch and order operations.


Conditions for Coordination in Storage

I just dropped in to see what condition my condition was in

The ostensible Future of Data Systems is a converged data lake. Vendors build toward this vision in path-dependent ways, but table formats like Apache Iceberg are emerging as a common substrate for interoperability with limited (or no) coordination between systems outside of storage.

In the 2010s, we pretended that eventual consistency was usable because it was cheap. Iceberg drew an elegant partition between its strongly-consistent Catalog state and the eventually-consistent object store holding data and metadata files. Transactions following a pointer from the catalog accessed a fixed, immutable set of objects; the only eventual state to reconcile was an object’s transition from absent to present. From an application perspective, this transforms a consistency problem (reasoning about which updates have been applied) into an availability problem (parts of a table snapshot might be temporarily missing).

Modern object stores provide stronger consistency guarantees and atomic operations over objects. So… is any of this useful to the design of table formats? Let’s measure the practical limits of conditional/atomic operations in object stores. They are the building block for linearizable swaps and commit protocols in an open, storage‑only catalog3.

Conditional Operations

Skip to measured results if you don’t want to read a primer on conditional operations.

In a nutshell, clients updating an object provide a token from data they read earlier to the object store, which rejects that write if the token doesn’t match the current state. In a table format, the application coordinating the transaction is saying, “I checked the table in state x. Only apply my update if the table is still in state x; otherwise leave the state of the table unchanged so I can retry, fail, or wander off forever.”
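
As a minimal sketch of that contract in Java (the `ConditionalObjectStore` interface and `TableCommitter` are hypothetical stand-ins, not any cloud's SDK; real stores expose the token as an etag or generation number):

```java
import java.util.Optional;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for a store with conditional writes. */
interface ConditionalObjectStore {
    Versioned read(String key);                                   // value plus its version token
    boolean putIfMatches(String key, byte[] value, String token); // apply only if token is still current

    record Versioned(byte[] value, String token) {}
}

final class TableCommitter {
    private final ConditionalObjectStore store;
    TableCommitter(ConditionalObjectStore store) { this.store = store; }

    /** "I checked the table in state x; apply my update only if it is still in state x." */
    Optional<byte[]> tryCommit(String tableKey, UnaryOperator<byte[]> update) {
        ConditionalObjectStore.Versioned current = store.read(tableKey); // observe state x and its token
        byte[] next = update.apply(current.value());                     // prepare the update off to the side
        return store.putIfMatches(tableKey, next, current.token())
                ? Optional.of(next)                                      // installed
                : Optional.empty();                                      // rejected: retry, fail, or wander off
    }
}
```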

“Wandering off forever” is particularly important for table formats, since concurrent transaction processors could have wildly different resource constraints. The bloated corpse of a large, failed transaction could prevent writers with fewer resources from making progress if they first need to clean up the failed transaction.

Consider two typical Iceberg clients: Florence, a Spark job enriching tens of terabytes of table data across hundreds of nodes, and Bob, inserting kilobytes of data from a streaming source. If Florence wanders off, nobody thought to provision Bob with the resources to undo/redo even a fraction of Florence’s work. Installing updates with a single, conditional operation avoids any dependencies between uncommitted transactions that might entangle their resources. It’s a key idea in Iceberg’s design4: all work preparing a transaction is completed outside the table before installing it at commit.

Three widely supported conditional operations could enable linearizable updates from uncoordinated writers to a single object.

If-absent

This is probably the most widely available conditional operation. Basically, “write this object only if it does not already exist”. With a naming convention for successors, one can order updates by probing for the latest version and writing the next version in the sequence. Obvious example: if the transaction depends on objname.x, then attempt to write objname.(x+1). Some implementations store a recent-ish version that clients race to blindly overwrite after a successful commit, to provide a hint for the “tail” of the log.5
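
A minimal sketch of that probe-and-install protocol, again against a hypothetical store interface; the hinted version models the blindly-overwritten tail hint, and each probe costs a round trip:

```java
/** Hypothetical if-absent log; not any cloud SDK. */
final class IfAbsentLog {
    interface Store {
        boolean exists(String key);
        boolean putIfAbsent(String key, byte[] value);  // create only if the object is absent
    }

    private final Store store;
    IfAbsentLog(Store store) { this.store = store; }

    /** Probe forward from a hinted version to find the current tail (one round trip per probe). */
    long findTail(String name, long hintedVersion) {
        long v = hintedVersion;
        while (store.exists(name + "." + (v + 1))) v++;
        return v;
    }

    /** Try to install the successor of version tail; returns false if another writer won the race. */
    boolean tryInstall(String name, long tail, byte[] value) {
        return store.putIfAbsent(name + "." + (tail + 1), value);
    }
}
```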

I didn’t evaluate this operation as its performance under the synthetic benchmark would be unrealistically pessimistic. Without retries or backoff, each writer would require multiple round trips to discover the current version before attempting its write. The benchmark would tell us less about the capabilities of the object store than the efficiency of the tail-locating protocol.

If-match

Every object is associated with a nonce that uniquely identifies the object at that location. The implementation of a compare-and-set (CAS) is straightforward: keep the nonce (an entity tag, or etag, in S3 and Azure; a generation number in GCS) when the object is read, and provide it on the write to ensure the correct version of that object is overwritten6. The write must also carry a checksum to protect against partial writes (e.g., closing the stream while unwinding the stack could otherwise replace the old version with a partially-written object).

The read-modify-overwrite loop makes this approach impractical as the object gets larger; the read costs get too high. Moreover, not all stores guarantee that concurrent readers can even complete the object read if it is interrupted by an overwrite. Readers should be protected from torn reads (i.e., reading a prefix of one version of the object and the suffix of its successor), but I’ve heard of widely used storage systems that failed to provide this basic protection.
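
To make the earlier sketch concrete, here is roughly how the read-modify-overwrite loop looks against S3 with the AWS SDK for Java v2. This assumes an SDK version recent enough to expose S3's conditional-write headers on PutObject (the `ifMatch` builder method); treat the exact method names as assumptions rather than a definitive reference.

```java
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.model.S3Exception;

import java.util.function.UnaryOperator;

final class S3Cas {
    private final S3Client s3;
    S3Cas(S3Client s3) { this.s3 = s3; }

    /** Read, modify, and conditionally overwrite; returns false if another writer got there first. */
    boolean readModifyWrite(String bucket, String key, UnaryOperator<byte[]> update) {
        ResponseBytes<GetObjectResponse> current = s3.getObjectAsBytes(
                GetObjectRequest.builder().bucket(bucket).key(key).build());
        String etag = current.response().eTag();              // the nonce for this version

        byte[] next = update.apply(current.asByteArray());
        try {
            s3.putObject(PutObjectRequest.builder()
                            .bucket(bucket).key(key)
                            .ifMatch(etag)                    // assumed builder method for S3's If-Match header
                            .build(),
                    RequestBody.fromBytes(next));             // fixes the content length; recent SDKs also attach a checksum
            return true;
        } catch (S3Exception e) {
            if (e.statusCode() == 412) return false;          // 412 Precondition Failed: lost the race
            throw e;
        }
    }
}
```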

Atomic Append

Object stores usually warehouse immutable objects, but Azure Blob Storage and S3 Express One Zone (S3X) support appending data to an existing object7. Both implement a conditional, position-based append: the client reads the current length of the object and attempts to write its data to that offset. As long as it’s the same base object, an append that fails the position check can be retried with the new length, avoiding the costs of reading and merging required by a compare-and-set.

Contrast this with POSIX O_APPEND or GFS/KFS/QFS record append. When the storage service owns the ordering of atomic appends, clients don’t need to know where the tail is when they append to an object. The service will fill holes if clients are slow or fail (avoiding partial writes) and assemble a contiguous object from the ragged tail of appends. Neither Azure nor S3X implement record append.

Instead, both Azure and S3X supply metadata for the clients to order their own operations. Conditional appends must match both the position and etag for the base object and include a checksum to ensure the full write succeeds. Limits on the size and number of appends per object apply: S3X allows up to 10,000 appends up to 5GiB each; Azure allows 50,000 appends up to 4MiB each.
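
A sketch of the retry-at-new-length loop against an Azure append blob with the Azure Storage Java SDK. I am writing the method and error-code names from memory (`appendBlockWithResponse`, `setAppendPosition`, `APPEND_POSITION_CONDITION_NOT_MET`), so treat them as assumptions; the etag match on the base object and the content checksum are omitted for brevity.

```java
import com.azure.storage.blob.models.AppendBlobRequestConditions;
import com.azure.storage.blob.models.BlobErrorCode;
import com.azure.storage.blob.models.BlobStorageException;
import com.azure.storage.blob.specialized.AppendBlobClient;

import java.io.ByteArrayInputStream;

final class PositionAppend {
    /** Append at expectedOffset, retrying at the reported tail if another writer got there first. */
    static void appendAt(AppendBlobClient blob, byte[] data, long expectedOffset) {
        long offset = expectedOffset;
        while (true) {
            AppendBlobRequestConditions cond = new AppendBlobRequestConditions()
                    .setAppendPosition(offset);                       // fail unless the tail is still at offset
            try {
                blob.appendBlockWithResponse(new ByteArrayInputStream(data), data.length,
                        null /* contentMd5 */, cond, null /* timeout */, null /* context */);
                return;
            } catch (BlobStorageException e) {
                if (e.getErrorCode() != BlobErrorCode.APPEND_POSITION_CONDITION_NOT_MET) {
                    throw e;                                          // some other failure
                }
                offset = blob.getProperties().getBlobSize();          // learn the new tail and retry there
            }
        }
    }
}
```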

Delegating append ordering to the client simplifies everything except the writer. Server side: replication and concurrent reads at the tail are much simpler if writes that are contiguous in the object arrive in order. By contrast, in an implementation of record append, slow or failed appends can’t be visible to external clients, so replicas need to handle cases where they disagree on the presence or content of a particular record, not only during recovery but in normal operation. Client side: readers of a record-append log also need to handle duplicate records. If a write times out, the writer can’t be certain whether its append was applied; if it retries, two copies of the record could land at distant offsets.

Acknowledging the implementation complexity, the storage service can reorder writes more efficiently than independent clients. As we’ll see in benchmarks, position append avoids the read and merge costs of CAS, but oblivious coordination across writers will limit its throughput.

Hereafter, I’ll refer to S3 Express One Zone as S3X and Azure with Premium storage as AzureX for brevity.

Performance

Benchmark setup (click to expand)

Throughput is measured using a custom YCSB client wrapping each cloud's Java SDK. The synthetic workload issues conditional updates against a single, 2KiB object _without retries or backoff_ to determine upper-bound throughput.

CAS operations read the full object before conditionally overwriting (reflecting real applications that merge state). Append workloads write 256-byte chunks until reaching 9900 appends (~2.4MiB), then reset via CAS. Measurements were taken in mid-June 2025.

To avoid intra-SDK side effects from caching, batching, and throttling, each client runs in its own JVM/process. Throughput and latency are measured over the time window when all JVMs ran concurrently: from the time the last client started to the time the first client finished. All runs used Ubuntu 20.04 images from the cloud vendor. Throughput and latency are averaged over five runs, each running for five minutes. I did not attempt to control for diurnal patterns or other known sources of variance, as we are concerned only with the broad differences between clouds and the viability of a shared storage architecture.

The benchmark models a roughly catalog-sized object in storage. This could be literally the Iceberg catalog, or really _any_ resource in object storage that concurrent, conflicting transactions need to update. If we allow conditional writes on the commit path, we want to know when they could become a bottleneck.
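
For reference, a minimal sketch of the append workload loop described above. The `ConditionalStore` interface is a hypothetical stand-in for the per-cloud SDK wrappers in the YCSB client; the real client also records latencies and error counts.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical stand-in for the per-cloud SDK wrappers used by the benchmark client. */
interface ConditionalStore {
    long length(String key);                                        // current object length
    boolean appendIfAt(String key, byte[] chunk, long offset);      // conditional, position-based append
    boolean resetIfUnchanged(String key);                           // CAS back to a fresh 2KiB base object
}

final class AppendWorkload implements Runnable {
    private static final int CHUNK = 256;
    private static final int MAX_APPENDS = 9_900;                   // stay under the 10,000 append limit
    private final ConditionalStore store;
    private final String key;
    AppendWorkload(ConditionalStore store, String key) { this.store = store; this.key = key; }

    @Override public void run() {
        byte[] chunk = new byte[CHUNK];
        while (!Thread.currentThread().isInterrupted()) {
            ThreadLocalRandom.current().nextBytes(chunk);
            long offset = store.length(key);                        // observe the current tail
            if (offset >= (long) MAX_APPENDS * CHUNK) {
                store.resetIfUnchanged(key);                        // reset via CAS near the append limit
            } else {
                store.appendIfAt(key, chunk, offset);               // a failure is recorded, never retried
            }
        }
    }
}
```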

We used the following configuration (terraform scripts):

| Cloud | VM               | Region    |
|-------|------------------|-----------|
| AWS   | m5.4xlarge       | us-west-2 |
| Azure | Standard_D16s_v3 | West US   |
| GCP   | n2-standard-16   | us-west1  |

CAS (Compare-and-Set)

Google Cloud Storage

GCS CAS performance: throughput (left) declines from 1.4 to 0.8 op/s as failed operations rise from 0 to 1.6 op/s at 16 clients, reflecting the ~1 write/s throttle; latency (right) averages 500-700ms, starting near 700ms with large (~100ms) error bars

Kudos to the team responsible for throttling single-object requests. The single write per second limit is documented. Even the single-client CAS latency is over 700ms, which is quite high. I wrote the GCS team to verify that I was using the SDK correctly and had not overlooked a storage configuration suited to this workload, but did not receive a response.

Given the documentation, it seems likely that this is performing as designed. I did not modify the benchmark to use composite object support to emulate append, as I assume the same limit applies.

A guess: versioning support probably retains multiple versions of the data, and without throttling the system could be overwhelmed. It would be worth testing whether larger objects can be read without interruption from concurrent writes, e.g., a workload of readers starting 100ms apart, interrupted by a write every 2s. CAS operations in Azure sometimes interrupted readers.

Azure Blob Storage

Azure CAS throughput: Standard (left) sustains steady goodput of 9.2-10.7 op/s with failed operations rising to 36 op/s; Premium (right) sustains 14.8-16.6 op/s goodput with failed operations rising to 39.1 op/s

Goodput remains steady as the rate of failed operations increases with parallelism. Premium supports ~50% more goodput than standard storage.

A pipeline of conditional requests is mostly waste: a successful write invalidates all outstanding requests, so only one can succeed per round-trip. This fundamentally restricts conditional write throughput.
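
A back-of-the-envelope bound (my arithmetic, not a measured figure) makes this concrete: if only one conditional write can succeed per round trip, goodput is capped at roughly 1 / (latency of a successful operation), no matter how many clients pile on. At Standard’s ~100ms that is about 10 op/s, and at Premium’s ~60-70ms about 15 op/s, which is close to the measured goodput; the same arithmetic roughly predicts the S3 and S3X numbers below.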

Azure CAS latency under contention: Standard (left) increases roughly linearly from ~100ms to ~550ms across 1-16 clients; Premium (right) from ~50ms to ~350ms

The increasing latency of successful operations in Azure is not innate to conditional writes. The benchmark does not retry failed requests; the increase in latency must be attributed to the store, as every client runs in its own process.

Azure Blob Storage had some odd behaviors worth noting, against the advice of my generous reviewers whose taste is otherwise impeccable.

Azure CAS anomalies (click to expand)

Anomalies

First, the length reported after a CAS is sometimes _zero_. This happens intermittently; subsequent reads return the correct length. It's unclear if this is a consistency issue, client bug, or a bug in my implementation, but this (invalid) state should not be visible to clients between CAS operations.

Second, CAS operations sometimes caused outstanding reads to fail, particularly in the append workload. These errors were infrequent, but this microbenchmark is on small data and reads of larger objects might be more susceptible to interruption by concurrent writes.

Third, throughput using SAS tokens (Shared Access Signatures that delegate access without exposing the storage account key) is significantly higher than the default authentication mechanism, as shown below. At high client counts, errors include timeouts from whatever ancillary service is authenticating requests. Creating different SAS tokens per client has no effect on throughput, so the throttling appears to be from the default auth service.

Azure CAS throughput with default authentication (vs. SAS tokens): Standard (left) goodput is throttled from 8.5 down to 3.1 op/s as clients increase; Premium (right) from 6.0 down to 3.2 op/s. Both are significantly lower due to auth service throttling

AWS: S3 and S3X

AWS CAS throughput: S3 (left) sustains ~15 op/s, similar to Azure Premium, with errors increasing linearly with clients; S3X (right) achieves ~75-80 op/s goodput, roughly 5x higher

S3 CAS throughput is similar to Premium Azure Blob Storage. S3X CAS throughput is roughly 5x what we see in any other store. Again, most of the pipeline is wasted work (nearly 600 failed op/s!), but few Iceberg warehouses require this kind of throughput, even across tables.

AWS CAS latency: S3 (left) stays within 58-68ms across client counts with no clear pattern; S3X (right) stays within 14-26ms, increasing with client count. Both are much tighter than GCS and Azure

The S3 and S3X latencies have low variance compared to GCS and Azure: the spread in CAS latency is within 10ms for S3 and within 12ms for S3X, compared to the hundreds of milliseconds separating min/max latencies in GCS and Azure. S3X is neither general-purpose storage nor magic (it restricts several S3 features to deliver this performance), but it delivers what it claims on its label.

Append

Next, let’s measure position-based append performance. There’s a long history of log-structured systems delivering high throughput over append-only logs. Unfortunately, the constraints on position append described above make high throughput unrealizable in practice.

Azure

Azure append throughput: Standard (left) sustains steady goodput of ~11 op/s with errors rising from 0 to 4.9 op/s; Premium (right) sustains ~14 op/s with errors rising to 5.7 op/s. Both resemble the CAS throughput patterns above

I’ll indulge in some eyeball statistics and call these pretty similar to the CAS throughput graphs above; the same read-then-conditionally-write cycle applies here, minus the merge. Because we do not retry operations in this benchmark, we’re not exercising the path where failed appends are retried at the new position; including retries would make latency a noisy proxy for retry rates.

For small objects, append offers no salient advantage over CAS.

Azure append latency: Standard (left) holds fairly steady at 90-100ms; Premium (right) at 71-74ms. Both are lower than CAS latency, especially under contention

Latency of successful append operations is lower than CAS operations, particularly under contention. Only one or two runs on Azure appended enough data to reach the append limit, so we omit the interleaved CAS results.

S3 Express One Zone (S3X)

S3X append: throughput (left) exceeds 90 op/s goodput, with errors surpassing 531 op/s at 16 clients; latency (right) shows append latency rising from 12 to 24ms while interleaved CAS operations exceed 10 seconds

Append throughput is a striking ~90 ops/s, but CAS latency over 10 seconds? What is going on? Breaking down the latency of the CAS operation reveals that almost all the time is spent on the initial read of the object before the conditional write.

As we gradually add reads into the mix (0.1%, 1%, 10%, 30%), the mean latency of the CAS operation increases with contention:

S3X append latency breakdown with varying read percentages: CAS latency increases as the read percentage grows, revealing lazy composition behavior

It’s not a coincidence that the append limits in S3X match the multipart upload limits (10,000 parts of at most 5GiB each)8. It appears that S3X accumulates parts (not enforcing the 5MiB minimum part size) until the object is read, and then lazily composes them into a single object. This is quite different from Azure’s implementation, which pays this cost eagerly for each append.

Discussion

First, some basic takeaways from the benchmarks. GCS was not designed to support this workload; the use case we’re interested in- linearizable writes on a single, small object- is explicitly throttled below a rate that could support even modest transaction rates. Azure Blob Storage and standard S3 sustain 10-15 conditional writes per second. The append implementations in both Azure and S3X exhibited anomalies in correctness and performance respectively, which complicates the implementation of optimistic protocols directly in these object stores.

Immutable data objects remain the core use case for object stores, even at premium tiers; stronger consistency and conditional operations are exciting, but insufficient to transform object stores into consensus systems.

Position Append is not (really) Multi-writer

Adding a “record append” API to object stores would allow the client to conditionally append not to a particular offset of a particular object, but to the end of a particular object… wherever it is.

Record append is not an easy API to use well, but it can improve throughput substantially9 and allow readers to tail the object coherently. Readers need to tolerate duplicates and implement framing for objects larger than the maximum append size (e.g., Azure Blob Storage’s 4MiB limit). To be fair: evidently the state of the art in position append isn’t that easy to use well either, at least for optimistic concurrency control.
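
For illustration, a minimal sketch of the framing and deduplication a record-append reader would need. The record layout (length, writer id, sequence number) is mine, not any store's format; the idea is simply that retried appends can land twice at distant offsets and must be skipped on replay.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

final class RecordFraming {
    // Hypothetical record layout: [int length][long writerId][long seq][payload]
    private static final int HEADER = Integer.BYTES + 2 * Long.BYTES;

    static ByteBuffer frame(long writerId, long seq, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER + payload.length);
        buf.putInt(payload.length).putLong(writerId).putLong(seq).put(payload);
        buf.flip();
        return buf;
    }

    /** Scan a log, delivering each writer's records at most once per sequence number. */
    static void replay(ByteBuffer log, BiConsumer<Long, byte[]> deliver) {
        Map<Long, Long> lastDelivered = new HashMap<>();            // writerId -> highest seq delivered
        while (log.remaining() >= HEADER) {
            int length = log.getInt();
            long writerId = log.getLong();
            long seq = log.getLong();
            if (log.remaining() < length) break;                    // torn tail: incomplete record
            byte[] payload = new byte[length];
            log.get(payload);
            if (lastDelivered.getOrDefault(writerId, Long.MIN_VALUE) >= seq) continue; // duplicate from a retry
            lastDelivered.put(writerId, seq);
            deliver.accept(writerId, payload);
        }
    }
}
```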

To belabor a point: when a successful action invalidates all outstanding work, conditional operations create a lot of waste under load. Azure and S3 sustain steady throughput under contention, but as we see in these microbenchmarks, another service needs to be in front of the object store to batch operations and reduce conflicts to achieve higher throughput.

On the bright side, we can roll our own multi-writer append on top of conditional writes.

Emulating Record Append

Any service that guarantees the durability and ordering of writes can emulate record append on top of an object store supporting conditional writes. As in the (inferred) implementation of S3 Express One Zone append, primitives like multi-part uploads or composite objects in GCS can stitch together multiple writes into a single object. The trick is to get a consistent view for readers and lazily compact to the object store, coordinated using conditional writes.

AWS Object Lambdas can intercept GET, HEAD, and LIST requests to S3 objects. If writes are funneled through a FIFO queue service like SQS, an object lambda can read a checkpoint in S3 and apply the queued appends to a particular object. On GET requests, the lambda replays queued writes over the base object from S3.

Asynchronously, the lambda function can conditionally write the compacted object (or append a prefix of the queue) to S3 and remove the corresponding entries from the queue. Other instances running concurrently may return a stale view, but never an inconsistent one. Conditional writes prevent compactions from losing state.

The object lambda adds cool points for being serverless and transparent to read-only transactions, but there’s no magic here. Clients could access the queue directly or any service could stand in front of S3 to batch and order writes.
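
To make that concrete, a minimal sketch of the replay-on-read and compaction logic, independent of whether it runs in an Object Lambda, a proxy service, or the client. The `Base` and `Queue` interfaces are hypothetical; the queue supplies ordering and durability for appends, and the conditional write keeps racing compactors from losing state.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;

final class ReplayOnRead {
    /** The object store holding the compacted base object (hypothetical interface). */
    interface Base {
        Versioned readBase(String key);
        /** Conditionally overwrite the base; appliedUpTo is stored as object metadata. */
        boolean putIfMatches(String key, byte[] value, long appliedUpTo, String etag);
        record Versioned(byte[] value, String etag, long appliedUpTo) {}
    }
    /** A FIFO queue (e.g., SQS FIFO) providing ordering and durability for appends. */
    interface Queue {
        List<byte[]> readFrom(long sequence);   // appends after the checkpoint, in order
        void trimThrough(long sequence);        // drop entries already folded into the base
    }

    /** GET path: replay queued appends over the base. The view may be stale, never inconsistent. */
    static byte[] read(Base base, Queue queue, String key) throws IOException {
        Base.Versioned snapshot = base.readBase(key);
        ByteArrayOutputStream view = new ByteArrayOutputStream();
        view.write(snapshot.value());
        for (byte[] append : queue.readFrom(snapshot.appliedUpTo())) view.write(append);
        return view.toByteArray();
    }

    /** Asynchronous compaction: conditionally write the merged object, then trim the queue. */
    static void compact(Base base, Queue queue, String key) throws IOException {
        Base.Versioned snapshot = base.readBase(key);
        List<byte[]> pending = queue.readFrom(snapshot.appliedUpTo());
        ByteArrayOutputStream merged = new ByteArrayOutputStream();
        merged.write(snapshot.value());
        for (byte[] append : pending) merged.write(append);
        long watermark = snapshot.appliedUpTo() + pending.size();
        if (base.putIfMatches(key, merged.toByteArray(), watermark, snapshot.etag())) {
            queue.trimThrough(watermark);       // safe: the merged object is durable in the store
        }                                       // a lost race means another compactor already advanced the base
    }
}
```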

Implications for Table Formats

So what does this mean for table formats? There are still opportunities to reduce write amplification, particularly using append. But for coordination through storage: any format needs to ensure real conflicts are rejected, making the rates measured here an upper bound on throughput under contention without an intermediary service.

Object stores are not (yet!) consensus systems. Data systems that use conditional writes should relegate them to a coarse, batch primitive for coordination, unless throughput demands are very low.

Thanks and Blame

Thank you Joe Hellerstein, Owen O’Malley, Tiemo Bang, Natacha Crooks, and Micah Murray for feedback on drafts. Ashvin Agrawal, Carlo Curino, Jesus Camacho Rodriguez, and Raghu Ramakrishnan at Microsoft (Gray Systems Lab) introduced me to table formats and helped shape many stages of a (now abandoned) research project on Iceberg. Mehul Shah’s prophecy steered me away from several pitfalls, except when I ignored his advice and fell into them anyway. Pavan Lanka and Sumedh Sakdeo offered useful insights into Iceberg workloads and ORC and Parquet internals, informing much of the work I’ll probably write about later. The misunderstandings and errors I managed on my own.

  1. Also IBM, Oracle, Alibaba 

  2. Cool storage disaggregation research excepted. 

  3. The Iceberg community deprecated its shared storage catalog (HadoopCatalog) last year, citing issues with atomicity in most storage systems. Clients now contact a catalog service to find the root of a table. Conditional writes could make a storage-only catalog possible, but we want to measure performance first to see if they are viable. 

  4. Tachyon includes a wonderfully creative mechanism for tying recovery and priority to resource allocation. 

  5. The Olympia catalog format also experiments with this approach in S3, using hint objects. The HadoopCatalog used a similar approach, but relied on rename as its atomic, if-absent operation and consistent listings to find the tail. 

  6. As a side-effect, including this data in Iceberg InputFile enforces its implicit immutability contract: if the object changes after being read, subsequent reads will fail the etag/generation check. 

  7. Alibaba and Huawei also support append. 

  8. Multi-part uploads allow clients to upload a file in chunks that remain invisible until a (destructive) complete operation combines all the parts into a single object. Parts can be reordered after upload, but once the complete operation is issued the object is immutable. 

  9. See CORFU for treatment of distributed log-based storage with record append semantics.