The CDC MERGE Pattern

cat pawing at balloon

The MERGE pattern is the first design I discuss in this series that maintains data directly in a mirror table rather than applying the latest set of changes at read time using a view. This is the go-to solution for most people, and it generally works well enough. But it’s not without a few gotchas. This post covers the pattern itself and the trade-offs introduced by batch updates.

If you haven’t already, you may want to read the previous posts in this series:

CDC MERGE pattern basics

Let’s start with a look at the most direct implementation. The data is maintained by updating a mirror table structured for the bank account example, like this:

CREATE TABLE accounts (
    account_id bigint,
    balance decimal(16,2),
    tid bigint,
    last_updated timestamptz)
PARTITIONED BY (bucket(128, account_id))

This DDL sets the account_id as the table’s primary key and configures the layout to help reduce work by partitioning the key space into buckets.

Just as it sounds, the MERGE pattern uses a MERGE command to update the table using incoming changes. Existing rows and updates are matched on account_id. The CDC updates come from a changes source — more detail on that later. Each change is handled using the first WHEN clause that applies:

MERGE INTO accounts a USING changes c
ON a.account_id = c.account_id
    SET a.tid = c.tid,
        a.balance = c.balance,
        a.last_updated = c.last_updated
    INSERT (account_id,   balance,   tid,   last_updated)
    VALUES (c.account_id, c.balance, c.tid, c.last_updated)

The command is boringly straightforward: update or insert depending on whether the account is already present in the mirror table, and drop the row if the change was a delete. Unfortunately, this is too simple and quickly hits errors in actual use:

TrinoException: One MERGE target table row matched more than one source row

That’s basically saying there was more than one change for an account, and you aren’t allowed to update a row twice. That makes sense: you can’t update an account balance if there is more than one balance to choose from. But why are there multiple updates?

It’s batches all the way down

Up until now, we’ve been able to ignore an important detail of analytic tables: writes are inherently batch operations. This is true of all data lake tables, regardless of format (Iceberg or Hive-like formats). Cloud storage doesn’t support modifying files, and file formats such as Parquet are designed to be immutable. So all changes either add new files or completely replace existing files.

Even when a table is used as a stream, the commits are batch operations that produce discrete versions (snapshots, in Iceberg terminology). This is a fundamental trade-off of using analytic tables. Batching work into larger operations raises efficiency but comes with consequences. It is also critical for achieving exactly-once writes.

Consequences of batch processing in CDC MERGE operations

There are 3 significant ramifications of batch processing in this context:

  1. The double update problem that the direct MERGE ran into above. Multiple updates in a batch must be squashed into a single change per row because the query engine doesn’t know how to pick the “right” one. This complicates the logic of a job. But — more importantly — it also loses change information. The MERGE pattern and other batch operations don’t preserve the actual history of a table for downstream consumption.
  2. The lossy nature of batch updates breaks fine-grained time travel. If the merged mirror table guarantees transactional consistency, then at least time travel to a prior snapshot is safe. But the double update problem suppresses the changes needed to reconstruct states between snapshots. And if the mirror table is eventually consistent, then time traveling will bring back past inconsistencies.
  3. The hard trade-off between efficiency and latency. Latency is how far behind the mirror table is from the source table in time. The trade-off is that as the frequency of commits increases, the total amount of work needed to update the table also increases. Larger batches increase efficiency; smaller batches decrease it. This is one reason why it’s impractical to solve the double update problem by making commits so small they have just one transaction. Also, note that this problem is distinct from write amplification — the subject of the next post — because that concerns tricks for deferring work until later.

Making the CDC MERGE pattern work

Avoiding the double update problem has a familiar solution: the window CTE from the earlier posts that powered the mirror view. In fact, you can use a variation of mirror view that keeps delete records — let’s call this accounts_changes. I’ll use that view in place of the changes source that I skipped earlier.

To achieve transactional consistency, the MERGE uses the changes between the last run’s “valid through” tid and the current one.

MERGE INTO accounts a USING accounts_changes c
ON a.account_id = c.account_id AND
    tid >= ${last_run_tid} AND
    tid < ${valid_tid}
    SET a.tid = c.tid, a.balance = c.balance,
        a.last_updated = c.last_updated
    INSERT (account_id,   balance,   tid,   last_updated)
    VALUES (c.account_id, c.balance, c.tid, c.last_updated)

Reading the view is efficient because the lower bound on tid avoids reading the full history.

All that’s left is to orchestrate the job to run on a regular basis.

CDC MERGE pattern’s strengths and weaknesses

The MERGE pattern is often the reflexive choice for CDC mirroring because it can be implemented with just one table. Consumers can easily query the mirror table just as they would query the source. Meanwhile, the data producer has just one job to maintain. But using the MERGE pattern with a single table inherently trades features that are important — the double update problem means the table is not a reliable record of changes, and adding transactional consistency is much harder.

The biggest problem with using MERGE is that it’s not obvious that change history is lost. It’s easy to think that a single table can be both an accurate change log and the materialized mirror, but that opens the door to data gremlins. It isn’t safe to use a MERGE mirror table as a streaming change source.

On the other hand, parts of the MERGE pattern are really useful. I’ve long advocated that people shouldn’t need special training to query a table — like needing to understand partitioning and layout or getting tripped up by eventual consistency. MERGE mirror tables are simple to query and are read-optimized (in theory) because the work to determine the latest state of each row is done at write time. And the SQL engine running the MERGE can optimize the write plan and cluster data for faster scans or smaller storage costs.

To get the best balance of features, combine the MERGE pattern with a change log table, as shown in the SQL example above. Augmenting a MERGE mirror table with a change log table addresses the pattern’s shortcomings. You can consume the change log as an accurate streaming source that has much longer retention than either Apache Kafka or the source CDC log. And having a change log makes building transactional consistency guarantees much, much easier.

One last warning: latency is the Achilles heel of the CDC MERGE pattern. It’s tempting to run MERGE frequently to reduce the replication latency, but that’s a brute-force approach. Updating the mirror table twice as often roughly doubles the amount of work needed to keep it up-to-date. For a slow-changing table, this may not matter enough for you to care. But for a table with high-volume updates that require low latency, the MERGE pattern is often too inefficient to be a good strategy.

The next post in this series covers write amplification, and I’ll dive deeper into both write efficiency and the trade-offs of deferring work until read time using either a lazy or an eager strategy (that is, either merge-on-read or copy-on-write) to write MERGE changes.

Read Zen and the art of CDC performance