Hello, World of CDC!

batman and robin

This is the first in a series about mirroring transactional database tables into a data lake. Mirroring is an important tool. It keeps transactional databases isolated while still making the data available to the analytic world — otherwise, it’s far too easy for a rogue workflow to take down customer-facing infrastructure and everyone has a very bad day.

Change data capture and table mirroring

Table mirroring is part of the broader topic of Change Data Capture (more commonly, CDC). The idea behind CDC is to “capture” changes to a table as they happen for a variety of downstream uses – for example, to quickly update a search index. I’ll cover other CDC patterns in data lakes later, but table mirroring is the most common.

Using a CDC change stream for mirroring is a handy technique. It’s low-effort for the source database and produces updates quickly. But, as usual, there are trade-offs to consider.

Let’s put those trade-offs into context and start the series with the “Hello, World!” of mirroring patterns: the change log table. A change log table is just what it sounds like: a table of every row-level change to the source table over time.

Hello, Example Data!

An example using a change log for table mirroring

Let’s take a closer look with an example that helps illustrate patterns throughout this series: a table of bank accounts. The source table is called accounts and has columns for account ID and balance.

accounts
account_id: bigintbalance: decimal(16,2)
12404.94
28084.50
31940.31
......

This table is updated either when an amount is transferred from one account to another or when a new account is created. AWS DMS emits these changes as insert (I) or update (U) rows that have a timestamp and that contain the updated values of the source columns. To create the change log table, you simply store the emitted changes as is. That is, instead of storing the state of the source database at a given point in time, you record the full history by storing every change.

For example, here are the changes that show a new account creation with a $250 deposit and a $40 transfer from account 3 to account 2:

accounts_changelog
opchanged_ataccount_idbalance
I2023-06-01T14:12:019250.00
U2023-06-01T14:13:4431900.31
U2023-06-01T14:13:4428124.50
............

It may look odd that none of those rows appear to have a $40 transfer. These rows show the final balances after the transfer, not the transfer itself — account 3 had $1,940.31 and was updated to $1,900.31. This representation is called an UPSERT after the operation used to overwrite a whole row.

With the changes recorded, the remaining piece is to create the mirror. The change log has a row for every change to the source table. But the mirror should have one up-to-date row for each account. A view easily bridges this gap by using a window to select the latest version:

WITH windowed_changes AS (
    SELECT
        op,
        account_id,
        balance,
        changed_at,
        row_number() OVER (
            PARTITION BY account_id
            ORDER BY changed_at DESC) AS row_num
    FROM accounts_changelog
)
CREATE VIEW accounts AS
SELECT account_id, balance, changed_at
FROM windowed_changes WHERE row_num = 1 AND op != 'D'

I can already hear the objections — “that won’t scale!” — but the purpose here is to illustrate trade-offs. So let’s withhold judgment for just a moment and consider the unexpected benefits of this pattern. We’ll get to the objections shortly.

Hello, Trade-offs!

Unexpected benefits of using a change log for CDC

There are a surprising number of good things about this change log and view approach:

  • Simplicity – updates only require appending new changes. Appends are fast and never conflict with other operations such as compaction.
  • Accurate history – the change log table contains a reliable history of the source database table that’s immune to the lossiness caused by the “double update problem” (more on that later).
  • Change stream – accurate history and simple append-only writes create an ideal source for streaming changes.
  • Time travel – it’s possible to time travel to any point in time in the source table’s history, simply by filtering the view on changed_at to ignore newer changes.

Now, back to the “it won’t scale!” objections. Digging a bit deeper into scaling reveals not a problem but a trade-off: finding the latest version of a record is deferred until read time. Deferring work isn’t intrinsically good or bad. If the mirror is hardly ever read then it’s a good thing; when read frequently it’s a bad thing. Plus, not materializing the mirror table is what makes time traveling to any table state possible!

A change log table alone isn’t the right choice for most use cases, but there’s a surprisingly compelling case for the pattern, especially for dimensions. In fact, Netflix has used the change log pattern in production for years by running daily rewrites, similar to squashing old git commits into a single starting commit. The rewrites removed history but made the mirror queries fast.

The biggest takeaway is that directly writing changes to the mirror table isn’t always a good idea, notwithstanding the fact that it’s commonly an reflexive assumption and not a conscious choice. There’s a lot to gain from more flexible patterns.

The next post in this series covers transactional consistency. Transactional consistency is a powerful CDC best practice that helps you prevent correctness bugs.

Read CDC Data Gremlins