This recipe builds on branching and tagging basics to implement a powerful pattern, commonly referred to as Write-Audit-Publish (WAP), for integrating data quality auditing into your pipelines. Just as Apache Iceberg branching and tagging are similar to their git counterparts, the WAP pattern mimics common Continuous Integration and Delivery (CI/CD) workflows.
The basic pattern is as follows:
- Write the data – Commit the changes to a staging branch (instead of main).
- Audit the data – Run data quality checks and other validation against the branch.
- Publish the data – If the checks pass, fast-forward the staging branch onto the main branch.
Implementing the WAP pattern allows data producers to validate all changes to an Iceberg table before making those changes available to downstream consumers. This can prevent incorrect dashboards, reduce the data quality issues that lead to data downtime, and help build trust in data throughout an organization.
Implementing WAP
As an example, you can enhance the MERGE recipe by using the WAP pattern.
Writing to a staging branch
You can write directly into an Iceberg branch by using the branch_<branch_name> syntax, but the Iceberg Spark extensions include a convenient mechanism to enable the WAP pattern through configuration. This makes it very easy to apply the pattern to existing Spark jobs without needing to alter production code. Once the properties are set, you can run the table update logic without modification.
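For reference, a direct write to a branch (without the WAP session properties) looks roughly like the following. This is a minimal sketch: the examples prefix mirrors the query later in this recipe, the branch must already exist before the write, and the inserted values assume the accounts table has the account_id and balance columns used in the MERGE below.
-- Create the staging branch before writing to it directly
ALTER TABLE examples.accounts CREATE BRANCH merge_new_transfers;
-- Write into the branch using the branch_<branch_name> identifier
INSERT INTO examples.accounts.branch_merge_new_transfers VALUES (1001, 250.00);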
To enable the WAP pattern for a table and a Spark job, first set write.wap.enabled on the table.
ALTER TABLE accounts SET TBLPROPERTIES ('write.wap.enabled'='true')
Second, set the Spark session property spark.wap.branch to a unique name for each run. This is often added to job-specific configuration passed with --conf to spark-submit or spark-sql.
spark-sql \
--name 'Update accounts from transfers' \
--conf spark.wap.branch=merge_new_transfers \
--define last_processed_ts=${last_ts} \
-f merge-new-transfers.sql
With the WAP branch set in the Spark job’s configuration, writes to tables with the pattern enabled will be staged in the branch instead of updating the current table state. Note that the only change to the job was adding --conf with a branch name. The SQL script stayed the same:
-- merge-new-transfers.sql (from the MERGE recipe)
MERGE INTO accounts AS t -- Target rows
USING transfers AS s -- Source rows
ON t.account_id = s.account_id AND -- ON clause
s.ts > ${last_processed_ts}
-- Action cases
WHEN MATCHED AND s.amount IS NULL THEN
DELETE
WHEN MATCHED THEN
UPDATE SET t.balance = t.balance + s.amount
WHEN NOT MATCHED THEN
INSERT (t.account_id, t.balance) VALUES (s.account_id, s.amount)
Auditing staged data
After the job completes, the merged data will be in the merge_new_transfers branch and not in the main branch. Standard queries against the accounts table are not yet aware of these changes. This gives us the opportunity to verify that the merge has made the intended changes.
How you audit the data is up to you. The nice thing about the data being staged in an audit branch is that you can do it within the same ETL job, or have another tool do it. If you perform the auditing within the same Spark job, then any queries against the accounts table will use the branch specified by spark.wap.branch. Otherwise, you can directly query the staged data using either FOR VERSION AS OF or branch_<name>.
Here is a simple validation that the MERGE has not produced any duplicate values.
-- In the same Spark session, you can query the table directly
SELECT account_id, count(*)
FROM accounts
GROUP BY account_id HAVING count(*) > 1;
-- Otherwise, you can query the branch directly
SELECT account_id, count(*)
FROM accounts FOR VERSION AS OF 'merge_new_transfers'
GROUP BY account_id HAVING count(*) > 1;
SELECT account_id, count(*)
FROM examples.accounts.branch_merge_new_transfers
GROUP BY account_id HAVING count(*) > 1;
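If you also want to confirm that the staging branch exists and see which snapshot it points to, one option (not part of the original job) is to query the table's refs metadata table, available in recent Iceberg releases:
-- List the table's branches and tags along with the snapshots they reference
SELECT name, type, snapshot_id
FROM examples.accounts.refs;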
Publishing staged data to the main branch
If the audits are successful, then, much like merging a git pull request, you can fast-forward all of the updates made to the staging branch onto the table’s main branch. The recommended method is the fast_forward Spark procedure.
CALL catalog_name.system.fast_forward('accounts', 'main', 'merge_new_transfers')
If the audits are unsuccessful, then you can easily drop the staging branch.
ALTER TABLE accounts DROP BRANCH merge_new_transfers;
Benefits of write-audit-publish
There are a number of benefits to this pattern:
- Works with any data quality tool – You can plug in any data quality tool you want or multiple different tools where it makes sense.
- The complete table state is available – You have the full potential state of the table available for more sophisticated validation. For instance, you can look at statistical distributions or perform anomaly detection. You can even compare the staged data against the current state of the table in the main branch.
- Humans are in the loop – WAP allows you to easily put a human in the data validation loop. In practice it is common to have some tests that will automatically fail the job and drop the staged data and others that will alert a data engineer to take a closer look. Perhaps data volumes jumped because of a holiday or significant business event. In that case, a human can decide that the data is valid and then manually run the publish step without having to re-run potentially expensive updates.
- Validate multiple tables – Data pipelines often involve more than one table. It is trivial to extend the WAP pattern to stage updates to multiple tables using a common branch name and perform cross-table validation, as sketched below. Then, if all the data looks good, publish all of the tables at once.
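A minimal sketch of the multi-table case follows. The transfers_summary table and the nightly_update branch name are illustrative; the publish step uses the same fast_forward procedure shown above.
-- Enable WAP on every table the job writes to
ALTER TABLE accounts SET TBLPROPERTIES ('write.wap.enabled'='true');
ALTER TABLE transfers_summary SET TBLPROPERTIES ('write.wap.enabled'='true');
-- Run the job once with a shared staging branch, e.g. --conf spark.wap.branch=nightly_update
-- After cross-table validation succeeds, publish every table's staged changes together
CALL catalog_name.system.fast_forward('accounts', 'main', 'nightly_update');
CALL catalog_name.system.fast_forward('transfers_summary', 'main', 'nightly_update');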
Moving beyond the WAP pattern, higher quality and trust make data-driven orchestration possible. In a world where you can trust data that lands in the main branch of a table as the result of a closed-loop audit process, data pipelines and reports can be based on data readiness rather than merely job success signals. This builds on the WAP pattern and uses Iceberg tags to mark validated snapshots. For instance, you might use daily tags to signal that data for a given day has been processed and audited. The tags provide a clear signal to all downstream processes and consumers that the data is “ready for consumption,” even if they are not using a common orchestration tool.
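As a minimal sketch of that approach, once a day's data has been published and audited, the pipeline could tag the validated snapshot; the tag name and retention period below are illustrative.
-- Tag the current, validated snapshot so downstream consumers know the day is ready
ALTER TABLE accounts CREATE TAG `EOD-2024-03-01` RETAIN 90 DAYS;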