One of the reasons we started Tabular is that data engineers are asked to do far too much unnecessary work. These tasks range from simple annoyances — explaining again that you have to add extra filters! — to major distractions, like rewriting every single query that uses a table when partitioning needs to change. Tabular’s mission is to build better tools that avoid these problems or automate them so data engineers can focus on delivering business value.
In this post, I want to explore the tasks that data engineers should no longer need to worry about and how you can stop wasting time on them.
Anyone that has worked with Hive-like tables can tell you how important for performance it is to understand partitioning. The defining trait of Hive-like tables is using a directory structure to organize data and exposing that structure as special table columns. This simple model is effective for speeding up queries, but hinges on manually adding extra filters. For example, data engineers commonly see questions about queries like this:
> SELECT count(1) FROM events WHERE event_ts >= today() 4935 Query completed in 1h 4m 22s <-- 😱
Why did this take over an hour to count less than 5,000 records? Because without an extra filter on the
event_date partition column, the query had to read the entire table! Analysts and other data consumers get tripped up by this all the time and have to be trained to watch out for it. And it’s not easy to trick BI tools to do this either.
Formats like Iceberg avoid this problem with hidden partitioning that keeps track of the relationships between columns and the partition layout and can filter data files automatically. There’s no need to teach people to work around a broken model.
Cleaning up mistakes
Data engineers also spend too much time fixing avoidable mistakes. The problem isn’t that people make mistakes — that’s hardly avoidable. The problem is that data infrastructure allows them, and that fixing those mistakes is incredibly time consuming.
A simple example is schema evolution in Hive-like tables: sometimes it’s safe to rename or drop a column… and sometimes it’s not. If you get it wrong, you not only have to pause to clean up any data that was written after the change, but also track down and fix places that used the bad data.
Hive partitioning introduces similar problems. It is the data engineer’s responsibility to assign each row to the right directory by providing partition column values. If you use event time instead of processing time or get the time zone wrong, you’ll have to spend a few days fixing the mess.
Both of these types of mistakes are avoidable: updating a schema should always be safe, and data layout or partitioning should be automatic to avoid human error.
This probably sounds familiar because it is exactly what SQL tables do. The killer feature of a table is that it separates responsibility: data engineers focus on the logical data, while databases handle the low-level details.
Another headache left to data engineers is making copies of datasets. Not only does this require someone to babysit a scheduled job that maintains a copy, it also introduces hard synchronization and governance questions: Is the copy reliable? When was it updated? How long do updates take to propagate? How are access controls to the copy managed? It’s better to share rather than to copy.
Why are data engineers making copies? There are many reasons, but the most common ones are becoming obsolete:
- Warehouse copies – for some use cases, like BI, there’s nothing like a data warehouse to meet the strict performance demands and handle wacky auto-generated SQL queries. With Iceberg gaining wide support, warehouses can now share tables and run queries directly on the source of truth.
- Alternative layouts – partitioning provides a primary index for speeding up queries, but it is limited in Hive-like tables because the number of partitions quickly introduces too much overhead. Iceberg indexes metadata to easily handle millions of partitions and uses column stats to prune out even more data files. Iceberg easily handles multiple query patterns without needing alternative copies.
- Migration – the last reason to copy data is migration. What happens when data volume increases and partitioning needs to be changed? In Iceberg, you can update the table partitioning in place. No need to copy data into another table and no need to rewrite every query for the new partition layout.
The critical component is flexibility: tables should adapt to different uses and must be compatible across many open source and commercial tools. You should use the right tool for your task or your team — without needing to move data around.
It may seem like I’m talking about Iceberg a lot, but I actually don’t want people to care. I want people to know that these are solved problems so that we can collectively focus on the next set of challenges and tasks that should be solved by great data infrastructure.
Now that we have a format like Iceberg that solves the fundamentals, where do we focus next?
That’s the subject of my next post, Part 2: What are the problems that data engineers should be able to stop worrying about?