Introducing Tabular

I’m happy to announce that Dan Weeks, Jason Reid, and I have raised a Series A from a16z to build a new kind of data platform around Apache Iceberg. Iceberg is the open standard for huge analytic tables in the cloud, and it’s currently used at companies like Netflix, Apple, Adobe, LinkedIn, Expedia, Stripe, and more.

Now, if you’d asked me in February whether I wanted to start a company this year, the answer would have been a definitive: nope! My wife and I were expecting baby #2 in less than a month, so it would be a ridiculous time to start a company. But here I am a few months later, introducing Tabular.

Before we get to Tabular, it’s important to understand a little bit more about the motivation for building Iceberg.

People shouldn’t fight infrastructure

An engineer friend of mine once complained, “we’re not cogs—we’re artisans!” The phrase has stuck with me for the last decade because I’ve seen that frustration many times. Data engineers and data scientists exhaust far too much energy fighting the shortcomings of their data infrastructure. These shortcomings come in two main categories.

First, data lakes are full of pitfalls and frustrations that force people to become experts in quirky limitations instead of getting things done. Dropping a column can silently corrupt query results, and not knowing to add a redundant filter to a query can waste days of an analyst’s time, not to mention rack up cloud costs.

Second, the big data ecosystem has been pawning problems off on the wrong people. The people using these technologies should be focused on building relevant and reliable data products, but instead they’re forced to waste time worrying about how many files their SQL produces. Data infrastructure should do more instead of needing people to shore up its many gaps.

We think that saving people time and removing headaches is a crucial next step in data infrastructure. It is far better than saving compute time. Not (just) because people cost more, but because happy people are better engineers. When you’re happy and empowered by the tools that you use, you’re going to do a better job at whatever it is you do, from vaccine research to building rockets to entertaining the world.

As a result, the core of Iceberg’s philosophy is keeping people happy: data infrastructure should work without unpleasant surprises. Most of the time, it should be invisible.

First step: fix tables

Dan and I created Iceberg because we recognized that the table format was the common thread in many, if not most, of these frustrations and problems, which were heightened by running Netflix’s early cloud-native data platform on S3. Without atomic commits, every change to a Hive table risks correctness errors elsewhere. As a result, automation to fix problems was a pipe dream, and maintenance was left to data engineers. Iceberg tackled atomicity to make automation possible, even in cloud object stores.
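
To make the idea of an atomic commit concrete, here is a minimal Python sketch of an optimistic commit: a writer builds a complete new snapshot off to the side, then swaps a single pointer to it only if nobody else has committed in the meantime. This is a toy illustration of the general technique, not Iceberg's actual code. The names `Snapshot`, `ToyCatalog`, and `commit_append` are hypothetical, and an in-process lock stands in for whatever atomic operation a real catalog provides.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Immutable list of data files that make up one version of a table."""
    version: int
    files: tuple = ()


class ToyCatalog:
    """Hypothetical catalog that holds one pointer to the current snapshot."""

    def __init__(self):
        self._lock = threading.Lock()  # stand-in for a real atomic swap
        self._current = Snapshot(version=0)

    def current(self):
        return self._current

    def compare_and_swap(self, expected, new):
        """Move the pointer only if no one committed since `expected` was read."""
        with self._lock:
            if self._current is not expected:
                return False
            self._current = new
            return True


def commit_append(catalog, new_files, max_retries=5):
    """Build a new snapshot, then publish it with a single atomic swap."""
    for _ in range(max_retries):
        base = catalog.current()
        candidate = Snapshot(base.version + 1, base.files + tuple(new_files))
        if catalog.compare_and_swap(base, candidate):
            return candidate
        # Another writer won the race: re-read the table state and try again.
    raise RuntimeError("too many concurrent commits")


catalog = ToyCatalog()
committed = commit_append(catalog, ["part-0001.parquet"])
```

Readers in this sketch only ever see a complete snapshot, either the old one or the new one, which is the property that makes it safe for an automated service to rewrite a table while people are querying it.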

Crucially, Iceberg didn’t just make commits safe and reliable. Going back to our philosophy of avoiding unpleasant surprises, we sprinkled in best practices from the SQL world. We added reliable schema evolution that doesn’t require rewriting data, made partitioning invisible to data consumers, and moved configuration to tables where it belongs.
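
One way to see why schema changes need not rewrite data is to track columns by a stable ID rather than by name or position, which is the approach Iceberg takes. The Python sketch below is a simplified model of that idea rather than Iceberg's implementation. The `read_row` helper and the dictionary layout are assumptions made up for illustration.

```python
def read_row(stored_row, schema):
    """Project a stored row (keyed by field ID) through a schema (ID -> name)."""
    return {name: stored_row.get(field_id) for field_id, name in schema.items()}


# In this toy model, data files store values by field ID, never by column name.
stored_row = {1: 42, 2: "alice", 3: "US"}

schema_v1 = {1: "id", 2: "name", 3: "country"}

# Rename "name" to "full_name", drop "country", and add "email" with new ID 4.
# Only the schema changes; the stored row is untouched.
schema_v2 = {1: "id", 2: "full_name", 4: "email"}

# Later, add a brand-new column that happens to reuse the name "country".
# It gets a fresh ID (5), so it cannot resurrect the dropped column's values.
schema_v3 = {1: "id", 2: "full_name", 4: "email", 5: "country"}

row_v2 = read_row(stored_row, schema_v2)
row_v3 = read_row(stored_row, schema_v3)
```

Here `row_v2` reads the renamed column correctly and returns `None` for the new `email` column, while `row_v3` returns `None` for the re-added `country` rather than the old value `"US"`. That second case is the kind of silent surprise that name-based or position-based schemes can produce.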

As a result, Iceberg has been more transformative than we anticipated. At Netflix, we made data flowing in from Kafka available in minutes rather than hours. We replaced an in-memory database with a multi-petabyte Iceberg table and Trino at a fraction of the cost. We built services to automate compaction, migrate across AWS regions, tidy up metadata, and enforce retention policies. And we finally stopped getting support calls after renaming a column broke something.

There are even more examples of the rewards of good design, like time travel, metadata tables, and the potential to branch and tag snapshots. Those will have to wait for future posts.

Along the way, Netflix donated Iceberg to the Apache Software Foundation, where the majority of the innovation has happened. It’s amazing what the Apache Iceberg community has built. Iceberg’s continued success depends on a diverse community building a common and open standard. I’ll have more on this in a later post, but we pledge to support and contribute to the independent community—never to control it or do harm. I’m proud to be a part of it and happy that Tabular will also support and participate in the community.

That brings me back to February, an upcoming baby, and quitting my job.

An independent data platform

The Iceberg community is building something amazing that is already changing how we work. Iceberg enables you to work seamlessly in multiple processing engines, each of which is focused on being the best in its niche. Iceberg finally makes it possible to build data services to manage tables, instead of expecting people to coordinate scheduled maintenance jobs. And because Iceberg is truly open—you can participate in its evolution, not just see the code—it is becoming the ubiquitous standard for cloud-native tables.

This transformation creates a need for an independent data platform. Instead of being tied to a particular processing engine or cloud provider, we need a data layer that works well with all of them and is unbiased. Products like Snowflake are great at what they do, but no single engine can be great in every situation. Consequently, storage and management benefit from being decoupled from any single compute framework to increase flexibility. That’s why an open standard like Iceberg is vitally important.

The potential to build a fundamentally better data platform that is independent, cloud-native, and actively maintains data is why my co-founders and I are building Tabular. And that’s why I’m strangely excited to throw my life into chaos for a while.

What does this new data platform look like?

Tabular will start with a painless way to get running with Iceberg. Using Iceberg requires running some infrastructure. At Netflix and similar tech companies, that’s okay because those of us building Iceberg are also providing that infrastructure. But Iceberg shouldn’t be limited to companies with big infrastructure teams. Plus, even companies with those teams have better things to do than retool to roll out Iceberg support.

Tabular will also make data maintenance and optimization problems disappear. With Iceberg, we can now safely build services that actively manage tables. The data platform can take on more responsibilities for compaction, clustering, configuring, indexing, and more. It is time to build a platform that lets people forget about concerns like file sizes or formats. People should focus on building data products and answering questions.

If you’re excited about what we’re building, check back here periodically. In the meantime, you can join us in the Apache Iceberg community.