
Thoughts from Current 2023 and streaming into Tabular Iceberg


On September 26th and 27th, Confluent hosted the second annual Current event. Current is a conference focused on data streaming technologies such as Apache Kafka and Apache Flink. This was like a family reunion for me. Before joining Tabular as a Senior Developer Advocate, I had spent four years in the Kafka community, including two and a half years as a developer advocate at Confluent. It was great catching up with so many friends and colleagues in one place.

One might ask why, as a Tabular DA, I would go to a “streaming data” conference. I mean, Tabular is based on Apache Iceberg, and Iceberg is a batch technology, right? That’s a fair question, and I have a few answers.

  1. It’s a really good tech conference. It’s well organized, features dozens of interesting speakers, and draws over 2,000 data practitioners from around the world.
  2. I am on the program committee and had room-hosting commitments.
  3. I was presenting an intro to Kafka session to help fill a gap in the schedule.
  4. And most importantly, Current isn’t just about streaming data; it’s about data. And it’s data that makes the industry tick. Streaming and batch processing are two approaches to extracting value from data, but it’s the data that holds the value. This idea of “streaming vs. batch” is beginning to fade.

A look at some of the sessions demonstrates this. Here are a few that I couldn’t catch due to my hosting duties, but I plan to watch the replays on the Current website:

  • Datalake Rock Paper Scissors: Iceberg + Flink or Iceberg + Spark?, by Sitarama Chekuri and Ben de Vera from Bloomberg
  • Off-Label Data Mesh: A Prescription for Healthier Data, by Adam Bellemare from Confluent
  • Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lakehouse, by Frank Munz from Databricks
  • From Raw Data to an Interactive Data App in an Hour: Powered by Snowpark Python, by Vino Duraisamy from Snowflake

One session that I really enjoyed was the Streaming Solutions Showdown, moderated by Danica Fine from Confluent and featuring Holden Karau from Netflix on Apache Spark, Gordon Tai from Confluent on Flink, Matteo Merli from StreamNative on Apache Pulsar, and Sophie Blee-Goldman from Responsive on Kafka Streams. Each participant advocated for their preferred technology, which gave us all a great opportunity to compare and contrast. And, of course, Danica did an excellent job keeping everyone focused and preventing outright brawls. But what stood out to me was the presence of Holden, who is always a treat to listen to, representing Spark. It was further confirmation that the line between batch and streaming is blurring.

I saw more of the same in the Expo Hall, where I spoke with Confluent, Decodable, MongoDB, RisingWave, MinIO, and more. Many of them have already integrated with Iceberg or are planning to.

Speaking of Iceberg, let’s take a slight detour for my data streaming friends who may not be familiar with it. Iceberg is an open table format specification. Implementations of Iceberg interact with a cloud object store and provide table access to enormous amounts of data spread across many files. Essentially, it turns buckets full of data files into tables that can be queried efficiently.
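
If you’re curious what that looks like in practice, here’s a minimal sketch of querying an Iceberg table from PySpark. The catalog name, bucket, and table are made up for illustration, and it assumes the iceberg-spark-runtime package is on the Spark classpath:

```python
from pyspark.sql import SparkSession

# Minimal sketch: register a hypothetical Iceberg catalog named "demo"
# backed by an object store warehouse. Names and paths are placeholders.
spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Plain SQL over files in object storage -- Iceberg tracks the data files,
# partitions, and snapshots so the query engine doesn't have to.
spark.sql("""
    SELECT customer_id, count(*) AS order_count
    FROM demo.sales.orders
    GROUP BY customer_id
""").show()
```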

Ok, my streaming friends, I see your eyes glazing over… “This is all batch, and batch is dead”. I understand. I used to ride that “streaming is better than batch” bandwagon. But I realized that I was missing the bigger picture. All that data streaming through our Kafka clusters, whether for streaming ETL pipelines, event-driven microservices, or any of the many uses for streaming data, has to end up somewhere. And, sure, real-time data has great value, but by collecting, aggregating, and organizing our data, we can extend its life span and gain even more value down the road. That’s where Iceberg shines. That’s why so many players in the data space are integrating with it. Iceberg serves as a conduit for both fresh and historical data, making it available for vibrant batch analytics use cases.

Now, back to the conference. While there was obviously more of a focus on real-time streaming at this event than there would be at, say, Data & AI Summit, there was definitely some of both. I think we will continue to see this at every major data industry event. Batch and streaming are both powerful methods organizations can use to put data to work, making better-informed decisions and providing better user experiences.

Now, for the big news at Current 2023: Confluent unveiled their managed Flink offering, which they had hinted at last year. While not the first company to offer managed Flink, they are the largest. Not to be outdone, Decodable, another managed Flink provider, hosted the Flink Forest in the Expo Hall. It was a cool place to hang out, with live trees and backyard games.

The Current organizers did a decent job of including many other technologies in the data streaming space, including Pulsar, Materialize, Spark Streaming, and more. Still, Kafka and Flink definitely were center stage, which makes sense for this event.

Kafka and Flink work well together, and they also work well with Iceberg: Tabular’s Iceberg Sink connector for Kafka Connect handles streaming ingestion and can be used to implement various CDC patterns, as in the sketch below. Stay tuned for some upcoming posts on these subjects here.
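
As a taste, here’s a rough sketch of registering the sink connector through the Kafka Connect REST API. The topic, table, and catalog settings are hypothetical, and the exact property names may vary by connector version, so check the connector documentation before relying on this:

```python
import requests

# Hypothetical connector config: stream the "orders" topic into an
# Iceberg table. Property names follow the Tabular connector's
# iceberg.* convention but should be verified against its docs.
connector = {
    "name": "orders-iceberg-sink",
    "config": {
        "connector.class": "io.tabular.iceberg.connect.IcebergSinkConnector",
        "topics": "orders",
        "iceberg.tables": "sales.orders",   # destination table (placeholder)
        "iceberg.catalog.type": "rest",     # REST catalog assumed here
        "iceberg.catalog.uri": "https://catalog.example.com",
    },
}

# Kafka Connect's REST API listens on port 8083 by default.
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```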

I am enjoying my new adventure at Tabular and looking forward to exploring and learning about the synergies between batch and streaming technologies. There are so many exciting advances in the data industry, with faster compute, more efficient and affordable storage, ever-increasing cloud capabilities, and the patterns and architectures that are being built to take advantage of it all.

It’s a great time to be doing what we do.
