Don’t build an ETL application for a market of one.
As someone once famously said (?), “You can’t query what you can’t ingest.” And usually what you’re trying to ingest into Tabular / Apache Iceberg tables are large numbers of files kept in buckets in your object store (Amazon S3, Google Cloud Storage).
Continual file ingestion (vs. one-off jobs) is a necessary but frustrating data engineering nuisance, and a challenge at speed and scale. Usually you write a Spark job that runs periodically to copy in the latest data.
Seems simple enough, but the devil is in the details. You have to ensure you’re not duplicating or dropping data across batches. To do that you have to orchestrate jobs to ensure that the last job has completed as expected. That’s not really a problem if you’re running things nightly, but if your users need lower latency – hours or minutes – it gets more complex as job runs occasionally overlap due to system failures or an unexpected peak in volume.
Also, your job will have to handle schema changes, either by alerting you to update your mapping or handling new fields in some standard way. Lastly, you need to convert files from their source format into a common format such as Parquet.
In short, we have these challenges:
- Exactly-once delivery
- Job orchestration across time (for low latency)
- Handling schema evolution
- File format conversion
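To make the first two challenges concrete, here is a minimal sketch of the kind of hand-rolled checkpointing these jobs tend to grow. Everything in it is hypothetical: `Checkpoint`, `run_batch`, and the in-memory `sink` are stand-ins for illustration, not part of Spark, Iceberg, or Tabular. A real job would also need to persist the checkpoint durably and commit it atomically with the write.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Hypothetical checkpoint: the set of object keys already loaded."""
    loaded: set[str] = field(default_factory=set)
    running: bool = False  # crude guard against overlapping runs


def run_batch(listing: list[str], checkpoint: Checkpoint, sink: list[str]) -> int:
    """Load every key in `listing` that the checkpoint has not seen yet.

    Returns the number of newly loaded files. `sink` stands in for the
    target table. In a real job the write and the checkpoint update must
    be committed atomically, otherwise a crash between them duplicates
    or drops data.
    """
    if checkpoint.running:
        return 0  # previous run still in flight: skip rather than double-load
    checkpoint.running = True
    try:
        new_keys = sorted(k for k in set(listing) if k not in checkpoint.loaded)
        sink.extend(new_keys)               # "write" the data
        checkpoint.loaded.update(new_keys)  # record what was written
        return len(new_keys)
    finally:
        checkpoint.running = False
```

Running it twice over an overlapping bucket listing loads each file only once, and a run that starts while another is marked as in flight does nothing. That is the easy part; making it survive failures is where the time goes.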
Essentially you’re building an ETL application, albeit one with an extremely narrow market: your data consumers. But don’t feel bad, because you’re not alone. Thousands of these bespoke apps are being built and maintained, with everyone for the most part reinventing the same wheel. That’s a lot of data engineering time being spent on fairly low-value work.
Tabular File Loader - Exactly-Once, Low-Latency, No-Code Ingestion
To avoid this becoming an issue with ingesting files into Tabular for use as Iceberg tables, we built a near-real-time file ingestion service called Tabular File Loader.
There are five cool things about File Loader you should know:
- It operates on a 5-minute micro-batch basis, so you can get low latency with no headaches. This is a fixed setting currently, but we will be making the batch window and maximum size configurable in the near future.
- It provides exactly-once semantics, relieving you of hand-crafting checkpoints and running dedupe jobs.
- It optimizes the files in the target Iceberg table while it is ingesting the data, so you get exceptional out-of-the-box performance improvements versus a “dumb” file load. These improvements can reduce the size of the data by up to 80%. Performance is optimized based on the unique attributes of each table.
- It handles schema evolution, so new columns are automatically added to the target table, and it can coerce field types where it makes sense. It can even infer the initial schema based on what it detects in the source files. If a column is dropped in the source, the column is kept in the target and populated with NULLs.
- File Loader is serverless, so you don’t have to concern yourself with sizing and adapting your cluster, which can be especially vexing when data volumes swing wildly due to daily or seasonal externalities.
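The schema evolution rules in the list above (new columns are added, dropped columns are kept and filled with NULLs) can be sketched in a few lines. This is only an illustration of the described behavior, not File Loader's implementation: `evolve_schema` and `align_row` are hypothetical names, schemas are plain dicts of column name to type string, and type coercion is left out entirely.

```python
def evolve_schema(target: dict[str, str], source: dict[str, str]) -> dict[str, str]:
    """Return the target schema with any new source columns appended.

    Columns that are missing from the source are kept, never dropped.
    Existing column types are left untouched (no coercion in this sketch).
    """
    evolved = dict(target)  # copy so the caller's schema is not mutated
    for name, col_type in source.items():
        evolved.setdefault(name, col_type)
    return evolved


def align_row(row: dict, schema: dict[str, str]) -> dict:
    """Shape a source row to the schema, filling absent columns with None (NULL)."""
    return {name: row.get(name) for name in schema}
```

For example, if the target table has `id` and `email` and a new file arrives with `id` and `country`, the evolved schema has all three columns, and rows from the new file carry `None` for `email`.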
Other things you should know about File Loader:
- It supports Parquet, CSV, TSV, and JSON file formats, including complex data structures such as nested fields and arrays.
- You can configure jobs no-code (using the Tabular web interface) or with API calls, which allow for programmatic use and software development best practices such as revision control and change tracking.
- It operates at a high – albeit not infinite – scale. It has no problem keeping up with velocities as large as 10,000 files or 100 GB per minute. For ingestion at a greater scale you should consider using event streaming via Kafka Connect or Flink.
- It infers Iceberg partitions based on Hive directory structures, which can greatly ease migration from Hive to Iceberg by allowing you to mirror your Hive tables easily.
- It takes advantage of Tabular’s centralized role-based access control (RBAC) so only users with the proper permissions can write data to a given table.
- Ingestion job monitoring and tracking information is exposed via an OpenMetrics API so that it can easily integrate into your existing observability tooling.
- Tabular pricing is pay-as-you-go and based solely on the amount of data ingested, which makes the cost transparent and easy to estimate.
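On the Hive partition inference mentioned in the list above: Hive lays out partitions as `key=value` directory segments, so the partition values for a file can be read straight off its object key. The sketch below shows that idea only. `hive_partitions` is a hypothetical helper, the example paths are made up, and how File Loader actually maps these values to Iceberg partition specs is not covered here.

```python
from pathlib import PurePosixPath


def hive_partitions(key: str) -> dict[str, str]:
    """Extract Hive-style key=value directory segments from an object key.

    Only directory segments are inspected; the file name itself is ignored.
    Values are returned as strings, exactly as they appear in the path.
    """
    partitions: dict[str, str] = {}
    for segment in PurePosixPath(key).parent.parts:
        name, sep, value = segment.partition("=")
        if sep and name:  # keep only segments shaped like name=value
            partitions[name] = value
    return partitions
```

With this, a key such as `events/year=2023/month=07/part-0001.parquet` yields `year` and `month` as partition columns, while a file sitting directly under `events/` yields none.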
Hopefully this has convinced you that you never have to manually load files to Iceberg again! If you’d like to try out File Loader, sign up for a Tabular free tier account and give it a whirl on our sample data or your own S3 bucket. We also offer the following resources to help: