Addressing data duplication and scaling issues in graph databases 

Authors: Brian Olsen, Danfeng Xu, Jason Reid, and Weimo Liu

TL;DR

Graph databases are limited in their scalability and interoperability when they rely on specialized storage systems. This post challenges the necessity of graph-native storage systems and proposes an alternative: leveraging the physical layout and metadata of columnar Apache Iceberg tables via PuppyGraph, a graph query engine.

Sign up for Tabular and use PuppyGraph’s free single-node Docker install to get started now.

A reintroduction to graph databases

Graph databases are a valuable subset of the broader database landscape, providing an interaction model that makes it easier to reason about queries involving relationships between objects. They uniquely provide speedy traversal of densely networked data by using a physical layout that represents complex relationships as nodes and edges. Take, for example, being on a social media site like LinkedIn and wanting to find all of your second- and third-degree connections who work at a particular company. Graph databases commonly power analytics in a diverse set of applications, such as social networks, recommendation engines, and fraud analysis.
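To make the LinkedIn example concrete, here is a minimal sketch in plain Python of the kind of traversal a graph database performs. The names, the adjacency list, and the employer lookup are all hypothetical; a real graph engine would do this over an indexed physical layout rather than in-memory dictionaries.

```python
# A toy social graph as an adjacency list (hypothetical names and employers).
network = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "erin"],
    "dave": ["bob", "frank"],
}
employer = {"dave": "Acme", "erin": "Acme", "frank": "Acme"}

def connections_by_degree(graph, start, max_degree):
    """Breadth-first traversal: group every reachable person by their
    degree of connection to `start` (1st, 2nd, 3rd, ...)."""
    seen = {start}
    frontier = [start]
    by_degree = {}
    for degree in range(1, max_degree + 1):
        next_frontier = []
        for person in frontier:
            for friend in graph.get(person, []):
                if friend not in seen:
                    seen.add(friend)
                    next_frontier.append(friend)
        by_degree[degree] = next_frontier
        frontier = next_frontier
    return by_degree

degrees = connections_by_degree(network, "alice", 3)
# Second- and third-degree connections of alice who work at Acme:
acme = [p for d in (2, 3) for p in degrees[d] if employer.get(p) == "Acme"]
```

The traversal visits each neighbor frontier once, which is exactly the access pattern a graph-native layout is built to make fast.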

Here is a visual example of a graph, modeled in PuppyGraph’s user interface. This graph is a small mind map of concepts in an application: the concepts are the nodes, their attributes are stored as node properties, and the relationships between them are modeled as edges.

Zooming in on native graph storage

Despite their advantages, a 2022 Gartner report valued the graph database market at $2.6 billion, a small portion of the overall database and analytics market. For comparison, the relational database and data warehouse markets are worth $68.5 billion and $30.2 billion, respectively. The disparity between this market share and the utility of graph modeling raises an intriguing question: why haven’t graph databases captured more of the market?

One reason likely stems from the very feature said to define graph databases: native graph storage. Graph databases persist semantic data about objects and their relationships, commonly using the semantic triple (subject, predicate, object) as the atomic entity. Native graph storage implements a triplestore, or a similar specialized physical data model, to index and store triples. These storage systems aim to improve performance for graph queries, but they also restrict interoperability and scalability. Adopting a specialized storage layer requires duplicating data across the graph and source systems, and it doesn’t scale efficiently for queries that process tens of thousands of records per second. These drawbacks slow graph database adoption and create a vicious cycle: a smaller ecosystem of graph tools and a steeper learning curve for tuning and improving these systems, which further encumber adoption.
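To illustrate what a triplestore does, here is a toy sketch in Python. Real triplestores use far more sophisticated on-disk index structures; this hypothetical class only shows the core idea that every fact is a (subject, predicate, object) triple indexed multiple ways so each lookup pattern is cheap.

```python
from collections import defaultdict

class TripleStore:
    """Toy triplestore: each fact is a (subject, predicate, object)
    triple, indexed three ways so any lookup pattern is a dict access."""
    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        # Maintain all three permutations of the same fact.
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def objects(self, s, p):
        """All objects for a (subject, predicate) pattern."""
        return self.spo[s][p]

    def subjects(self, p, o):
        """All subjects for a (predicate, object) pattern."""
        return self.pos[p][o]

store = TripleStore()
store.add("alice", "works_at", "Acme")
store.add("bob", "works_at", "Acme")
store.add("alice", "knows", "bob")
```

The triple redundancy across `spo`/`pos`/`osp` is precisely the data duplication and write amplification the post discusses: fast pattern lookups are bought with extra copies of every fact.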

In this blog post, we aim to challenge some common assertions around graph databases that may be hindering their broader adoption. We will dive into innovative methods for implementing graph queries on Apache Iceberg tables over columnar Parquet files using graph semantics. We believe this fresh perspective opens up new opportunities for graph databases. 

The value of graph models

Although the adoption of graph databases pales in comparison to the broader database market, it’s crucial not to conflate this lack of interoperability and scalability with a lack of interest or utility. One of the most valuable aspects of a graph database is its domain-specific query language, which makes expressing both the data and the queries much more concise.

In fact, one of the headline additions in the ANSI SQL:2023 standard was support for graph representations and querying in SQL (SQL/PGQ). Graph databases shine in their ability to intuitively map and query interrelated data for pattern recognition in a way that mirrors our natural understanding of relationships. This becomes especially important when reasoning about and maintaining queries over complex networks embedded in the data. Graph query languages provide syntax built for scenarios where you are searching for patterns nested in the relationships between nodes. Attempting to query all but low-degree graph data using traditional SQL operations is cumbersome and verbose.
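The verbosity gap comes from how relational engines express traversals. Here is a small Python sketch, over a hypothetical flattened `edges(src, dst)` table, of why each additional hop costs another self-join in SQL, while a graph query language states the whole traversal in one short expression.

```python
# An edge table, as it might look after flattening a graph into a
# relational "edges(src, dst)" table (hypothetical values).
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]

def one_hop(edge_table, sources):
    """One self-join on the edge table: follow every edge whose source
    is in the current frontier (SELECT dst ... WHERE src IN sources)."""
    return {dst for src, dst in edge_table if src in sources}

def k_hop(edge_table, start, k):
    """Each extra hop is another self-join over the whole edge table;
    in SQL this means k chained joins or a recursive CTE, while a
    graph language expresses it as a single short traversal."""
    frontier = {start}
    for _ in range(k):
        frontier = one_hop(edge_table, frontier)
    return frontier
```

For a two-hop query you would write two joins, for three hops three joins, and so on; the graph-language equivalent stays one line no matter the depth.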

Graph query languages are advantageous for their ability to model complex traversal queries succinctly on deeply interconnected data networks. Native storage systems, by contrast, tightly couple the query and storage layers, are unproven at scale, and require learning new paradigms to adopt and maintain the system. With this understanding, let’s build a more concrete definition of the problem space.

The challenges that native graph storage presents

We’ve briefly mentioned the effects native graph storage has on graph databases. Let’s take a look at some of the challenges in more detail. 

Scaling complexities

Scaling graph databases is difficult. As the number of nodes and edges grows, so does the complexity of the relationships, which leads to increased computational overhead and challenges in horizontal scaling. Even after a graph database is implemented, scaling it out to handle more data or more complex queries remains hard: the interconnected nature of graph data means that adding more hardware does not always translate to linear performance improvements. In fact, it often necessitates rethinking the graph model or using more sophisticated scaling techniques.

Data duplication

The ETL processes required to migrate data from existing SQL data stores into graph databases add a layer of complexity. With native graph storage, these pipelines are unavoidable: they can be time-consuming and resource-intensive to set up and maintain, and they require specialized knowledge. Maintaining these processes, and the graph database the data flows into, demands ongoing attention and resources, especially as the data grows and evolves.

Graph data-modeling knowledge

Tooling for graph databases must inherently support graph operations and queries, and the overlap between graph and SQL tooling is, for the most part, non-existent. This means that many of the tools used with an organization’s current databases and infrastructure may not be directly applicable, or optimal, for graph data. This incompatibility leads to additional investment in new tools and additional hours for users to integrate and learn them.

Additionally, using graph databases demands a foundational understanding of how to map graph theory and logical modeling onto an optimal physical data layout or index. Given that graph databases are less commonly encountered in the tech industry than relational databases, this lower exposure can act as a considerable barrier to implementing an optimal solution with a traditional graph database.

Although existing graph databases offer powerful abstractions for high-degree relational queries, they carry significant implementation and scaling challenges. One way to address these shortcomings is to replace native graph storage with a storage layer that we already know scales. Luckily, Apache Iceberg handles columnar table representations using Parquet to support large-scale querying. Iceberg is a general-purpose, interoperable SQL storage layer supported by a vibrant community and ecosystem.

For those less familiar, let’s quickly cover what Apache Iceberg is, and then how to run graph queries over it.

What is Apache Iceberg?

Apache Iceberg is a high-performance table format for large analytics datasets. Iceberg addresses concerns like maintaining atomic, consistent table state across commits, tracking the columnar files associated with a given table, and enforcing the schema across those files over time. It expands the feature set for a growing number of query engines that previously relied on services like Hive Metastore and AWS Glue to track table state.
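The snapshot idea behind that atomic, consistent state can be sketched in a few lines of Python. This is a deliberately simplified toy, not Iceberg's actual metadata layout (which uses metadata files, manifest lists, and manifests); it only shows the core mechanism of immutable snapshots plus an atomically swapped "current" pointer.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: the data files it contains."""
    snapshot_id: int
    files: tuple

@dataclass
class Table:
    """Toy table format: commits create a new snapshot; the `current`
    pointer is swapped in one step, so readers always see a complete,
    consistent file list and never a half-written one."""
    snapshots: list = field(default_factory=list)
    current: int = -1

    def commit_append(self, new_files):
        prev = self.snapshots[self.current].files if self.current >= 0 else ()
        snap = Snapshot(len(self.snapshots), prev + tuple(new_files))
        self.snapshots.append(snap)
        self.current = snap.snapshot_id  # the only mutation readers observe

    def scan(self):
        """A reader resolves the current snapshot, then reads its files."""
        return self.snapshots[self.current].files

table = Table()
table.commit_append(["data-00.parquet"])
table.commit_append(["data-01.parquet", "data-02.parquet"])
```

Because old snapshots are never mutated, this design also gives you time travel for free: reading `snapshots[0]` returns the table exactly as it was after the first commit.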

Iceberg aims to bring to modern cloud infrastructure the separation of concerns that SQL has provided for databases since 1992, offering SQL-like capabilities over cheap, scalable cloud storage. By including features that enable modern data management techniques, Iceberg provides an invaluable foundation for organizations leveraging their data in the cloud who need the freedom of multiple compute engines at scale.

Two critical design goals of Iceberg are to expose table metadata that improves query performance and to avoid stealing attention; it strives to be invisible. For instance, hidden partitioning simplifies query optimization by allowing the table to manage partitions without user intervention. At the same time, partitioning the data and exposing that information to query engines allows an engine to minimize how much data it needs to read. In prior table formats, partition evolution required a table migration, whereas Iceberg lets you change the partition scheme in place without migrating any data. These modern storage-layer innovations, along with the speed and scalability of cloud architectures, are what make it possible for engines like PuppyGraph to reimagine graph databases.
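Hidden partitioning can be illustrated with a small Python sketch. The rows, the `day` transform, and the scan function are hypothetical stand-ins for what Iceberg does internally: the writer derives partition values from a raw column via a transform, and the reader's filter on that raw column is translated into partition pruning without the user ever naming a partition column.

```python
from datetime import datetime

def day_transform(ts):
    """Iceberg-style partition transform: derive the partition value
    from the raw column; users never see or manage this value."""
    return ts.date().isoformat()

# Writer side: rows are bucketed by the derived partition value.
rows = [
    {"event": "click", "ts": datetime(2024, 3, 1, 10)},
    {"event": "view",  "ts": datetime(2024, 3, 1, 23)},
    {"event": "click", "ts": datetime(2024, 3, 2, 5)},
]
partitions = {}
for row in rows:
    partitions.setdefault(day_transform(row["ts"]), []).append(row)

# Reader side: a filter on the raw ts column is translated into
# partition pruning -- only the matching partition is scanned.
def scan(parts, ts_filter):
    hit = [p for p in parts if p == day_transform(ts_filter)]
    return [r for p in hit for r in parts[p]]

result = scan(partitions, datetime(2024, 3, 1, 12))
```

The user queried on the raw timestamp, yet only the March 1 partition was read; evolving the transform (say, from daily to hourly) would change only the metadata, not the query.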

At its core, Apache Iceberg is a community-driven table format spec, with libraries written in Java, Python, and, more recently, Rust to provide a shared implementation of the spec for compute engines and ecosystem tools. But how does Iceberg help with modeling graphs for efficient graph traversal?

The power of PuppyGraph and Apache Iceberg

PuppyGraph is a graph query engine that allows developers to enable graph capabilities on SQL data stores. It supports a variety of table formats, including Apache Iceberg, Apache Hive, and Delta Lake, plus many SQL data engines. The platform provides easy integration and, within minutes, allows users to run graph query languages such as Gremlin (from Apache TinkerPop) and openCypher against their SQL data.

By separating storage and compute, PuppyGraph takes advantage of the efficiencies inherent in columnar data lakes, offering substantial performance improvements at scale. Complex graph queries, such as multi-hop neighbor searches, frequently require joining and processing a vast number of records. Columnar storage optimizes read operations, enabling swift retrieval of only the columns a query needs rather than traversing entire rows.

PuppyGraph also improves efficiency through min/max statistics and predicate pushdown, significantly reducing the amount of data scanned. Its alignment with vectorized data processing, where operations are performed on batches of values simultaneously, helps it scale effectively and respond quickly to intricate queries. Finally, its automatically partitioned, distributed computing framework processes extensive datasets efficiently, ensuring robust scalability in both storage and computation.
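Min/max pruning is simple to show in miniature. The file names, column bounds, and `prune` function below are hypothetical; they only demonstrate the idea that per-file statistics let an engine skip entire data files whose value range cannot match a predicate, before reading a single row.

```python
# Hypothetical per-file statistics, as a table format might expose them
# to the query engine (one entry per Parquet data file).
files = [
    {"path": "f1.parquet", "min_age": 18, "max_age": 29},
    {"path": "f2.parquet", "min_age": 30, "max_age": 49},
    {"path": "f3.parquet", "min_age": 50, "max_age": 72},
]

def prune(file_stats, lo, hi):
    """Keep only files whose [min, max] range can satisfy the predicate
    `lo <= age <= hi`; everything else is skipped without being read."""
    return [f for f in file_stats if f["max_age"] >= lo and f["min_age"] <= hi]

candidates = prune(files, 35, 40)
```

For the predicate `age BETWEEN 35 AND 40`, only one of the three files survives pruning; the engine then reads just that file's relevant columns, which is where the columnar and vectorized gains described above kick in.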

PuppyGraph facilitates direct querying of your data in graph format, bypassing the need to duplicate or transfer SQL source data to a graph database target. It eliminates the construction and maintenance of labor-intensive ETL pipelines typically required in traditional graph database configurations. It also simplifies data management by leveraging your current data store permissions, since there is no second copy of the data. The cherry on top is that PuppyGraph operates within your own data center or cloud infrastructure, giving you complete control while ensuring compliance with any data governance policies you need to adhere to.

For individuals familiar with SQL and venturing into graph databases for the first time, PuppyGraph simplifies the process of data preparation, aggregation, and management by utilizing the data lake and tools they are already comfortable with. This design allows users to bypass the complexities of graph query languages for regular tasks, reserving these languages solely for specific graph-related inquiries like graph traversals. By streamlining these processes, PuppyGraph not only significantly reduces the learning curve but also boosts operational efficiency.

PuppyGraph enables companies to use their SQL data stores as they normally would, while reaping the benefits of graph-specific use cases such as complex pattern matching and efficient pathfinding. It avoids the additional complexity and resource consumption of maintaining a separate graph database and the associated ETL pipelines.

Get started for free

You can try PuppyGraph and Iceberg free of charge by using Tabular’s PuppyGraph connector.
Sign up for Tabular and use PuppyGraph’s free single-node Docker install to get started now.