November 8, 2024

Why Open Table Format Architecture is Essential for Modern Data Systems

By Pratik Datta

The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. Though basic and easy to use, traditional table storage formats struggle to keep up. Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms.

In this blog, we will discuss:

  • What is an Open Table Format (OTF)?

  • Why should we use it?

  • A brief history of OTFs

  • A comparative study of the major OTFs

What is an Open Table Format?

Open table formats can be thought of as wrappers or encapsulations around your data storage. They use a set of files to:

  • Record schema and partition (DDL) changes in your tables.

  • Track data files within the table along with their column statistics.

  • Log all Inserts, Updates, and Deletes (DML) applied to the table.

Open table formats enable efficient data management and retrieval by storing these files chronologically, with a history of DDL and DML actions and an index of data file locations.

The most widely used open table formats today are Apache Iceberg, Delta Lake, and Apache Hudi. These formats are transforming how organizations manage large datasets: built on open standards, they offer immense flexibility for both analytical and transactional processing.

Why are They Essential?

Adopting an Open Table Format architecture is becoming indispensable for modern data systems.

Schema Evolution

Data structures are rarely static in fast-moving environments. Open Table Formats like Apache Iceberg and Delta Lake support schema evolution, enabling organizations to modify table schemas (adding, removing, or renaming columns) without disrupting downstream processes or creating compatibility issues.

Below is an example using an Iceberg table.

First, we create an Iceberg table in Snowflake and then insert some data.
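
The original statements are not reproduced here, but a minimal sketch of this step might look like the following; the external volume name, base location, and the initial c_custkey range are illustrative assumptions:

-- Create a Snowflake-managed Iceberg table (external volume and base location are hypothetical)
CREATE ICEBERG TABLE ib_customers2 (
    c_custkey    INT,
    c_name       STRING,
    c_address    STRING,
    c_nationkey  INT,
    c_phone      STRING,
    c_acctbal    DECIMAL(12,2),
    c_mktsegment STRING,
    c_comment    STRING
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_ext_vol'
BASE_LOCATION = 'ib_customers2/';

-- Seed the table from the TPC-H sample data (range chosen for illustration)
INSERT INTO ib_customers2
SELECT *
FROM snowflake_sample_data.tpch_sf1.customer
WHERE c_custkey <= 20000;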

Then, we add another column called HASHKEY, insert more data, and locate the S3 file containing the metadata for the Iceberg table.
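
A hedged sketch of this step: the column is added with a plain ALTER, more rows are inserted (the range and placeholder HASHKEY value are illustrative), and Snowflake's SYSTEM$GET_ICEBERG_TABLE_INFORMATION function is one way to find the current metadata file location:

-- Evolve the schema by adding the new column
ALTER ICEBERG TABLE ib_customers2 ADD COLUMN HASHKEY STRING;

-- Insert more data, now populating the new column (values are illustrative)
INSERT INTO ib_customers2
SELECT *, '00000000000000' AS HASHKEY
FROM snowflake_sample_data.tpch_sf1.customer
WHERE c_custkey BETWEEN 20001 AND 30000;

-- Returns JSON that includes the S3 path of the table's current metadata file
SELECT SYSTEM$GET_ICEBERG_TABLE_INFORMATION('ib_customers2');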

If we now inspect the metadata file of the table, we see that there are two schemas: the initial "schema-id": 0 is associated with the initial snapshot ("sequence-number": 1), and the altered "schema-id": 1 is associated with the latest snapshot ("sequence-number": 2).
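
For reference, an abridged and purely illustrative excerpt of such a metadata file is shown below; apart from the schema IDs and sequence numbers discussed above, all values are made up and most fields are elided:

{
  "format-version" : 2,
  "current-schema-id" : 1,
  "schemas" : [
    { "schema-id" : 0, "fields" : [ ... ] },
    { "schema-id" : 1, "fields" : [ ... ] }
  ],
  "snapshots" : [
    { "sequence-number" : 1, "snapshot-id" : 1111111111111111111, "schema-id" : 0, ... },
    { "sequence-number" : 2, "snapshot-id" : 2222222222222222222, "schema-id" : 1, ... }
  ]
}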

Hence, the metadata files record schema and partition changes, enabling systems to process data with the correct schema and partition structure for each relevant historical dataset.

Data Versioning and Time Travel

Open Table Formats empower users with time travel capabilities, allowing them to access previous dataset versions. This enables analysts and data engineers to “go back in time” and investigate how data looked at specific points, a critical feature for industries with stringent audit requirements, such as finance, healthcare, and e-commerce.

We will use the same example as above to explain. The metadata file for the Iceberg table retains the snapshot history. Each snapshot has its own manifest list that keeps track of the data files associated with that snapshot, so any snapshot can be restored or queried whenever needed.
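
As a hedged illustration of querying an earlier version: Snowflake exposes this through its standard AT clause, while engines such as Spark SQL can address an Iceberg snapshot directly. The timestamp and snapshot ID below are made up:

-- Snowflake: query the table as of an earlier point in time
SELECT COUNT(*) FROM ib_customers2
AT(TIMESTAMP => '2024-11-01 10:00:00'::TIMESTAMP_LTZ);

-- Spark SQL with Iceberg: query a specific snapshot
SELECT COUNT(*) FROM ib_customers2 VERSION AS OF 1234567890123456789;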

Versioning also ensures a safer experimentation environment, where data scientists can test new models or hypotheses on historical data snapshots without impacting live data.

Note: Cloud data warehouses like Snowflake and BigQuery already have a built-in time travel feature. However, this feature becomes an absolute must-have if you are operating your analytics on top of your data lake or lakehouse.

Data Consistency and ACID Compliance

Open Table Formats support ACID (Atomicity, Consistency, Isolation, Durability) transactions, ensuring that data remains consistent across multiple processes. This feature is essential in environments where multiple users or applications access, modify, or analyze the same data simultaneously. By embracing Open Table Formats with ACID capabilities, organizations can ensure the integrity of their data and enable concurrent processing without compromising quality.

We will use the same Iceberg table from the above two examples to explain.

In the example below, we submit two DML (INSERT) statements at the same time.

The first insert statement loads data with c_custkey between 30001 and 40000:

INSERT INTO ib_customers2
SELECT *, '11111111111111' AS HASHKEY
FROM snowflake_sample_data.tpch_sf1.customer
WHERE c_custkey BETWEEN 30001 AND 40000;

The second insert statement loads data with c_custkey between 40001 and 50000:

INSERT INTO ib_customers2
SELECT *, '22222222222222' AS HASHKEY
FROM snowflake_sample_data.tpch_sf1.customer
WHERE c_custkey BETWEEN 40001 AND 50000;

When we query the data in the table, we can observe that the relevant records from both inserts have been properly loaded.
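
A simple verification query along these lines makes the point; both concurrent inserts are committed, with each batch identifiable by its HASHKEY value:

SELECT HASHKEY,
       COUNT(*)       AS row_count,
       MIN(c_custkey) AS min_custkey,
       MAX(c_custkey) AS max_custkey
FROM ib_customers2
GROUP BY HASHKEY
ORDER BY HASHKEY;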

Multi-Platform Compatibility and Interoperability

A hallmark of Open Table Formats is their compatibility with multiple processing engines. Platforms like Apache Spark, Trino, and Snowflake can seamlessly read from and write to data stored in an OTF, enabling data engineers and scientists to utilize their preferred tools without compatibility concerns. This flexibility allows organizations to diversify their data ecosystems without vendor lock-in, ensuring they can leverage the best tools available for each task.

Interoperability is a game-changer for organizations working with diverse technology stacks. It empowers them to migrate data, adopt new analytics tools, or scale their infrastructure without reformatting or restructuring data, preserving time and resources.
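
As a rough sketch of what this looks like in practice, the same Iceberg table can be queried from different engines, assuming each engine is wired up to a catalog that points at the table's metadata; the catalog and namespace names below are illustrative:

-- Spark SQL
SELECT COUNT(*) FROM my_catalog.analytics.ib_customers2;

-- Trino
SELECT COUNT(*) FROM iceberg.analytics.ib_customers2;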

Storage and Query Optimization

An Open Table Format architecture typically includes optimizations for data storage and access. For instance, partition pruning, data skipping, and columnar storage formats (like Parquet and ORC) allow efficient data retrieval, reducing scan times and query costs. This is invaluable in big data environments, where unnecessary scans can significantly drain resources.

Partitioning and clustering features inherent to OTFs allow data to be stored in a manner that enhances query performance. This is particularly beneficial in complex analytical queries, where processing smaller, targeted segments of data results in quicker and more efficient query execution.
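
A minimal sketch of this idea, written as Spark SQL against a hypothetical Iceberg table: the table is partitioned by a transform of the date column, so the filter below scans only the matching partitions, and the per-file column statistics kept in the metadata let the engine skip files whose value ranges cannot match.

-- Partition by month using an Iceberg partition transform
CREATE TABLE lake.sales.orders (
    order_id   BIGINT,
    order_date DATE,
    amount     DECIMAL(10,2)
)
USING iceberg
PARTITIONED BY (months(order_date));

-- Only the January-March 2024 partitions are scanned (partition pruning and data skipping)
SELECT SUM(amount)
FROM lake.sales.orders
WHERE order_date BETWEEN DATE '2024-01-01' AND DATE '2024-03-31';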

Cost Efficiency and Scalability

Open Table Formats are designed to work with cloud storage solutions like Amazon S3, Google Cloud Storage, and Azure Blob Storage, enabling cost-effective and scalable storage solutions. Additionally, the optimized query execution and data pruning features reduce the compute cost associated with querying large datasets.

Scaling data infrastructure while maintaining efficiency is one of the primary challenges of modern data architecture. By reducing unnecessary compute cycles and enabling efficient data pruning, OTFs significantly lower costs, making it feasible for organizations to scale without disproportionately increasing expenses.

Open Source Flexibility and Community Support

Open Table Formats are part of the open-source ecosystem, benefiting from a wide, active developer community. This fosters a high rate of innovation as enhancements and new features are continually added. Open-source tools also offer a high level of customization, allowing organizations to tailor OTFs to meet their specific needs.

With the collective power of the open-source community, Open Table Formats remain at the cutting edge of data architecture, evolving to support emerging trends and addressing the limitations of previous systems.

Streaming and Real-Time Data Processing

As organizations increasingly demand real-time data insights, Open Table Formats offer strong support for streaming data processing, allowing organizations to seamlessly merge real-time and batch data. Apache Hudi, for example, provides efficient upsert (update and insert) capabilities essential for real-time data ingestion pipelines. Delta Lake’s optimistic concurrency control further facilitates handling high-velocity data streams, providing low-latency insights into critical business metrics.
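
For example, an upsert of late-arriving or changed records can be expressed with a MERGE statement, which both Hudi and Delta Lake support through engines like Spark SQL; the table and column names here are hypothetical:

MERGE INTO events_hudi AS t
USING staged_events AS s
  ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;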

Evolution of Open Table Formats

Here’s a timeline that outlines the key moments in the evolution of open table formats:

2008 - Apache Hive and Hive Table Format

Facebook introduced Apache Hive, and with it one of the first table formats, as part of its data warehousing infrastructure built on top of Hadoop. The Hive format helped structure and partition data within the Hadoop ecosystem, but it had limitations in terms of flexibility and performance.

2013 - Apache Parquet and ORC

These columnar storage formats were developed to optimize storage and speed within distributed storage and computing environments. Parquet (developed by Cloudera and Twitter) and ORC (developed by Hortonworks and Facebook) improved data storage efficiency and query performance, becoming popular formats for large data workloads. However, they lacked support for transactional features, like updates and ACID compliance.

2016 - Apache Hudi

Originally developed at Uber, Hudi introduced a format that allowed upserts (inserts and updates) on data lakes, supporting use cases with frequent data modifications, such as CDC (Change Data Capture). It provided ACID transactions and built-in support for real-time analytics.

2017 - Apache Iceberg

Developed by Netflix, Iceberg addressed challenges like managing large datasets, schema evolution, and time travel (the ability to query historical data). It supported ACID transactions and introduced features to optimize data lakes, like handling large-scale partitioning and simplifying metadata management. It can also be integrated into major data platforms like Snowflake. 

2019 - Delta Lake

Databricks released Delta Lake as an open-source project. Built on Parquet, it added support for ACID transactions, data versioning, schema enforcement, and time travel. Delta Lake became popular for making data lakes more reliable and easy to manage.

2021 - Iceberg and Delta Lake Gain Traction in the Industry

Apache Iceberg, Hudi, and Delta Lake continued to mature with support from major cloud providers, including AWS, Google Cloud, and Azure. These formats became central to modern data lakehouse architectures, which blend data lakes’ flexibility with data warehouses’ reliability.

2023 - Lakehouse Table Format Unification Efforts

In recent years, discussions on standardizing open table formats have emerged to improve interoperability across platforms and cloud providers. Delta Lake UniForm has been launched, allowing users to read data stored in Delta as if it were Apache Iceberg or Apache Hudi. Apache XTable (formerly OneTable) is a cross-table converter that supports widely adopted open-source table formats such as Apache Hudi, Apache Iceberg, and Delta Lake.

These open table formats drive innovation in big data and data warehousing.

A Comparative Analysis of Major OTFs

Below is a brief comparison of the major OTFs today (Iceberg, Delta Lake, and Hudi) and the “OG” OTF, Hive:

| Feature | Apache Hive | Apache Iceberg | Delta Lake | Apache Hudi |
|---|---|---|---|---|
| Transaction support (ACID compliant) | Limited | Yes | Yes | Yes |
| File format support | Parquet, ORC, Avro, and more | Parquet, ORC, Avro | Parquet | Parquet, ORC, Avro |
| Schema evolution | Partial | Full | Partial | Full |
| Platform interoperability | Yes, but difficult | Yes | Yes | Partial |
| Time travel support | No | Yes | Yes | Yes |
| Concurrency management | Pessimistic locking | Optimistic locking | Optimistic locking | Optimistic locking |
| Cost efficiency and scalability | No | Yes | Yes | Yes |
| Real-time data processing | No | Yes | Yes | Yes |

Closing

Open Table Format Architecture is a cornerstone for modern data systems, offering unmatched flexibility, scalability, and interoperability. Investing in open table formats unlocks richer, faster insights and fosters a collaborative data culture where teams can experiment, build, and iterate without barriers. The modern data landscape is complex, but companies can use the right architecture to turn that complexity into their greatest strength. 

Are you ready to advance your data strategy? Contact phData for guidance on implementing Open Table Format Architecture, optimizing data storage, and ensuring seamless integration across platforms.

FAQs

What is a data lakehouse?

A data lakehouse integrates the best features of a data lake and a data warehouse, creating a hybrid architecture that can manage structured and unstructured data using open data formats and allows users to access the data with any tool. Data lakehouses also support ACID transactions, ensuring data integrity and reliability. They combine the strengths of schema-on-write and schema-on-read approaches, enabling the management of both structured and unstructured data for sophisticated real-time analytics. With the introduction of SQL capabilities, they are accessible to users who are accustomed to querying relational databases.

What is an external table?

An external table is a database table that provides access to data stored outside of the database, often in files located in external storage like cloud-based data lakes (e.g., Amazon S3, Azure Data Lake, or Google Cloud Storage). Unlike traditional tables, where data is managed directly within the database system, external tables allow users to query data in its original storage location without importing it into the database.
