Blueprint Technologies - Data information specialists

The fast & furious: Lakehouse platform performance comparisons

By Gary Nakanelua

Performance is a defining factor in an organization's data estate strategy.

Open-source lakehouse platforms have emerged as a powerful catalyst for data-driven innovation, offering organizations a cost-effective, flexible, and scalable solution to manage and analyze diverse data. By leveraging open-source technologies, businesses benefit from a global developer community’s collective expertise and contributions, ensuring continuous improvement, feature enhancements, and up-to-date security measures. Open-source lakehouse platforms enable seamless integration with a wide array of advanced analytics and machine-learning tools, allowing organizations to extract valuable insights and make data-driven decisions at an accelerated pace. Embracing open-source lakehouse platforms facilitates an agile, adaptable approach to data management and analytics, empowering companies to stay ahead in a highly competitive and dynamic business environment.

In my experience, organizations of all sizes understand conceptually the positive impact of embracing open-source platforms, but adoption depends directly on how clearly they understand the performance of the proposed platforms in production environments. Rational business leaders want to avoid frantic phone calls at two in the morning informing them that data workloads in production have slowed to a crawl and deadlines are at risk.

In the world of open-source lakehouse platforms, three have risen in popularity: Apache Hudi, Apache Iceberg, and the Linux Foundation’s Delta Lake. Each has well-documented adoption and continued contributions by large organizations. Uber initially developed Hudi, Iceberg started at Netflix, and Delta Lake originated at Databricks. Each has its benefits and deserves a deeper evaluation of its capabilities as they pertain to an individual organization’s needs. For this article, however, I’m going to focus on a single criterion: performance.

Benchmark X

Performance measurement requires one or more benchmarks. For example, Dodge recently unveiled its new Challenger SRT Demon 170, a beast of a modern muscle car with a 0-60 MPH time (benchmark) of 1.66 seconds and a quarter mile time (benchmark) of 8.91 seconds. If you’re Vin Diesel and living life a quarter mile at a time, that’s fast living. Compare that performance against the Toyota Corolla, one of the best-selling cars in 2022. The Toyota Corolla moves from 0-60 MPH in 8.6 seconds and achieves the quarter mile in 16.85 seconds. The numbers don’t lie; the winner is clear if performance is the critical benchmark.
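
To make the comparison concrete, here is a minimal sketch of the arithmetic behind "the winner is clear". It uses only the four times quoted above; the function name speedup and the dictionary layout are my own choices for illustration.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("times must be positive")
    return baseline_seconds / candidate_seconds


# Times in seconds, as quoted in the paragraph above.
corolla = {"0-60 mph": 8.6, "quarter mile": 16.85}
demon_170 = {"0-60 mph": 1.66, "quarter mile": 8.91}

# One ratio per benchmark: Corolla time divided by Demon 170 time.
ratios = {name: speedup(corolla[name], demon_170[name]) for name in corolla}

for name, ratio in ratios.items():
    print(f"{name}: Demon 170 is {ratio:.2f}x faster than the Corolla")
```

The ratios come out to roughly 5.2x for 0-60 MPH and 1.9x for the quarter mile. The same "baseline time divided by candidate time" calculation is what sits behind statements such as "1.4x faster" in platform benchmarks.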

In the world of technology industry benchmarks, there are a few key criteria:

  1. The benchmark specification is published and accessible
  2. The code used to perform the benchmark tests is open and accessible
  3. Any datasets used in the benchmark tests are publicly available

For industry acceptance, a non-profit organization must consistently develop, evaluate and refine the benchmarks. The organization should have a diverse group of members and affiliates representing the interest of customers and the industry.

Lakehouse platforms: Benchmark drift

Lakehouse platforms, which combine the best features of data lakes and data warehouses, are typically benchmarked using a combination of industry-standard benchmarks and custom benchmarks to assess their performance, scalability, and efficiency.

The TPC-DS benchmark, introduced in 2011 by the Transaction Processing Performance Council (TPC), was designed to evaluate the performance of modern, large-scale data warehousing systems. It was positioned as a successor to the earlier TPC-H benchmark, which remains in use, and introduced a more diverse and complex set of queries to represent real-world decision support and business intelligence workloads. The history of the TPC and the details of the TPC-DS benchmark are beyond the scope of this article, but here is an excellent primer on both.

A team from UC Berkeley recently open-sourced a lakehouse benchmark that adapts the TPC-DS data warehouse benchmark specification to a lakehouse setting. Named LHBench, the benchmark consists of four tests (including the aforementioned TPC-DS), with the source code available in a public GitHub repo and the raw TPC-DS test dataset provided as Apache Parquet files in an AWS S3 bucket.

Fate of the Furious Three

With a benchmark specification, open-source code to perform the benchmark, and a dataset with which to perform it, a team set out to compare the performance of Hudi, Iceberg, and Delta Lake. They ran all tests using Apache Spark on AWS EMR 6.9.0, storing data in AWS S3 and using Delta Lake 2.2.0, Hudi 0.12.0, and Iceberg 1.1.0. They shared their findings in a white paper titled Analyzing and Comparing Lakehouse Storage Systems, released at the 2023 Conference on Innovative Data Systems Research (CIDR). The white paper contains an incredible amount of detail, and I encourage anyone reading this article to read it in its entirety.

In every test performed in the benchmark, Delta Lake was faster than Hudi and Iceberg.

For example, in data load performance from the TPC-DS benchmark, Delta Lake is slightly faster than Iceberg and nearly 10x faster than Hudi.

In query performance from the TPC-DS benchmark, Delta Lake is 1.4x faster than Hudi and 1.7x faster than Iceberg.

The load and query performance differences can be attributed to several factors including:

  • Delta Lake’s more efficient columnar compression and lower overhead for large table scans due to larger file sizes, which results in fewer files to read compared to Hudi
  • Delta Lake’s utilization of the default Spark reader, which performs better than the custom-built Parquet reader used by Iceberg

Conclusion

Organizations must weigh a variety of considerations when choosing a lakehouse platform, such as data ingestion, metadata storage, and transaction coordination, all of which will impact performance. If your organization is getting started on a lakehouse evaluation and performance is your primary deciding factor, I recommend running the LHBench benchmark against all three lakehouse platforms. If your organization is currently on Hudi or Iceberg and considering a migration to an alternative lakehouse platform, I’d recommend Delta Lake.
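
If you want to sanity-check your own workloads before or alongside a full LHBench run, a small timing harness is often enough to get started. The sketch below is not LHBench code and calls no Spark, Delta Lake, Hudi, or Iceberg APIs. The helper names (time_workload, median_seconds, relative_to_fastest) and the placeholder workloads are hypothetical. In a real evaluation, each placeholder would be replaced with a function that runs the same query or load job against one platform.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_workload(workload: Callable[[], object], repeats: int = 3) -> List[float]:
    """Run the workload `repeats` times and return each wall-clock time in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return timings


def median_seconds(results: Dict[str, List[float]]) -> Dict[str, float]:
    """Reduce each platform's list of timings to its median."""
    return {name: statistics.median(times) for name, times in results.items()}


def relative_to_fastest(medians: Dict[str, float]) -> Dict[str, float]:
    """Express each median as a multiple of the fastest one (fastest = 1.0)."""
    fastest = min(medians.values())
    if fastest <= 0:
        raise ValueError("medians must be positive")
    return {name: value / fastest for name, value in medians.items()}


def placeholder_workload(size: int) -> Callable[[], int]:
    """Hypothetical stand-in for a real query; it only sums integers."""
    return lambda: sum(range(size))


# Hypothetical labels and sizes; they say nothing about real platform performance.
workloads = {
    "platform_a": placeholder_workload(200_000),
    "platform_b": placeholder_workload(400_000),
    "platform_c": placeholder_workload(300_000),
}

results = {name: time_workload(fn, repeats=5) for name, fn in workloads.items()}
slowdowns = relative_to_fastest(median_seconds(results))
for name, factor in sorted(slowdowns.items(), key=lambda item: item[1]):
    print(f"{name}: {factor:.2f}x the fastest median")
```

Using the median of several runs keeps a single slow run (a cold cache, a noisy neighbor) from skewing the comparison. It is a deliberately simple choice, and a production-grade evaluation would also control for cluster size, data layout, and warm-up.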
