Benefit from efficient data processing with Iceberg

An iceberg is a piece of ice that has broken off from a glacier and floats freely in open water, a definition that also makes a fitting name for the Apache Iceberg table format project.

Open-source Apache Iceberg provides full database functionality on top of cloud object stores. It illustrates how the separation of storage and compute in modern data stacks has enabled scalable, cost-effective computing and better interoperability between different systems.

Two engineers at Netflix Inc. created Iceberg to overcome the challenges they faced using existing data lake technologies such as Apache Spark or Apache Hive. The engineers needed a way to navigate the massive streaming media files their employer stored on Amazon S3.

“We had the same problems as everyone else, but ten times worse,” recalled Ryan Blue, co-founder and CEO of Tabular Technologies Inc. and co-creator of Apache Iceberg. “Each query to (Amazon’s) S3 was not seven milliseconds, it was 70 milliseconds … so all the things you had to do very quickly to make sure your database isn’t lying to you, we couldn’t do that anymore very quickly. This problem therefore had to be resolved.”

Blue spoke with George Gilbert, senior analyst at theCUBE Research, in the latest episode of “The Road to Intelligent Data Applications,” theCUBE’s ongoing conversation about the Sixth Data Platform, an emerging framework in which key vendors include Databricks Inc., Snowflake Inc., Amazon Web Services Inc., Microsoft Azure and Google LLC. He was joined by Dain Sundstrom, CTO of Starburst Data Inc. and co-creator of Trino and Presto. They discussed the evolution and importance of separating storage from compute in modern data stacks.

Iceberg applied database fundamentals to a world of object stores

The problem Blue and his colleagues were trying to solve was that data lake technologies like Hive and Spark were tied to particular engines and, at times, to a single vendor. Because Iceberg lets any compute engine work against the same tables, users can pair it with whichever analytics engine they choose. Major technology players such as Netflix, Amazon Web Services, Snowflake and Databricks have widely adopted the open-source tool.

“We had to go look at the Hive tables and say, ‘You know what? This model of tracking what is in our table is too simplistic; it won’t work in a world based on object stores,’” Blue said. “What if we applied the fundamentals of databases? We designed according to the constraints we were working with.”
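
Blue’s contrast between tracking a table by listing what sits in storage and tracking it the way a database would can be illustrated with a toy model. The Python sketch below is an assumption-laden illustration, not Iceberg’s actual metadata layout or API: the names ToyTable, Snapshot, CommitConflict and hive_style_files are hypothetical, and the real format is considerably richer. It shows only the general idea that each table state is an explicit, immutable list of files, and that a commit succeeds only if nobody else changed the table first.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One immutable table state: an explicit list of data files."""
    snapshot_id: int
    data_files: tuple[str, ...]


class CommitConflict(Exception):
    """Raised when a writer tries to commit on top of a stale snapshot."""


@dataclass
class ToyTable:
    """Hypothetical table whose contents are defined by metadata, not by listing."""
    snapshots: list[Snapshot] = field(default_factory=lambda: [Snapshot(0, ())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit_append(self, base: Snapshot, new_files: list[str]) -> Snapshot:
        # Optimistic concurrency: reject the commit if the table has moved on
        # since this writer read it.
        if base.snapshot_id != self.current().snapshot_id:
            raise CommitConflict("table changed since snapshot was read")
        snap = Snapshot(base.snapshot_id + 1, base.data_files + tuple(new_files))
        self.snapshots.append(snap)
        return snap


def hive_style_files(object_store: set[str], table_prefix: str) -> list[str]:
    """Listing model: the table is whatever currently sits under a prefix."""
    return sorted(key for key in object_store if key.startswith(table_prefix))


# Simulated object store holding one committed file.
store = {"warehouse/events/part-0.parquet"}
table = ToyTable()
first = table.commit_append(table.current(), ["warehouse/events/part-0.parquet"])

# A writer uploads a second file but has not committed it yet.
store.add("warehouse/events/part-1.parquet")

# The listing model already exposes the uncommitted file to readers...
print(hive_style_files(store, "warehouse/events/"))
# ...while the snapshot model still shows only committed data.
print(table.current().data_files)

# After the commit, new readers see both files; the old snapshot is unchanged.
table.commit_append(first, ["warehouse/events/part-1.parquet"])
print(table.current().data_files)
print(first.data_files)
```

In this sketch a reader never has to list the object store to learn what is in the table, which is one way to avoid paying the per-request latency Blue describes for every partition.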

For Sundstrom, the impetus for the open-source Trino distributed query engine was the need to replace Facebook Inc.’s 300-petabyte Hive data warehouse. The goal was to enable fast, ad hoc analytical queries against big data file systems.

“(Hive) was a great way to allow less skilled engineers to interact with the huge data sets that Facebook had,” Sundstrom said. “The problem was, it sucked. So we decided to create a much more powerful distributed system using traditional database techniques.”

As the commercial developer of the Trino-based distributed query engine, Starburst has taken steps in recent months to make it easier for organizations to build applications on data lake architectures. In November, the company released a set of new features that provide unified data ingestion, governance and sharing on a single platform.

“In general, we find ourselves faced with this problem: there is simply too much data to reasonably process; queries are too large and (customers) want to move to a more cost-effective solution,” Sundstrom said. “Often people get started with Starburst by simply plugging it in and exploring data from their existing platform because Trino supports federation.”

Starburst Launches Icehouse Data Lake for Data Ingestion

In April, Starburst announced that it would release a fully managed Icehouse data lake on its cloud. Icehouse combines Trino and Iceberg storage to support near real-time data ingestion into managed, petabyte-scale Iceberg tables.

“You can explore your data (and) play with it,” Sundstrom said. “When you want to achieve optimal performance, you export it. We recommend exporting to Iceberg. Of everything that exists today, this is the best in terms of data lake formats, in my opinion.”

Iceberg’s modern table format for analytics continues to attract business interest. Google, Snowflake and Databricks have all announced support for Iceberg, according to Blue. During its earnings call in late February, Snowflake noted that customer adoption of Iceberg tables could create “revenue headwinds” for the company.

“Iceberg has two things that other formats lack in some ways,” Blue said. “The first is a solid technical foundation. The other is this open community, where it’s owned and controlled by the Apache Software Foundation. We really wanted this project to be a foundational layer of data architecture, and we knew we needed a neutral community, a specification, to resolve any issues.”

Here is the full conversation, part of “The Road to Intelligent Data Applications” series:

Image: Getty Images-Matthias Kulka
