The Data Stack Show

185: The Evolution of Data Processing, Data Formats, and Data Sharing with Ryan Blue of Tabular

This week on The Data Stack Show, Eric and Kostas chat with Ryan Blue, Co-Founder and CEO of Tabular, creator of Iceberg, and formerly of Cloudera and Netflix. Ryan discusses the challenges of managing large-scale data and how they led to the development of Iceberg, a new table format. He explains Iceberg's benefits, such as automatic partitioning and improved metadata management, which simplify data engineers' work and improve query performance. The conversation covers the importance of atomicity in analytics systems, the scalability of Iceberg, and the trade-offs in mixed workload environments. Ryan also addresses differences in cloud object storage performance and the integration of security and access controls into distributed file systems. Finally, he touches on recent Iceberg updates, including Python and Rust support and the view support anticipated in the upcoming release.

Broadcast on: 10 Apr 2024

Highlights from this week’s conversation include:

  • The Evolution of Data Processing (2:36)
  • Ryan’s Background and Journey in Data (4:52)
  • Challenges in Transitioning to S3 (8:47)
  • Impact of Latency on Query Performance (11:43)
  • Challenges with Table Representation (15:26)
  • Designing a New Metadata Format (21:36)
  • Integration with Existing Tools and Open Source Project (24:07)
  • Initial Features of Iceberg (26:11)
  • Challenges of Manual Partitioning (31:49)
  • Designing the Iceberg Table Format (37:31)
  • Trade-offs in Writing Workloads (47:22)
  • Database Systems and File Systems (55:00)
  • Vendor Influence on Access Controls (1:01:58)
  • Restructuring Data Security (1:03:39)
  • Delegating Access Controls (1:07:22)
  • Column-level Access Controls (1:14:19)
  • Exciting Releases and Future Plans (1:17:47)
  • Centralization of Components in Data Infrastructure (1:25:37)
  • Fundamental Shift in Data Architecture (1:28:28)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.