Module 15: Data Lake Design for Multimodal Soil Information

Implement Apache Iceberg or Delta Lake for managing petabyte-scale soil data with ACID transactions. Optimize for both batch training and real-time inference workloads.

The course objective is to design and implement a modern data lakehouse capable of managing petabyte-scale, multimodal soil information with the reliability of a traditional data warehouse. Students will master open table formats like Apache Iceberg to provide ACID transactions, schema evolution, and time travel capabilities on top of cloud object storage. The course will focus on building a unified architecture optimized for both large-scale batch model training and low-latency, real-time inference workloads.

Context: This module is the capstone of the data engineering portion of the Foundation Phase. It provides the central storage architecture that the Kubernetes compute clusters from Module 14 will rely on. This is where the "Global Soil Data Commons" transitions from a concept to a concrete implementation. The reliable, scalable, and queryable data lake built here will serve as the single source of truth for all subsequent modeling, analysis, and application development in the curriculum.


Hour 1-2: The Data Swamp and the Rise of the Lakehouse 🐊

Learning Objectives:

  • Understand the limitations of traditional data lakes and why they often devolve into "data swamps."
  • Grasp the "Lakehouse" paradigm: combining the low-cost scalability of a data lake with the reliability and performance of a data warehouse.
  • Learn how open table formats like Apache Iceberg and Delta Lake enable this paradigm.

Content:

  • The Problem with "Just a Bunch of Files": A classic data lake (e.g., folders of Parquet files in Amazon S3) suffers from critical flaws:
    • No ACID Transactions: A failed write job can leave the data in a corrupted, inconsistent state.
    • No Schema Enforcement: Different jobs can write data with different schemas, leading to chaos.
    • Slow Performance: Query planning has to list millions of files in object storage, which is slow and expensive.
  • The Lakehouse Solution: We'll introduce open table formats (Iceberg/Delta) as a metadata layer that sits on top of open file formats (Parquet/ORC) in low-cost cloud object storage. This brings database-like features to the data lake.
  • Key Features that Fix the Swamp:
    • ACID Transactions: Guarantee data integrity and consistency.
    • Schema Evolution: Safely change a table's schema without rewriting all the data.
    • Time Travel: Query the exact state of your data at a previous point in time, ensuring reproducibility.

Practical Exercise:

  • Using Apache Spark, write a script that attempts to write a large Parquet dataset to a directory.
  • Manually kill the job halfway through.
  • Observe the corrupted output: a mix of temporary files and partial data that makes the entire dataset unusable. This demonstrates the problem that table formats solve.
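
A minimal sketch of such a write job, assuming a local Spark session and a synthetic DataFrame (the row count and output path are placeholders, not part of the exercise spec):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("parquet-failure-demo").getOrCreate()

# Synthetic sensor readings, sized so the write runs long enough to interrupt.
df = (spark.range(0, 200_000_000)
      .withColumn("sensor_id", (F.col("id") % 1000).cast("string"))
      .withColumn("moisture", F.rand()))

# Plain Parquet write. Kill the job (Ctrl+C) partway through, then inspect the
# output directory: _temporary folders and partial part-files are left behind.
df.repartition(400).write.mode("overwrite").parquet("/tmp/soil_plain_parquet")
```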

Hour 3-4: Apache Iceberg: A Deep Dive into the Architecture 🧊

Learning Objectives:

  • Understand the multi-layer metadata architecture of an Apache Iceberg table.
  • Create, write to, and read from your first Iceberg table using Apache Spark.
  • Demonstrate Iceberg's transactional guarantees.

Content:

  • How Iceberg Works: A conceptual walkthrough of the three layers of metadata that make Iceberg powerful:
    1. Metadata File: Describes the current state of the table, including its schema, partition spec, and list of snapshots.
    2. Manifest List: One per snapshot; lists the manifest files that together make up that snapshot of the table.
    3. Manifest Files: Each lists a group of data files (.parquet), along with column-level statistics (min/max values, null counts) used for pruning.
  • Atomic Operations: An update to an Iceberg table is a simple, atomic swap of one metadata file pointer for another. This is how ACID transactions are achieved.
  • The Catalog: Where the pointer to the current metadata file is stored (e.g., AWS Glue, Hive Metastore, or even just HDFS).
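
For reference, pointing Spark at an Iceberg catalog is purely a configuration exercise. A minimal sketch using a filesystem-backed (Hadoop-type) catalog; the catalog name local and the warehouse path are assumptions, and the matching iceberg-spark-runtime JAR is expected on the classpath:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("iceberg-intro")
         # Enables Iceberg's SQL extensions (ALTER TABLE ..., CALL ... procedures).
         .config("spark.sql.extensions",
                 "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
         # A catalog named `local`, backed by plain files under a warehouse directory.
         .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.local.type", "hadoop")
         .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg_warehouse")
         .getOrCreate())
```

Swapping this for a shared Hive Metastore or AWS Glue catalog later (Hour 13-14) is a configuration change, not a code change.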

Hands-on Lab:

  • Take the failed Parquet write job from the previous lab.
  • Now, write the same data to a new Iceberg table using Spark.
  • Again, kill the job halfway through.
  • Show that the Iceberg table is completely unaffected and remains in its previous valid state. Read the table to prove its consistency. This is a direct demonstration of ACID transactions on a data lake.
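
A sketch of the Iceberg side of this lab, assuming the local catalog configured above and a DataFrame df whose schema matches the table (table and column names are illustrative):

```python
# Create the table once; schema and names are illustrative.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.soil_readings (
        sensor_id STRING,
        ts        TIMESTAMP,
        moisture  DOUBLE
    ) USING iceberg
""")

# Append via the table, not the directory. If the job is killed mid-write,
# no snapshot is committed and readers still see the previous valid state.
df.writeTo("local.db.soil_readings").append()

# After the interrupted run, the table still reads cleanly.
spark.table("local.db.soil_readings").count()
```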

Hour 5-6: Schema Evolution & Time Travel: The Pillars of Reproducibility ⏳

Learning Objectives:

  • Use Iceberg's schema evolution capabilities to add, drop, and rename columns without rewriting data.
  • Use "time travel" queries to access previous versions of a table for reproducibility and auditing.
  • Understand how these features support long-term data management and agile development.

Content:

  • The Ever-Changing Schema: In soil science, our understanding and measurement capabilities evolve: a new sensor comes online, a new lab method is adopted. Your data tables must be able to adapt gracefully.
  • Safe Schema Evolution: Unlike traditional systems, Iceberg handles schema changes with simple, fast metadata operations. You can add a column without affecting historical data or queries.
  • The Ultimate Undo Button: Every change to an Iceberg table creates a new, versioned snapshot. This allows for powerful "time travel" queries:
    • SELECT * FROM soil_table VERSION AS OF '...'
    • SELECT * FROM soil_table TIMESTAMP AS OF '...'
  • Use Case: This is a killer feature for machine learning. You can pin a model version to an Iceberg table version, guaranteeing you can always reproduce the exact data the model was trained on.
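
One way to pin a model to the exact data it was trained on is to record the snapshot ID at read time and time travel back to it later, using the VERSION AS OF syntax shown above (Spark 3.3+). A sketch against the assumed local.db.soil_readings table:

```python
# Read the latest snapshot ID from Iceberg's `snapshots` metadata table.
snap = spark.sql("""
    SELECT snapshot_id FROM local.db.soil_readings.snapshots
    ORDER BY committed_at DESC LIMIT 1
""").first()["snapshot_id"]

# Store `snap` with the model's metadata; reproducing the training set later
# is a time travel query pinned to that snapshot.
train_df = spark.sql(f"SELECT * FROM local.db.soil_readings VERSION AS OF {snap}")
```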

Technical Workshop:

  • Using Spark, perform the following operations on an Iceberg table:
    1. Add a new column (nitrate_ppm).
    2. Rename an existing column.
    3. Run a query to show the current schema.
    4. Run a time travel query using the snapshot ID from before the schema change to show the data in its original form.
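
The workshop maps onto a handful of SQL statements. A sketch, again assuming the local.db.soil_readings table and the Iceberg SQL extensions:

```python
# Capture the latest snapshot ID before touching the schema.
snap_before = spark.sql("""
    SELECT snapshot_id FROM local.db.soil_readings.snapshots
    ORDER BY committed_at DESC LIMIT 1
""").first()["snapshot_id"]

# 1. Add a new column -- a metadata-only change; no data files are rewritten.
spark.sql("ALTER TABLE local.db.soil_readings ADD COLUMNS (nitrate_ppm DOUBLE)")

# 2. Rename an existing column.
spark.sql("ALTER TABLE local.db.soil_readings RENAME COLUMN moisture TO moisture_pct")

# 3. Show the current schema.
spark.sql("DESCRIBE TABLE local.db.soil_readings").show(truncate=False)

# 4. Time travel to the pre-change snapshot to see the data in its original form.
spark.sql(f"SELECT * FROM local.db.soil_readings VERSION AS OF {snap_before}").show(5)
```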

Hour 7-8: Performance Tuning: Partitioning, Compaction, and Z-Ordering 🚀

Learning Objectives:

  • Implement Iceberg's "hidden partitioning" to dramatically speed up queries.
  • Run maintenance jobs to compact small files into larger, more efficient ones.
  • Apply Z-ordering to optimize queries with multi-column predicates.

Content:

  • The "Small File Problem": Ingesting streaming data often creates thousands of small files, which is highly inefficient for query engines.
  • Hidden Partitioning: A major Iceberg innovation. You define a partition transform on a raw column (e.g., days(event_timestamp)), and Iceberg derives the partition values behind the scenes (e.g., /year=2025/month=08/). Your users query by the timestamp, and Iceberg handles the partition pruning automatically (see the sketch after this list).
  • Table Maintenance:
    • Compaction: Running a maintenance job (e.g., Iceberg's rewrite_data_files procedure, surfaced as OPTIMIZE in some engines) to combine small files into larger ones.
    • Z-Ordering: A technique that physically co-locates related data across multiple dimensions, dramatically speeding up queries with multiple WHERE clauses (e.g., WHERE region = 'midwest' AND soil_type = 'mollisol').
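
A sketch of both ideas: hidden partitioning declared at table-creation time, and compaction with Z-ordering run as a maintenance call. Table and column names are assumptions, reusing the local catalog:

```python
# Hidden partitioning: users filter on `ts`; Iceberg derives and prunes
# the daily partitions for them.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.sensor_readings (
        sensor_id STRING,
        ts        TIMESTAMP,
        moisture  DOUBLE,
        region    STRING,
        soil_type STRING
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

# Maintenance: rewrite small files into larger ones, Z-ordered on the columns
# that appear together in WHERE clauses.
spark.sql("""
    CALL local.system.rewrite_data_files(
        table      => 'db.sensor_readings',
        strategy   => 'sort',
        sort_order => 'zorder(region, soil_type)'
    )
""")
```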

Optimization Lab:

  • Create a large (simulated) Iceberg table of sensor readings with timestamp and sensor_id columns.
  • Create the table with hidden partitioning on the timestamp column (e.g., PARTITIONED BY days(timestamp)).
  • Run a query with a time filter (e.g., WHERE timestamp > '...-01-01') and examine the Spark UI to see how many files were scanned (partition pruning).
  • Now run a compaction job and verify that the number of data files has decreased.
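
One way to verify the effect of compaction is to count live data files via Iceberg's files metadata table before and after the maintenance job. A sketch, reusing the assumed sensor_readings table:

```python
def data_file_count(table: str) -> int:
    # Iceberg exposes a `files` metadata table listing every live data file.
    return spark.sql(f"SELECT count(*) AS n FROM {table}.files").first()["n"]

before = data_file_count("local.db.sensor_readings")

# Default (binpack) compaction: combine small files into larger ones.
spark.sql("CALL local.system.rewrite_data_files(table => 'db.sensor_readings')")

after = data_file_count("local.db.sensor_readings")
print(f"data files: {before} -> {after}")
```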

Hour 9-10: Unifying Batch & Streaming in the Lakehouse 🔄

Learning Objectives:

  • Design a single architecture that serves both batch ETL and real-time streaming data.
  • Implement a Spark Structured Streaming job that writes a Kafka stream into an Iceberg table.
  • Understand how this architecture supports real-time inference workloads.

Content:

  • The Lambda Architecture is Dead: We no longer need separate, complex systems for batch and real-time. The Lakehouse can handle both.
  • Streaming Ingestion: Using Spark Structured Streaming or Apache Flink, we can read directly from the Kafka topics we designed in Module 11 and write to an Iceberg table.
  • Upserts and CDC: Iceberg supports MERGE INTO operations, allowing you to efficiently handle updates and deletes arriving from your streams (Change Data Capture); see the sketch after this list.
  • Serving Fresh Data: Because Iceberg updates are atomic, a machine learning model performing real-time inference can continuously query the same table that the streaming job is writing to, always getting the latest consistent snapshot of the data.
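
A minimal sketch of the MERGE INTO pattern, with a tiny hand-built batch standing in for one micro-batch of change records (all names and values are illustrative):

```python
from datetime import datetime

# A stand-in for one micro-batch of change records arriving from Kafka.
updates = spark.createDataFrame(
    [("s-001", datetime(2025, 8, 1, 12, 0), 0.31, "midwest", "mollisol")],
    "sensor_id STRING, ts TIMESTAMP, moisture DOUBLE, region STRING, soil_type STRING",
)
updates.createOrReplaceTempView("updates")

# Upsert by key: rewrite matching rows, insert the rest -- one atomic commit.
spark.sql("""
    MERGE INTO local.db.sensor_readings AS t
    USING updates AS s
    ON t.sensor_id = s.sensor_id AND t.ts = s.ts
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```

In a streaming pipeline, the same statement would typically run inside a foreachBatch callback, once per micro-batch.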

Streaming Lab:

  • Using Docker, set up Kafka and Spark.
  • Reuse the Kafka producer from Module 11 to generate a stream of sensor data.
  • Write a Spark Structured Streaming application that reads from the Kafka topic and writes the data to an Iceberg table using a 1-minute trigger.
  • While the stream is running, open a separate Spark shell and run batch queries on the Iceberg table, observing that new data appears every minute.
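
A sketch of the streaming job, assuming JSON messages on a soil-sensors topic, a broker at localhost:9092, the spark-sql-kafka connector on the classpath, and a pre-created Iceberg target table whose schema matches the payload:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Expected shape of each JSON message (an assumption about the producer).
payload = StructType([
    StructField("sensor_id", StringType()),
    StructField("ts", TimestampType()),
    StructField("moisture", DoubleType()),
])

readings = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "soil-sensors")
            .load()
            .select(F.from_json(F.col("value").cast("string"), payload).alias("r"))
            .select("r.*"))

# Commit a new Iceberg snapshot roughly once per minute.
query = (readings.writeStream
         .format("iceberg")
         .outputMode("append")
         .trigger(processingTime="1 minute")
         .option("checkpointLocation", "/tmp/checkpoints/soil_stream")
         .toTable("local.db.sensor_stream"))   # table assumed to exist already

query.awaitTermination()
```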

Hour 11-12: Managing Multimodal Data: Beyond the Single Table 🗺️🧬

Learning Objectives:

  • Design a data lake structure that can manage tabular, geospatial, genomic, and unstructured data.
  • Understand how to use Iceberg as a metadata catalog for non-tabular data formats.
  • Implement a solution using GeoParquet within an Iceberg-managed data lake.

Content:

  • The Multimodal Challenge: Soil data is diverse. We have tabular sensor readings, geospatial vector data, satellite imagery, and metagenomic sequences.
  • A Unified Catalog Approach: We use Iceberg to manage the primary, structured metadata, which can then point to data stored in other specialized formats.
  • The Architecture:
    • Tabular (Lab, Sensor): Store directly in Iceberg tables with Parquet file format.
    • Geospatial (Vector): Store the vector data as GeoParquet files in the data lake. Create an Iceberg table that catalogs these files, perhaps with summary statistics and a URI to each file's location (one possible schema is sketched after this list).
    • Unstructured (Images, Notebooks): Store the raw files (e.g., .jpg, .pdf) in object storage. Create an Iceberg table that acts as a searchable index with metadata and a URI to each file.
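
One possible shape for the geospatial catalog table mentioned above; all column names are illustrative, and the GeoParquet payload itself stays in object storage:

```python
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.geo_assets (
        asset_id       STRING,
        sample_id      STRING,
        bbox_min_lon   DOUBLE,
        bbox_min_lat   DOUBLE,
        bbox_max_lon   DOUBLE,
        bbox_max_lat   DOUBLE,
        feature_count  BIGINT,
        geoparquet_uri STRING    -- e.g. s3://soil-lake/geo/site_0001.parquet
    ) USING iceberg
""")
```

Queries filter on the bounding-box and sample_id columns here, and only the matching GeoParquet files are then fetched by URI.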

Design Exercise:

  • Design the schemas for a set of three interconnected Iceberg tables for a comprehensive soil survey:
    1. samples: Core lab analysis results (tabular).
    2. pedon_descriptions: Metadata about scanned field notebooks, with a URI to the PDF file.
    3. sample_locations: A table where each row corresponds to a sample and contains a URI to a GeoParquet file holding the detailed site boundary polygon.

Hour 13-14: Governance: The Data Catalog & Access Control 🏛️

Learning Objectives:

  • Understand the role of a central data catalog in managing a large-scale data lake.
  • Configure Spark and Iceberg to use a catalog like the AWS Glue Data Catalog.
  • Discuss strategies for implementing data security and access control in the lakehouse.

Content:

  • The Card Catalog for Your Data Lake: Without a central catalog, your data lake is just a collection of files that no one can find or trust.
  • The Catalog's Job: It stores the authoritative mapping from a table name (e.g., prod.soil_sensors) to the location of its current Iceberg metadata file.
  • Popular Catalogs: Hive Metastore, AWS Glue, Project Nessie (which adds Git-like semantics).
  • Securing the Lake: Integrating with tools like Apache Ranger or cloud IAM policies to define fine-grained permissions: "This user can read the soil_sensors table, but only for region=iowa and cannot see the sample_provider_id column."

Governance Lab:

  • Using Docker, set up a local Hive Metastore service.
  • Configure your Spark environment to use this Hive Metastore as its catalog.
  • Create a new Iceberg table.
  • Use a database tool (like DBeaver) or the Spark Catalog API to show that the table is now registered in the central catalog and is discoverable.
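
A sketch of the Spark-side configuration, assuming the Hive Metastore container is reachable at thrift://localhost:9083 and a local warehouse path (both are assumptions of this particular Docker setup):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("governance-lab")
         .config("spark.sql.extensions",
                 "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
         # A shared catalog named `prod`, backed by the Hive Metastore service.
         .config("spark.sql.catalog.prod", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.prod.type", "hive")
         .config("spark.sql.catalog.prod.uri", "thrift://localhost:9083")
         .config("spark.sql.catalog.prod.warehouse", "/tmp/hive_warehouse")
         .getOrCreate())

# Tables created here are registered centrally and discoverable by any engine
# that points at the same metastore.
spark.sql("CREATE NAMESPACE IF NOT EXISTS prod.soil")
spark.sql("""
    CREATE TABLE IF NOT EXISTS prod.soil.sensor_metadata (
        sensor_id STRING,
        installed DATE
    ) USING iceberg
""")
spark.sql("SHOW TABLES IN prod.soil").show()
```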

Hour 15: Capstone: Building the Soil Data Commons Lakehouse 🏆

Final Challenge: You are the lead data architect for the "Global Soil Data Commons" project. Your task is to build a proof-of-concept data lakehouse on your local machine that demonstrates the key capabilities required for this global-scale, multi-user platform.

Your Mission:

  1. Provision the Infrastructure: Using docker-compose, create a complete, self-contained environment with Spark, MinIO (for S3-compatible object storage), Kafka, and a Hive Metastore.
  2. Design and Create the Core Table: Create a multimodal, partitioned Iceberg table named global_soil_data. It must be partitioned by country and year and contain columns for lab measurements plus a URI column for associated raw data files (e.g., spectra).
  3. Unify Batch and Streaming Ingestion:
    • Write a Spark job to perform a bulk load of a large historical CSV dataset into the table.
    • Write a Spark Structured Streaming job that ingests real-time data from a Kafka topic and merges it into the same table.
  4. Demonstrate Advanced Features for a Global Audience:
    • Time Travel: A new partner provides corrected data for a past batch load. Use Iceberg's capabilities to replace a specific historical partition without taking the system offline, then run a query to show the data before and after the correction (one possible approach is sketched after this list).
    • Schema Evolution: The consortium agrees to add a new, standardized soil health metric. Evolve the table schema to add the new column while the streaming ingest is running.
    • Performance: Run a maintenance job to compact the small, streaming-ingested files to ensure query performance for other users.
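
For the partition correction, one possible approach is a dynamic partition overwrite followed by a time travel comparison. This sketch assumes corrected is a DataFrame holding the partner's fixed rows for the affected country/year partition(s):

```python
# Capture the snapshot ID before applying the correction.
snap_before_fix = spark.sql("""
    SELECT snapshot_id FROM local.db.global_soil_data.snapshots
    ORDER BY committed_at DESC LIMIT 1
""").first()["snapshot_id"]

# Atomically replace only the partitions present in `corrected`;
# readers never see a half-applied state.
corrected.writeTo("local.db.global_soil_data").overwritePartitions()

# Before-and-after comparison for the audit trail.
spark.sql(f"""
    SELECT country, year, count(*) AS n
    FROM local.db.global_soil_data VERSION AS OF {snap_before_fix}
    GROUP BY country, year
""").show()
spark.sql("""
    SELECT country, year, count(*) AS n
    FROM local.db.global_soil_data
    GROUP BY country, year
""").show()
```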

Deliverables:

  • A Git repository containing the docker-compose file and all Spark scripts needed to build and operate the lakehouse.
  • A Jupyter Notebook that acts as a user's guide, containing the queries that demonstrate the successful batch/stream unification, the data correction via time travel, and the live schema evolution.
  • A final architecture diagram and a short report explaining how your Lakehouse design addresses the core challenges of data reliability, scalability, and reproducibility required by the Soil Data Commons.

Assessment Criteria:

  • The correctness and robustness of the containerized infrastructure.
  • The successful implementation of both batch and streaming ingestion into a single Iceberg table.
  • The clear and effective demonstration of Iceberg's advanced features (ACID, time travel, schema evolution).
  • The quality of the documentation and the strategic vision articulated in the final report.