Module 8: Version Control for Scientific Datasets

Implement Git-LFS, DVC, and specialized tools for versioning large scientific datasets. Handle incremental updates to soil surveys and maintain reproducibility across model iterations.

The course objective is to implement and manage robust version control systems specifically designed for large, complex scientific datasets and machine learning models. Students will master Git-LFS for handling large files and DVC (Data Version Control) for creating reproducible, end-to-end data pipelines. The course will focus on practical workflows for managing incremental updates to soil datasets and ensuring complete reproducibility across model training iterations.

This module is the linchpin for ensuring reproducibility in the entire curriculum. It directly addresses the challenge of managing the large, heterogeneous data artifacts produced in Modules 4-7 (spectra, metagenomes, maps, time series). It provides the foundational engineering practice required for the iterative Model Development Phase (Modules 51-75) and the auditable, production-ready systems needed for the Deployment & Applications Phase (Modules 76-100), turning the ad-hoc scripts of previous modules into traceable, versioned pipelines.


Hour 1-2: The Reproducibility Crisis: Why Git Is Not Enough 🔬

Learning Objectives:

  • Understand why versioning data is fundamentally different and more complex than versioning code.
  • Analyze the failure modes of using standard Git for large data files (e.g., repository bloat, performance collapse).
  • Define the core principles of a reproducible scientific workflow: linking code, data, and outputs.

Content:

  • The final_data_v2_Johns_edit_final.csv Problem: A critical look at the ad-hoc "versioning" practices common in science.
  • Git's Blind Spot: Git is built to diff and merge text. We'll explore how it stores binary files as opaque snapshots, and why committing a 1GB GeoTIFF to Git is a recipe for disaster.
  • From Version Control to Provenance: Introducing the concept of a Directed Acyclic Graph (DAG) for a scientific workflow. We need to track not just the data, but the code that produced it.
  • Case Study: Deconstructing a published paper where a minor, untracked change in a dataset led to incorrect conclusions, highlighting the critical need for these tools.

Practical Exercise:

  • Initialize a standard Git repository.
  • Attempt to commit a 150MB file (e.g., a sample raster from Module 6).
  • Observe the warning messages and the inflation of the .git directory size.
  • Clone the repository to another location and note the slow transfer speed. This provides a tangible pain point that the rest of the module will solve.
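
A minimal sketch of this exercise as shell commands, assuming GNU coreutils on Linux (directory and file names are illustrative):

```bash
# Throwaway repository to demonstrate the problem.
mkdir git-bloat-demo && cd git-bloat-demo
git init

# Fabricate a ~150MB stand-in for a raster from Module 6.
dd if=/dev/urandom of=sample_raster.tif bs=1M count=150

# Git accepts this locally without complaint; hosting services push back
# later (GitHub warns above 50 MB and rejects files above 100 MB on push).
git add sample_raster.tif
git commit -m "Add large raster (anti-pattern)"

du -sh .git        # the blob now lives, permanently, in the object store

# Every clone copies the full history, large blobs included.
cd .. && git clone git-bloat-demo git-bloat-clone
du -sh git-bloat-clone/.git
```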

Hour 3-4: A First Step: Git Large File Storage (Git-LFS) 📂

Learning Objectives:

  • Understand the mechanics of Git-LFS: how it replaces large files with lightweight text pointers.
  • Install and configure Git-LFS in a project.
  • Track and manage large binary files without bloating the Git repository.

Content:

  • The Pointer System: A conceptual walkthrough of how Git-LFS's clean filter intercepts git add, checks whether the file matches a tracked pattern, and if so stores the content in the local LFS cache (uploading it to the LFS server at push time), leaving only a small pointer file in the Git history.
  • Installation and Setup: git lfs install.
  • Tracking Files: Using git lfs track to specify which file patterns (e.g., *.tif, *.h5) should be handled by LFS.
  • The LFS Cache: Understanding where the actual large files are stored locally and remotely.
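
A sketch of the setup described above; the pointer in the comment is the real three-line format, with an illustrative oid:

```bash
git lfs install                  # one-time: wires the LFS filters into Git

# Route matching files through LFS; the patterns land in .gitattributes.
git lfs track "*.tif" "*.h5"
git add .gitattributes

# Once a tracked file is committed, Git history stores only a pointer like:
#
#   version https://git-lfs.github.com/spec/v1
#   oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
#   size 157286400
#
# The bytes themselves sit in .git/lfs/objects locally, and on the LFS
# server once pushed.
```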

Hands-on Lab:

  • Take the repository from the previous exercise.
  • Install Git-LFS and configure it to track *.tif files.
  • Use git rm --cached to remove the large file from the index (the full-size blob remains in earlier history; git lfs migrate can rewrite that), then re-add and commit it so LFS takes over.
  • Inspect the file in the repository: it's now a small text pointer. Inspect the .git/lfs directory to see the actual stored object.
  • Push the repository to a remote (like GitHub) and observe the separate LFS upload process.
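
A command sketch of this lab, assuming the repository and sample_raster.tif from the Hour 1-2 exercise:

```bash
git lfs install
git lfs track "*.tif"
git add .gitattributes

# --cached removes the file from the index but keeps the working copy;
# re-adding it lets the LFS clean filter swap in a pointer.
git rm --cached sample_raster.tif
git add sample_raster.tif
git commit -m "Move raster to Git-LFS"

git show HEAD:sample_raster.tif    # prints the 3-line pointer, not 150MB
find .git/lfs/objects -type f      # the real object, filed by content hash

# Caveat: the old full-size blob is still in earlier commits; rewriting
# history needs: git lfs migrate import --include="*.tif"
git push origin main               # watch the separate LFS upload phase
```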

Hour 5-6: Beyond Files: Introducing DVC (Data Version Control) 🔗

Learning Objectives:

  • Understand the limitations of Git-LFS (it versions files, not pipelines or datasets).
  • Grasp the core philosophy of DVC: using Git to version metadata while handling data in remote storage.
  • Initialize a DVC project and configure a remote storage backend.

Content:

  • The Missing Link: Git-LFS knows what your data is, but not how it was made. DVC is designed to version the entire pipeline.
  • DVC's Architecture:
    • Git: Versions small .dvc metadata files and your code.
    • DVC Cache: A content-addressable storage for data files locally.
    • Remote Storage: Your S3, GCS, Azure Blob, or even SSH server where the actual data lives.
  • Setting Up: dvc init and dvc remote add. We'll configure DVC to use a cloud storage backend.
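
A minimal setup sketch, using a local directory as a stand-in remote (the remote name and paths are illustrative):

```bash
pip install dvc                 # assumption: pip install; conda also works
git init
dvc init                        # writes .dvc/ and .dvcignore, staged for Git
git commit -m "Initialize DVC"

# A plain directory behaves like a bucket for teaching purposes;
# swap the URL for s3://... or gs://... in real projects.
dvc remote add -d teaching-remote /tmp/dvc-storage
git commit .dvc/config -m "Configure default DVC remote"
```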

Technical Workshop:

  • Create a new project directory. Initialize both a Git and a DVC repository.
  • Create a dummy 50MB data file (e.g., soil_samples.csv).
  • Configure DVC to use a remote storage location (a local directory can simulate a cloud remote for this exercise).
  • Use dvc add to start tracking the data file.
  • Observe the new .dvc file created. cat this file to see that it's a small text file containing an MD5 hash and path.
  • Commit the .dvc file to Git. Use dvc push to send the actual data to the remote storage.
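
The workshop flow as commands; the cat output comment shows the shape of a .dvc file (hash and size are illustrative, and exact fields vary slightly across DVC versions):

```bash
# Fabricate a ~50MB stand-in dataset (base64 expands the 37MB by ~4/3).
dd if=/dev/urandom bs=1M count=37 | base64 > soil_samples.csv

dvc add soil_samples.csv    # hashes the file into .dvc/cache, links it back,
                            # writes soil_samples.csv.dvc, updates .gitignore

cat soil_samples.csv.dvc
# outs:
# - md5: 3863d0e317d4dafcf2fde213dee0e7b6
#   size: 52428800
#   path: soil_samples.csv

git add soil_samples.csv.dvc .gitignore
git commit -m "Track soil samples with DVC"
dvc push                    # ships the cached object to the remote
```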

Hour 7-8: Building Reproducible Pipelines with DVC ⛓️

Learning Objectives:

  • Use dvc run (or, in DVC 2.0 and later, dvc stage add plus dvc repro) to define and execute stages in a data pipeline.
  • Understand the structure and importance of the dvc.yaml file.
  • Reproduce a pipeline and see how DVC intelligently skips unchanged stages.

Content:

  • Defining Stages: A pipeline stage consists of dependencies (data or code), outputs (new data), and a command to run.
  • dvc run: The command that executes a script and creates a DVC stage, tracking its inputs and outputs. (DVC 2.0 split this into dvc stage add, which writes the stage definition, and dvc repro, which executes it; the sketch after this list uses the newer form.)
  • The dvc.yaml file: DVC automatically generates this file, which defines the entire workflow DAG. This file is committed to Git and is the key to reproducibility.
  • dvc repro: The command to re-run the pipeline. DVC checks the hashes of all dependencies; if nothing has changed, it does nothing. If a piece of code or data changes, it re-runs only that stage and all downstream stages.
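
A sketch of stage creation with the newer commands; the stage, script, and file names are illustrative:

```bash
# Define a stage: command, dependencies, and outputs.
dvc stage add -n filter \
    -d process.py -d data/soil_samples.csv \
    -o data/filtered.csv \
    python process.py data/soil_samples.csv data/filtered.csv

dvc repro                    # first run executes the stage, writes dvc.lock

cat dvc.yaml
# stages:
#   filter:
#     cmd: python process.py data/soil_samples.csv data/filtered.csv
#     deps:
#     - process.py
#     - data/soil_samples.csv
#     outs:
#     - data/filtered.csv

dvc repro                    # no dependency hash changed: "up to date"
```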

Pipeline Lab:

  • Create a simple Python script process.py that takes an input CSV, filters it, and saves an output CSV (a minimal sketch follows this list).
  • Use dvc run to execute this script, defining the input CSV as a dependency and the output CSV as an output.
  • Inspect the generated dvc.yaml.
  • Run dvc repro. Observe that DVC reports the pipeline is up to date.
  • Now, modify the process.py script (e.g., change a filter threshold).
  • Run dvc repro again. Observe that DVC now re-executes the stage because the code dependency has changed.
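
A minimal process.py for this lab, written as a heredoc so the whole lab stays in one shell session (the ph column and threshold are assumptions):

```bash
cat > process.py <<'EOF'
# Toy filter: keep samples above a pH threshold (column name assumed).
import sys
import pandas as pd

THRESHOLD = 5.5   # editing this constant is the "code change" step above

df = pd.read_csv(sys.argv[1])
df[df["ph"] > THRESHOLD].to_csv(sys.argv[2], index=False)
EOF
```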

Hour 9-10: Managing Evolving Datasets & Incremental Updates 🔄

Learning Objectives:

  • Develop a workflow for versioning datasets that receive periodic updates (e.g., new soil survey data).
  • Understand how DVC's caching mechanism efficiently handles large datasets with small changes.
  • Use dvc get and dvc import to share and reuse versioned data across projects.

Content:

  • The Soil Survey Problem: You have a 10GB dataset of soil samples. A new field campaign adds 50MB of new samples. How do you version this without duplicating the 10GB?
  • DVC's Caching Magic: DVC's content-addressable cache stores every file under its hash, so only new or changed files need to be hashed into the cache and uploaded; the updated version metadata simply points at a mix of old and new hashes.
  • Workflow for Updates (sketched as commands after this list):
    1. dvc pull the existing data.
    2. Add the new data files.
    3. dvc add the updated directory.
    4. git commit the changed .dvc file.
    5. dvc push only the new objects.
  • Sharing Data: Using dvc get to download a specific version of a dataset from another repository without cloning the whole project.
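
The update workflow as a command sketch (directory names, the source of the new files, and the dvc get URL are illustrative):

```bash
dvc pull                                     # 1. materialize the current dataset
cp ~/campaign_2025/*.csv data/soil_samples/  # 2. drop in the ~50MB of new files

dvc add data/soil_samples                    # 3. re-hash; only new files enter the cache
git add data/soil_samples.dvc
git commit -m "Add 2025 field campaign"      # 4. version the new manifest

dvc status -c                                # only the new objects show as missing
dvc push                                     # 5. uploads just the new objects

# From another project, fetch a pinned version without cloning:
dvc get https://github.com/example/soil-data data/soil_samples --rev v1.2
```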

Practical Exercise:

  • Start with a DVC-tracked directory containing several large files.
  • Simulate an update by adding a new file to the directory.
  • Run dvc add on the directory and observe the changes in the .dvc file.
  • Use dvc status -c to see that only the new file will be pushed to the remote.
  • Push the changes and then use git checkout HEAD~1 on the .dvc file followed by dvc checkout (or dvc pull, if the older data is no longer in the local cache) to revert the dataset to its previous version, as sketched below.
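
A sketch of that time-travel step; dvc checkout restores data from the local cache, while dvc pull also fetches from the remote when needed:

```bash
git log --oneline -- data/soil_samples.dvc    # locate the previous version
git checkout HEAD~1 -- data/soil_samples.dvc  # restore the older metadata
dvc checkout data/soil_samples                # rewrite the workspace to match
# or: dvc pull, if the older objects are not in the local cache
```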

Hour 11-12: Experiment Tracking for Model Iterations 📊

Learning Objectives:

  • Integrate model training into a DVC pipeline.
  • Use DVC to track model metrics and parameters.
  • Compare the results of different model experiments using DVC commands.

Content:

  • Versioning Models and Metrics: Extending the pipeline to include a training stage. The outputs are now the trained model file (.pkl, .h5) and a metrics file (.json).
  • dvc exp run: A powerful command that runs an experiment without creating a new Git commit for every run. Its --set-param (-S) flag injects different parameter values into your pipeline (see the sketch after this list).
  • dvc params diff: Compare the hyperparameters (e.g., learning rate, tree depth) used in different experiments.
  • dvc metrics diff: Compare the resulting model performance metrics (e.g., accuracy, RMSE) side-by-side in your terminal.
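
A sketch of parameterized runs, assuming a params.yaml whose train section is read by train.py and declared in the pipeline's training stage (names and values are illustrative):

```bash
cat > params.yaml <<'EOF'
train:
  max_depth: 4        # hypothetical hyperparameters consumed by train.py
  n_estimators: 100
EOF

dvc exp run                          # baseline, params.yaml as written
dvc exp run -S train.max_depth=8     # override a value for one experiment

dvc params diff                      # which knobs differ from HEAD
dvc metrics diff                     # what that did to the metrics
dvc exp show                         # table of all experiments side by side
```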

ML Experiment Lab:

  • Create a train.py script that loads processed data, trains a simple scikit-learn model, and saves the model and a metrics.json file (a minimal sketch follows this list).
  • Define a params.yaml file to hold hyperparameters.
  • Add a training stage to your dvc.yaml that depends on the processed data and the params.yaml file.
  • Run an initial experiment: dvc exp run.
  • Change a hyperparameter in params.yaml.
  • Run a second experiment: dvc exp run.
  • Use dvc exp show to see a table comparing the parameters and metrics from both runs.
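
A minimal train.py and matching stage definition (the model choice, the soc target column, and the RMSE metric are assumptions); -p declares the tracked params section and -M a small metrics file kept in Git:

```bash
cat > train.py <<'EOF'
# Train a small model and emit metrics.json for DVC to compare (sketch).
import json
import pandas as pd
import yaml
from joblib import dump
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

params = yaml.safe_load(open("params.yaml"))["train"]
df = pd.read_csv("data/filtered.csv")          # output of the process stage
X, y = df.drop(columns=["soc"]), df["soc"]     # target column assumed
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(
    max_depth=params["max_depth"],
    n_estimators=params["n_estimators"],
    random_state=0,
).fit(X_tr, y_tr)

dump(model, "model.pkl")
rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
json.dump({"rmse": rmse}, open("metrics.json", "w"), indent=2)
EOF

dvc stage add -n train \
    -d train.py -d data/filtered.csv -p train \
    -o model.pkl -M metrics.json \
    python train.py
```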

Hour 13-14: Advanced Workflows & Collaboration 🤝

Learning Objectives:

  • Structure a DVC project for team collaboration.
  • Understand how to use Git branches with DVC to work on data and models in parallel.
  • Integrate DVC with CI/CD systems for automated model validation.

Content:

  • DVC and Git Branching: The standard workflow:
    1. git checkout -b new-feature
    2. Make changes to data or code.
    3. dvc repro or dvc exp run.
    4. git commit and dvc push.
    5. Open a Pull Request. The PR will show the changes to code, params, and the results (metrics).
  • Introduction to CML (Continuous Machine Learning): An open-source library that extends CI/CD systems (like GitHub Actions) to work with DVC. It can automatically run your pipeline and post a report with performance metrics directly in a pull request (a workflow sketch follows this list).
  • Data Registries: Using DVC as a lightweight data registry to provide versioned, discoverable datasets to an entire organization.
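
A sketch of a CML workflow file, written from the shell; the action versions and the exact DVC/CML invocations are assumptions to verify against the current docs:

```bash
mkdir -p .github/workflows
cat > .github/workflows/cml.yaml <<'EOF'
name: model-check
on: pull_request
jobs:
  train-and-report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: iterative/setup-cml@v2
      - name: Reproduce pipeline and comment on the PR
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}   # lets CML post the comment
        run: |
          pip install dvc
          dvc pull
          dvc repro
          git fetch origin main
          echo "## Metrics vs. main" > report.md
          dvc metrics diff origin/main --md >> report.md
          cml comment create report.md
EOF
```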

Collaboration Simulation:

  • Work through a simulated pull request workflow. A teammate proposes a change to a data processing step.
  • Review the PR, noting the changes in code and the dvc.lock file.
  • Use dvc metrics diff to compare the performance of the model on the main branch versus the feature branch before merging.
  • Set up a simple GitHub Action using CML that automatically runs dvc repro and posts a comment on a PR.

Hour 15: Capstone: Building a Fully Versioned Soil Prediction Workflow 🏆

Final Challenge: You are given a complete but untracked soil modeling project. It contains raw data, a data processing script, a model training script, and configuration files. Your task is to bring this entire workflow under version control to ensure it is 100% reproducible.

The Project:

  • Data: Raw soil sample CSVs and a GeoTIFF elevation model.
  • Code: process.py (merges and cleans data), featurize.py (extracts elevation for points), train.py (trains a model).
  • Config: params.yaml for model hyperparameters.

Your Mission:

  1. Initialize: Set up Git, Git-LFS (for the GeoTIFF), and DVC with a remote.
  2. Version Data: Put the raw data under DVC control.
  3. Build the Pipeline: Create a multi-stage dvc.yaml file that defines the entire workflow: process -> featurize -> train (one plausible layout is sketched after this list).
  4. Run and Version: Execute the full pipeline with dvc repro and commit the results. Push everything (code to Git, data to DVC remote).
  5. Iterate: You are asked to test a new hyperparameter. Use dvc exp run to launch the new experiment.
  6. Report: Use dvc exp show to generate a comparison table of your experiments. Create a short markdown report explaining which experiment was better and why, and include the DVC table as proof.
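
One plausible shape for the capstone's dvc.yaml (stage wiring only; the script arguments, data paths, and params section are assumptions):

```bash
cat > dvc.yaml <<'EOF'
stages:
  process:
    cmd: python process.py data/raw data/clean.csv
    deps: [process.py, data/raw]
    outs: [data/clean.csv]
  featurize:
    cmd: python featurize.py data/clean.csv data/dem.tif data/features.csv
    deps: [featurize.py, data/clean.csv, data/dem.tif]
    outs: [data/features.csv]
  train:
    cmd: python train.py data/features.csv
    deps: [train.py, data/features.csv]
    params: [train]
    outs: [model.pkl]
    metrics:
      - metrics.json:
          cache: false
EOF

dvc repro    # executes process -> featurize -> train in dependency order
```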

Deliverables:

  • A link to a Git repository containing the fully versioned project.
  • The final markdown report comparing the model experiments.
  • A short screencast or written walkthrough explaining how a collaborator could clone your repository, run dvc pull, and perfectly reproduce your final result with dvc repro.

Assessment Criteria:

  • Correct use of Git, Git-LFS, and DVC for their respective roles.
  • A well-structured and functional dvc.yaml pipeline.
  • Successful execution and comparison of model experiments.
  • The clarity and completeness of the reproducibility instructions, proving the system works.