Module 8: Version Control for Scientific Datasets
Implement Git-LFS, DVC, and specialized tools for versioning large scientific datasets. Handle incremental updates to soil surveys and maintain reproducibility across model iterations.
The course objective is to implement and manage robust version control systems specifically designed for large, complex scientific datasets and machine learning models. Students will master Git-LFS for handling large files and DVC (Data Version Control) for creating reproducible, end-to-end data pipelines. The course will focus on practical workflows for managing incremental updates to soil datasets and ensuring complete reproducibility across model training iterations.
This module is the linchpin for ensuring reproducibility in the entire curriculum. It directly addresses the challenge of managing the large, heterogeneous data artifacts produced in Modules 4-7 (spectra, metagenomes, maps, time series). It provides the foundational engineering practice required for the iterative Model Development Phase (Modules 51-75) and the auditable, production-ready systems needed for the Deployment & Applications Phase (Modules 76-100), turning the ad-hoc scripts of previous modules into traceable, versioned pipelines.
Hour 1-2: The Reproducibility Crisis: Why `git` Is Not Enough
Learning Objectives:
- Understand why versioning data is fundamentally different and more complex than versioning code.
- Analyze the failure modes of using standard Git for large data files (e.g., repository bloat, performance collapse).
- Define the core principles of a reproducible scientific workflow: linking code, data, and outputs.
Content:
- The `final_data_v2_Johns_edit_final.csv` Problem: A critical look at the ad-hoc "versioning" practices common in science.
- Git's Blind Spot: Git versions text. We'll explore how it handles binary files and why storing a 1 GB GeoTIFF file in Git is a recipe for disaster.
- From Version Control to Provenance: Introducing the concept of a Directed Acyclic Graph (DAG) for a scientific workflow. We need to track not just the data, but the code that produced it.
- Case Study: Deconstructing a published paper where a minor, untracked change in a dataset led to incorrect conclusions, highlighting the critical need for these tools.
Practical Exercise:
- Initialize a standard Git repository.
- Attempt to commit a 150MB file (e.g., a sample raster from Module 6).
- Observe the warning messages and the inflation of the `.git` directory size.
- Clone the repository to another location and note the slow transfer speed. This provides a tangible pain point that the rest of the module will solve.
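The exercise above can be sketched as a short shell session (a minimal sketch, assuming `git` is installed and run in a scratch directory; the file name and size are illustrative):

```shell
# Demonstrate repository bloat: commit a large binary directly to Git.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
du -sh .git                                    # baseline: a few hundred KB at most
head -c 150000000 /dev/urandom > elevation.tif # stand-in for a 150 MB raster
git add elevation.tif
git commit -qm "add raster"
du -sh .git                                    # now ~150 MB: the blob lives in Git's object store forever
```

Every clone of this repository must now download that blob, and every future revision of the raster adds another full copy to the history.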
Hour 3-4: A First Step: Git Large File Storage (Git-LFS)
Learning Objectives:
- Understand the mechanics of Git-LFS: how it replaces large files with lightweight text pointers.
- Install and configure Git-LFS in a project.
- Track and manage large binary files without bloating the Git repository.
Content:
- The Pointer System: A conceptual walkthrough of how Git-LFS intercepts `git add`, checks whether the file type should be tracked, and if so uploads the file to a separate LFS store, leaving only a small pointer file in the Git history.
- Installation and Setup: `git lfs install`.
- Tracking Files: Using `git lfs track` to specify which file patterns (e.g., `*.tif`, `*.h5`) should be handled by LFS.
- The LFS Cache: Understanding where the actual large files are stored locally and remotely.
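For reference, `git lfs track "*.tif"` writes a rule into `.gitattributes`, and Git then stores a small text pointer in place of each tracked file. Both look roughly like this (the `oid` hash is truncated and illustrative):

```text
# .gitattributes, written by `git lfs track "*.tif"` (commit this file):
*.tif filter=lfs diff=lfs merge=lfs -text

# What Git stores in place of a tracked .tif (the LFS pointer file):
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a21...
size 157286400
```

The pointer is a few dozen bytes regardless of how large the raster is; the actual bytes live in the LFS object store.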
Hands-on Lab:
- Take the repository from the previous exercise.
- Install Git-LFS and configure it to track `*.tif` files.
- Use `git rm --cached` to untrack the large file, then re-add and commit it so LFS takes over.
- Inspect the file in the repository: it's now a small text pointer. Inspect the `.git/lfs` directory to see the actual stored object.
- Push the repository to a remote (like GitHub) and observe the separate LFS upload process.
Hour 5-6: Beyond Files: Introducing DVC (Data Version Control)
Learning Objectives:
- Understand the limitations of Git-LFS (it versions files, not pipelines or datasets).
- Grasp the core philosophy of DVC: using Git to version metadata while handling data in remote storage.
- Initialize a DVC project and configure a remote storage backend.
Content:
- The Missing Link: Git-LFS knows what your data is, but not how it was made. DVC is designed to version the entire pipeline.
- DVC's Architecture:
  - Git: versions your code and the small `.dvc` metadata files.
  - DVC Cache: content-addressable local storage for the actual data files.
  - Remote Storage: your S3, GCS, Azure Blob, or even SSH server where the data lives long-term.
- Setting Up: `dvc init` and `dvc remote add`. We'll configure DVC to use a cloud storage backend.
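After `dvc init` and `dvc remote add -d myremote /tmp/dvc-storage` (the remote name and path here are illustrative), the committed `.dvc/config` file records the remote in a small INI-style fragment, roughly:

```
[core]
    remote = myremote
['remote "myremote"']
    url = /tmp/dvc-storage
```

Because this file lives in Git, every collaborator who clones the repository inherits the same default remote.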
Technical Workshop:
- Create a new project directory. Initialize both a Git and a DVC repository.
- Create a dummy 50 MB data file (e.g., `soil_samples.csv`).
- Configure DVC to use a remote storage location (a local directory can simulate a cloud remote for this exercise).
- Use `dvc add` to start tracking the data file.
- Observe the new `.dvc` file created. `cat` this file to see that it's a small text file containing an MD5 hash and a path.
- Commit the `.dvc` file to Git. Use `dvc push` to send the actual data to the remote storage.
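For reference, the `soil_samples.csv.dvc` stub produced by `dvc add` is a few lines of YAML along these lines (the hash is truncated and illustrative; exact fields vary by DVC version):

```yaml
outs:
- md5: 2f54d2a9d8c1...
  size: 52428800
  path: soil_samples.csv
```

Git versions this tiny stub; the 50 MB of actual data is addressed by its hash in the cache and the remote.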
Hour 7-8: Building Reproducible Pipelines with DVC
Learning Objectives:
- Use `dvc run` to define and execute stages in a data pipeline.
- Understand the structure and importance of the `dvc.yaml` file.
- Reproduce a pipeline and see how DVC intelligently skips unchanged stages.
Content:
- Defining Stages: A pipeline stage consists of dependencies (data or code), outputs (new data), and a command to run.
- `dvc run`: The command that executes a script and creates a DVC stage, tracking its inputs and outputs.
- The `dvc.yaml` file: DVC automatically generates this file, which defines the entire workflow DAG. This file is committed to Git and is the key to reproducibility.
- `dvc repro`: The command to re-run the pipeline. DVC checks the hashes of all dependencies; if nothing has changed, it does nothing. If a piece of code or data changes, it re-runs only that stage and all downstream stages.
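A stage defined this way lands in `dvc.yaml` roughly as follows (the script name, arguments, and file paths are illustrative):

```yaml
stages:
  process:
    cmd: python process.py data/raw.csv data/processed.csv
    deps:
      - process.py
      - data/raw.csv
    outs:
      - data/processed.csv
```

`dvc repro` walks this DAG, hashing each `deps` entry to decide whether the `cmd` needs to run again.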
Pipeline Lab:
- Create a simple Python script `process.py` that takes an input CSV, filters it, and saves an output CSV.
- Use `dvc run` to execute this script, defining the input CSV as a dependency and the output CSV as an output.
- Inspect the generated `dvc.yaml`.
- Run `dvc repro`. Observe that DVC reports the pipeline is up to date.
- Now, modify the `process.py` script (e.g., change a filter threshold).
- Run `dvc repro` again. Observe that DVC now re-executes the stage because the code dependency has changed.
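A minimal `process.py` for this lab might look like the following sketch (the `ph` column name and the threshold are illustrative assumptions, not part of DVC):

```python
"""process.py: filter a soil-sample CSV by a pH threshold."""
import csv
import sys


def filter_rows(rows, column="ph", threshold=5.5):
    """Keep rows whose numeric `column` value is >= threshold."""
    return [r for r in rows if float(r[column]) >= threshold]


def main(in_path, out_path):
    with open(in_path, newline="") as f:
        rows = list(csv.DictReader(f))
    kept = filter_rows(rows)
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(kept)


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Changing the `threshold` default is exactly the kind of one-line code edit that makes `dvc repro` re-execute the stage, because the hash of `process.py` changes.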
Hour 9-10: Managing Evolving Datasets & Incremental Updates
Learning Objectives:
- Develop a workflow for versioning datasets that receive periodic updates (e.g., new soil survey data).
- Understand how DVC's caching mechanism efficiently handles large datasets with small changes.
- Use `dvc get` and `dvc import` to share and reuse versioned data across projects.
Content:
- The Soil Survey Problem: You have a 10GB dataset of soil samples. A new field campaign adds 50MB of new samples. How do you version this without duplicating the 10GB?
- DVC's Caching Magic: DVC's content-addressable cache means it only needs to store and upload the new data. The version metadata is updated, but the underlying storage is highly efficient.
- Workflow for Updates:
  - `dvc pull` the existing data.
  - Add the new data files.
  - `dvc add` the updated directory.
  - `git commit` the changed `.dvc` file.
  - `dvc push` only the new data chunks.
- Sharing Data: Using `dvc get` to download a specific version of a dataset from another repository without cloning the whole project.
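The caching idea can be illustrated in a few lines of Python: files are keyed by a hash of their content, so an update only creates cache objects for genuinely new bytes. This is a conceptual sketch of content addressing, not DVC's actual implementation:

```python
import hashlib
import os


def cache_path(data: bytes, cache_dir=".dvc/cache") -> str:
    """Content addressing: the file's hash *is* its storage key, so an
    unchanged file always maps to an existing cache entry."""
    digest = hashlib.md5(data).hexdigest()
    # DVC-style layout: first two hex chars become a subdirectory.
    return os.path.join(cache_dir, digest[:2], digest[2:])


def new_objects(old_files: dict, updated_files: dict) -> set:
    """Cache keys that must be stored/uploaded after an update
    (filename -> content mappings for the old and new dataset)."""
    old_keys = {cache_path(v) for v in old_files.values()}
    return {cache_path(v) for v in updated_files.values()} - old_keys
```

With a 10 GB dataset and a 50 MB field campaign, `new_objects` would contain only the keys for the new samples, which is why `dvc push` after an incremental update is cheap.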
Practical Exercise:
- Start with a DVC-tracked directory containing several large files.
- Simulate an update by adding a new file to the directory.
- Run `dvc add` on the directory and observe the changes in the `.dvc` file.
- Use `dvc status -c` to see that only the new file will be pushed to the remote.
- Push the changes, then use `git checkout HEAD~1` and `dvc pull` to revert the dataset to its previous version.
Hour 11-12: Experiment Tracking for Model Iterations
Learning Objectives:
- Integrate model training into a DVC pipeline.
- Use DVC to track model metrics and parameters.
- Compare the results of different model experiments using DVC commands.
Content:
- Versioning Models and Metrics: Extending the pipeline to include a training stage. The outputs are now the trained model file (`.pkl`, `.h5`) and a metrics file (`.json`).
- `dvc exp run`: A powerful command that runs an experiment without creating a new Git commit for every run. It can be used to inject different parameters into your pipeline.
- `dvc params diff`: Compare the hyperparameters (e.g., learning rate, tree depth) used in different experiments.
- `dvc metrics diff`: Compare the resulting model performance metrics (e.g., accuracy, RMSE) side-by-side in your terminal.
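A `params.yaml` for the training stage might look like this sketch (the parameter names and values are illustrative):

```yaml
train:
  learning_rate: 0.1
  max_depth: 6
  n_estimators: 200
```

Individual values can also be overridden per experiment without editing the file, e.g. `dvc exp run --set-param train.max_depth=8`, which is how you fan out multiple experiments from one baseline.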
ML Experiment Lab:
- Create a `train.py` script that loads processed data, trains a simple scikit-learn model, and saves the model and a `metrics.json` file.
- Define a `params.yaml` file to hold hyperparameters.
- Add a training stage to your `dvc.yaml` that depends on the processed data and the `params.yaml` file.
- Run an initial experiment: `dvc exp run`.
- Change a hyperparameter in `params.yaml`.
- Run a second experiment: `dvc exp run`.
- Use `dvc exp show` to see a table comparing the parameters and metrics from both runs.
Hour 13-14: Advanced Workflows & Collaboration
Learning Objectives:
- Structure a DVC project for team collaboration.
- Understand how to use Git branches with DVC to work on data and models in parallel.
- Integrate DVC with CI/CD systems for automated model validation.
Content:
- DVC and Git Branching: The standard workflow:
  - `git checkout -b new-feature`
  - Make changes to data or code.
  - `dvc repro` or `dvc exp run`.
  - `git commit` and `dvc push`.
  - Open a Pull Request. The PR will show the changes to code, params, and the results (metrics).
- Introduction to CML (Continuous Machine Learning): An open-source library that extends CI/CD systems (like GitHub Actions) to work with DVC. It can automatically run your pipeline and post a report with performance metrics directly in a pull request.
- Data Registries: Using DVC as a lightweight data registry to provide versioned, discoverable datasets to an entire organization.
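A minimal GitHub Actions workflow using CML might look like the following sketch (the action versions and exact CML/DVC commands are assumptions to check against the current CML documentation):

```yaml
name: model-report
on: pull_request
jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: iterative/setup-dvc@v1
      - uses: iterative/setup-cml@v2
      - name: Reproduce pipeline and post report
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          dvc pull
          dvc repro
          echo "## Metrics" > report.md
          dvc metrics show --md >> report.md
          cml comment create report.md
```

The effect is that every pull request gets an automatically generated comment showing the metrics produced by the proposed change.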
Collaboration Simulation:
- Work through a simulated pull request workflow. A teammate proposes a change to a data processing step.
- Review the PR, noting the changes in code and the `dvc.lock` file.
- Use `dvc metrics diff` to compare the performance of the model on the main branch versus the feature branch before merging.
- Set up a simple GitHub Action using CML that automatically runs `dvc repro` and posts a comment on a PR.
Hour 15: Capstone: Building a Fully Versioned Soil Prediction Workflow
Final Challenge: You are given a complete but untracked soil modeling project. It contains raw data, a data processing script, a model training script, and configuration files. Your task is to bring this entire workflow under version control to ensure it is 100% reproducible.
The Project:
- Data: Raw soil sample CSVs and a GeoTIFF elevation model.
- Code: `process.py` (merges and cleans data), `featurize.py` (extracts elevation for points), `train.py` (trains a model).
- Config: `params.yaml` for model hyperparameters.
Your Mission:
- Initialize: Set up Git, Git-LFS (for the GeoTIFF), and DVC with a remote.
- Version Data: Put the raw data under DVC control.
- Build the Pipeline: Create a multi-stage `dvc.yaml` file that defines the entire workflow: `process` -> `featurize` -> `train`.
- Run and Version: Execute the full pipeline with `dvc repro` and commit the results. Push everything (code to Git, data to the DVC remote).
- Iterate: You are asked to test a new hyperparameter. Use `dvc exp run` to launch the new experiment.
- Report: Use `dvc exp show` to generate a comparison table of your experiments. Create a short markdown report explaining which experiment was better and why, and include the DVC table as proof.
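The three-stage pipeline could be laid out in `dvc.yaml` roughly like this (script arguments, file names, and parameter keys are illustrative; `cache: false` keeps the small metrics file in Git rather than the DVC cache):

```yaml
stages:
  process:
    cmd: python process.py
    deps: [process.py, data/raw_samples]
    outs: [data/clean.csv]
  featurize:
    cmd: python featurize.py
    deps: [featurize.py, data/clean.csv, data/elevation.tif]
    outs: [data/features.csv]
  train:
    cmd: python train.py
    deps: [train.py, data/features.csv]
    params: [train.learning_rate, train.max_depth]
    outs: [model.pkl]
    metrics:
      - metrics.json:
          cache: false
```

With this DAG in place, a change to `featurize.py` triggers only the `featurize` and `train` stages on the next `dvc repro`.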
Deliverables:
- A link to a Git repository containing the fully versioned project.
- The final markdown report comparing the model experiments.
- A short screencast or written walkthrough explaining how a collaborator could clone your repository, run `dvc pull`, and perfectly reproduce your final result with `dvc repro`.
Assessment Criteria:
- Correct use of Git, Git-LFS, and DVC for their respective roles.
- A well-structured and functional `dvc.yaml` pipeline.
- Successful execution and comparison of model experiments.
- The clarity and completeness of the reproducibility instructions, proving the system works.