Module 6: Geospatial Data Engineering for Pedometrics
Master coordinate system transformations, spatial interpolation methods, and uncertainty propagation in soil mapping. Build systems to handle irregular sampling, preferential sampling bias, and scale mismatches.
The course objective is to master the engineering principles required to transform raw, scattered soil observations into spatially continuous, analysis-ready datasets. This module focuses on building robust systems for handling coordinate transformations, advanced spatial interpolation, and rigorous uncertainty quantification, with a special emphasis on overcoming the real-world challenges of irregular sampling, preferential bias, and multi-scale data fusion.
This module is the spatial backbone of the Foundation Phase. It builds directly upon the multi-scale data architectures from Module 2 and the clean, point-based data generated in Modules 4 (Spectroscopy) and 5 (Metagenomics). The skills developed here are essential for creating the training data that will power landscape-scale foundation models like CarbonSequestrator and ErosionVulnerability, turning point data into predictive surfaces.
Hour 1-2: The Foundation: Coordinate Reference Systems (CRS) & Projections
Learning Objectives:
- Understand the fundamental difference between geographic and projected coordinate systems.
- Master the concepts of datums (e.g., WGS84, NAD83), ellipsoids, and projections (e.g., UTM, Albers Equal Area).
- Build robust pipelines for identifying, validating, and transforming CRS in heterogeneous datasets.
Content:
- Why CRS is the #1 Source of Error: How mismatched datums and projections can lead to spatial offsets of hundreds of meters, corrupting all downstream analysis.
- The Anatomy of a CRS: Deconstructing EPSG codes and Well-Known Text (WKT) representations.
- Choosing the Right Projection: Understanding the trade-offs between preserving area, distance, and shape for different soil mapping applications.
- The Engineer's Toolkit: Using libraries like PROJ, GDAL/OGR, and Python's pyproj to build automated CRS transformation workflows.
Practical Exercise:
- You are given three soil sample datasets for a single farm: one in geographic coordinates (lat/lon WGS84), one in UTM Zone 15N (NAD83), and one with an unknown CRS.
- Write a Python script using `geopandas` and `pyproj` to:
- Identify the CRS of each file.
- Transform all datasets into a single, appropriate projected CRS.
- Create a validation plot showing all three datasets correctly aligned on a single map.
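A minimal sketch of the transformation step using `pyproj` directly (here assuming NAD83 / UTM Zone 15N, EPSG:26915, as the single target CRS for the farm; the EPSG codes and sample coordinates are illustrative):

```python
from pyproj import Transformer

# WGS84 geographic (EPSG:4326) -> NAD83 / UTM zone 15N (EPSG:26915).
# always_xy=True forces (lon, lat) axis order regardless of CRS axis conventions,
# a common source of silent coordinate swaps.
transformer = Transformer.from_crs("EPSG:4326", "EPSG:26915", always_xy=True)

# A point on the UTM 15N central meridian (93° W)
lon, lat = -93.0, 42.0
easting, northing = transformer.transform(lon, lat)

# Points on the central meridian map to easting = 500,000 m (the false easting),
# which makes a handy sanity check for the transformation
print(easting, northing)
```

In a `geopandas` workflow the same step is `gdf.to_crs("EPSG:26915")`, with `gdf.crs` used to identify each file's CRS first.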
Hour 3-4: Geostatistical Theory: Modeling Spatial Autocorrelation
Learning Objectives:
- Understand Tobler's First Law of Geography ("everything is related to everything else, but near things are more related than distant things").
- Quantify spatial autocorrelation using the experimental variogram.
- Model the variogram with mathematical functions to describe spatial structure.
Content:
- From Points to Patterns: The core concept of a random field and how we model soil properties as spatially continuous variables.
- The Variogram Cloud: Visualizing the relationship between sample separation distance and variance.
- Modeling the Variogram: A deep dive into the three key parameters that describe spatial dependency:
- Nugget: Represents measurement error and micro-scale variability.
- Sill: The total variance in the data.
- Range: The distance beyond which samples are no longer spatially correlated.
- Anisotropy: How to detect and model directional trends in spatial correlation (e.g., soil properties varying more along a slope than across it).
Hands-on Lab:
- Using a dataset of soil organic carbon point samples, write a script with the Python library `scikit-gstat` to:
- Calculate and plot the experimental variogram.
- Fit spherical, exponential, and Gaussian models to the variogram.
- Justify which model best represents the spatial structure of the data and interpret the nugget, sill, and range.
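The lab uses `scikit-gstat`, but the experimental variogram itself is simple enough to compute by hand, which makes the estimator concrete. A minimal NumPy sketch with a toy transect (illustrative only, not a substitute for the library's binning and model fitting):

```python
import numpy as np

def experimental_variogram(coords, values, bin_edges):
    """Classical (Matheron) semivariance estimator per distance bin:
    gamma(h) = 0.5 * mean of squared value differences for pairs in the bin."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    # All pairwise separation distances and squared value differences
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    sq = (values[:, None] - values[None, :]) ** 2
    iu = np.triu_indices(len(values), k=1)   # count each pair once
    d, sq = d[iu], sq[iu]
    gamma = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (d >= lo) & (d < hi)
        gamma.append(0.5 * sq[mask].mean() if mask.any() else np.nan)
    return np.array(gamma)

# Toy transect: values rise linearly with distance, so semivariance grows with lag
pts = [(0, 0), (1, 0), (2, 0), (3, 0)]
vals = [0.0, 1.0, 2.0, 3.0]
gamma = experimental_variogram(pts, vals, bin_edges=[0.5, 1.5, 2.5, 3.5])
print(gamma)  # [0.5, 2.0, 4.5] at lags ~1, ~2, ~3
```

The unbounded growth here is what a trend looks like in a variogram; stationary soil properties instead level off at the sill beyond the range.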
Hour 5-6: Spatial Interpolation I: Deterministic & Simple Approaches
Learning Objectives:
- Implement basic interpolation methods to understand the core concepts.
- Understand the limitations and appropriate use cases for non-statistical interpolators.
- Build a baseline model against which more advanced methods can be compared.
Content:
- Inverse Distance Weighting (IDW): A simple, intuitive method where the influence of a sample point decreases with distance. We'll discuss the critical choice of the "power" parameter.
- Thiessen (Voronoi) Polygons: A method that assigns the value of the nearest point to an entire area, creating a mosaic of polygons.
- Splines: Fitting a smooth surface through the data points, useful for gently varying properties.
- Why These Aren't Enough: A critical discussion of their major flaw: they don't provide a measure of prediction uncertainty.
Technical Workshop:
- Using the same soil organic carbon dataset, create interpolated maps using IDW (with different power parameters) and Thiessen polygons.
- Perform a leave-one-out cross-validation to compare the accuracy of the methods.
- Critique the resulting maps, identifying artifacts and discussing their limitations.
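IDW fits in a few lines, which is part of its appeal; a minimal sketch for a single query point (the power parameter and the zero-distance guard are the two details worth noticing):

```python
import numpy as np

def idw(coords, values, query, power=2.0, eps=1e-12):
    """Inverse distance weighted prediction at one query point.

    Higher `power` concentrates influence on the nearest samples;
    a query coinciding with a sample returns that sample's value exactly.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    d = np.linalg.norm(coords - np.asarray(query, dtype=float), axis=1)
    if d.min() < eps:                    # query sits on a sample point
        return float(values[d.argmin()])
    w = 1.0 / d ** power                 # influence decays with distance
    return float(np.sum(w * values) / np.sum(w))

pts = [(0.0, 0.0), (2.0, 0.0)]
vals = [0.0, 10.0]
mid = idw(pts, vals, (1.0, 0.0))         # equidistant -> simple average
print(mid)  # 5.0
```

Note what is missing: the function returns a single number with no error estimate, which is exactly the limitation the next session's kriging variance addresses.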
Hour 7-8: Spatial Interpolation II: Kriging & Geostatistical Prediction
Learning Objectives:
- Understand the theory behind Kriging as the Best Linear Unbiased Estimator (BLUE).
- Perform Ordinary Kriging to produce a map of predicted soil properties.
- Generate a corresponding map of the kriging variance to quantify prediction uncertainty.
Content:
- The Kriging Estimator: How it uses the modeled variogram to determine the optimal weights for surrounding samples to predict a value at an un-sampled location.
- Ordinary Kriging (OK): The most common form, assuming a constant but unknown local mean.
- The Power of Kriging: It's not just a map of predictions; it's also a map of confidence. The kriging variance is a direct output, showing where the predictions are reliable (near sample points) and where they are uncertain (far from data).
- Block Kriging: How to predict the average value over an area (e.g., a 30x30m grid cell) instead of at a single point, which is crucial for matching scales with remote sensing data.
Kriging Implementation Lab:
- Using the variogram model from Hour 3-4, implement Ordinary Kriging in Python using `pykrige` or `gstools`.
- Generate two raster maps:
- The predicted soil organic carbon map.
- The kriging variance (uncertainty) map.
- Analyze the relationship between the two maps and interpret the spatial patterns of uncertainty.
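`pykrige` and `gstools` wrap this machinery, but the OK system for one prediction point is small enough to solve directly, which demystifies where the weights and the variance come from. A sketch with an assumed exponential covariance and no nugget (parameters are illustrative, not fitted):

```python
import numpy as np

def ordinary_kriging(coords, values, query, sill=1.0, rng=10.0):
    """Ordinary kriging with an exponential covariance and no nugget.

    Solves the OK system [[C, 1], [1^T, 0]] [w; mu] = [c0; 1], where C is the
    sample covariance matrix, c0 the sample-to-query covariances, and mu the
    Lagrange multiplier enforcing sum(w) = 1. With no nugget, OK is exact at
    the data points.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)
    cov = lambda h: sill * np.exp(-h / rng)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = cov(d)
    A[n, n] = 0.0                         # Lagrange multiplier row/column
    b = np.ones(n + 1)
    b[:n] = cov(np.linalg.norm(coords - np.asarray(query, float), axis=1))
    sol = np.linalg.solve(A, b)
    w, mu = sol[:n], sol[n]
    pred = float(w @ values)
    # Kriging variance: sill - sum(w * c0) - mu -- the per-pixel value of the
    # uncertainty map, low near samples and high far from them
    var = float(sill - w @ b[:n] - mu)
    return pred, var

pts = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
vals = [1.0, 3.0, 2.0]
pred, var = ordinary_kriging(pts, vals, (0.0, 0.0))
print(pred, var)  # exact at a data point: pred = 1.0, var = 0.0
```

Running this over every grid cell yields both rasters the lab asks for: the prediction map and the variance map, from the same solve.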
Hour 9-10: The Real World: Handling Sampling Bias & Irregularity
Learning Objectives:
- Identify and visualize different types of sampling patterns (random, grid, clustered).
- Understand how preferential sampling (e.g., sampling easily accessible areas) can bias interpolation results.
- Implement methods to mitigate the effects of sampling bias.
Content:
- The Problem of Convenience: Why soil sampling often follows roads, field edges, or known "problem areas," violating the assumptions of many statistical models.
- Detecting Bias: Using statistical tests and visual analysis to compare the distribution of sample locations to the distribution of covariates (like elevation or slope).
- Mitigation Strategies:
- Declustering: Weighting samples in dense clusters less heavily to approximate a more random sample distribution.
- Model-Based Approaches: Using covariates to explicitly model the trend in the data. Universal Kriging and Regression Kriging incorporate secondary information (e.g., satellite imagery, elevation models) to improve predictions and account for trends that may have guided sampling.
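Cell declustering is the simplest of these strategies to implement; a minimal NumPy sketch (the cell size here is arbitrary for illustration; in practice you scan over cell sizes and inspect how the declustered mean responds):

```python
import numpy as np

def cell_decluster_weights(coords, cell_size):
    """Cell-declustering weights: each occupied grid cell gets equal total
    weight, shared equally among the samples inside it. Weights sum to 1,
    so dense clusters are down-weighted relative to isolated samples."""
    coords = np.asarray(coords, dtype=float)
    cells = np.floor(coords / cell_size).astype(int)
    # Map each sample to its cell, count samples per occupied cell
    _, inverse, counts = np.unique(
        cells, axis=0, return_inverse=True, return_counts=True
    )
    n_occupied = counts.size
    return 1.0 / (n_occupied * counts[inverse])

# Three clustered samples share one cell; one isolated sample has its own
pts = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.1), (5.0, 5.0)]
w = cell_decluster_weights(pts, cell_size=1.0)
print(w)  # [1/6, 1/6, 1/6, 1/2]
```

The declustered mean `(w * values).sum()` then approximates what a spatially representative sample would have given, even though the cluster contributes three of the four observations.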
Practical Exercise:
- Given a dataset of soil salinity samples known to be preferentially sampled in low-lying areas, first perform Ordinary Kriging and observe the biased result.
- Then, implement Regression Kriging using an elevation model as a covariate.
- Compare the two maps and the cross-validation statistics to demonstrate how incorporating the elevation data corrected the sampling bias.
Hour 11-12: Advanced Geostatistics & Uncertainty Propagation
Learning Objectives:
- Move beyond a single "best" map to a probabilistic view of soil properties.
- Implement Sequential Gaussian Simulation (SGS) to generate multiple equally probable maps (realizations).
- Use the ensemble of realizations to calculate robust uncertainty metrics and probabilities.
Content:
- Why Variance Isn't Enough: Kriging variance shows prediction error at a single point, but it doesn't capture the joint uncertainty across space (the "texture" of the spatial variability).
- Sequential Gaussian Simulation (SGS): An algorithm that generates multiple maps, each one honoring the sample data and the variogram. The set of these "realizations" represents the full uncertainty.
- Post-Processing Simulations: From an ensemble of 100+ realizations, you can calculate:
- The mean or median map (often more robust than a single kriging map).
- A variance map at every pixel.
- The probability of exceeding a critical threshold (e.g., "What is the probability that soil carbon is below 2%?").
Simulation Workshop:
- Implement SGS to generate 100 realizations of the soil organic carbon map.
- Write a script to process the stack of 100 output rasters to calculate and map:
- The pixel-wise mean.
- The pixel-wise standard deviation (a more robust uncertainty map).
- The probability that carbon concentration exceeds a regulatory threshold.
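The post-processing step reduces to array operations on the realization stack; a sketch using random noise as a stand-in for the 100 SGS output rasters (the loc/scale values and the 2% threshold are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for 100 SGS realizations of a 50x50 soil organic carbon map (% SOC);
# in the workshop this stack comes from reading the simulation output rasters
realizations = rng.normal(loc=2.5, scale=0.5, size=(100, 50, 50))

mean_map = realizations.mean(axis=0)        # pixel-wise mean across realizations
std_map = realizations.std(axis=0)          # pixel-wise std: the uncertainty map
threshold = 2.0
# Exceedance probability = fraction of realizations above the threshold, per pixel
prob_exceed = (realizations > threshold).mean(axis=0)

print(mean_map.shape, std_map.shape, prob_exceed.shape)
```

Each map has the same grid shape as a single realization; `1 - prob_exceed` answers the "probability that soil carbon is below 2%" question directly.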
Hour 13-14: Engineering for Scale Mismatches & Data Fusion
Learning Objectives:
- Understand the Modifiable Areal Unit Problem (MAUP) in soil science.
- Implement robust methods for upscaling and downscaling geospatial data.
- Build a data fusion pipeline that combines point data with raster covariates at different resolutions.
Content:
- The Scale Problem: You have point soil samples, a 10m elevation model, 30m satellite imagery, and a 4km climate grid. How do you combine them?
- Upscaling (Points to Rasters): This is the interpolation we've been doing, but now we focus on Block Kriging to correctly predict the average value for a grid cell.
- Downscaling (Rasters to Points/Finer Rasters): Using fine-scale covariates to disaggregate coarse-resolution data. This is key for creating high-resolution soil maps from global products like SoilGrids.
- The Covariate Stack: The engineering practice of resampling all raster covariates to a single, standardized grid that serves as the basis for all modeling.
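Conceptually, building the covariate stack means putting every raster on one grid; a toy NumPy sketch of nearest-neighbor disaggregation for an integer resolution factor (a real pipeline should use `rasterio`'s reprojection/resampling, which also handles affine transforms and CRS, but the array logic looks like this):

```python
import numpy as np

def nearest_upsample(raster, factor):
    """Disaggregate a coarse raster to a finer grid by an integer factor,
    repeating each coarse cell value (nearest-neighbor resampling).
    No new information is created -- each fine cell inherits its parent."""
    return np.repeat(np.repeat(raster, factor, axis=0), factor, axis=1)

# A 2x2 "90m" raster brought onto a 6x6 "30m" grid (factor 3)
coarse = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
fine = nearest_upsample(coarse, 3)
print(fine.shape)  # (6, 6)
```

The blocky result makes the MAUP concrete: the 30m cells carry 90m information, which is why true downscaling needs fine-scale covariates rather than resampling alone.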
Data Fusion Sprint:
- Create a standardized analysis grid (e.g., 30m resolution) for a study area.
- Write a Python script using `rasterio` and `gdal` to:
- Resample a 90m elevation model and a 1km climate raster to the 30m grid.
- Extract the values of these covariates at your point sample locations.
- Combine the point data and raster data into a single, analysis-ready GeoDataFrame.
Hour 15: Capstone: Building a Production Pedometric Mapping Pipeline
Final Challenge: You are tasked with creating the definitive, reproducible map of plant-available phosphorus for a small watershed to guide fertilizer recommendations. You are given a messy collection of data:
- 85 soil samples with phosphorus values, in a mix of CRS.
- A 10m resolution Digital Elevation Model (DEM).
- A 30m Landsat image showing vegetation patterns (NDVI).
- Known preferential sampling along streams.
Your Pipeline Must:
- Ingest & Clean: Harmonize all data into a single projected CRS.
- Exploratory Analysis: Model the variogram for phosphorus and test for anisotropy.
- Handle Bias: Use the DEM and NDVI as covariates in a Regression Kriging model to account for the preferential sampling.
- Quantify Uncertainty: Use geostatistical simulation (conditioned on the regression model) to generate 100 realizations of the phosphorus map.
- Deliver Actionable Intelligence: Produce three final maps:
- The best estimate (median) of plant-available phosphorus.
- A map of the 90% confidence interval width (a measure of uncertainty).
- A "management zone" map showing areas where there is a >80% probability that phosphorus is below the agronomic threshold.
Deliverables:
- A fully documented, runnable script or Jupyter Notebook that performs the entire workflow from raw data to final maps.
- The three final maps as GeoTIFF files.
- A brief report justifying your choice of model (Regression Kriging), interpreting the uncertainty map, and explaining how the final probability map can be used by a farm manager.
Assessment Criteria:
- Correctness of the geoprocessing and geostatistical workflow.
- Robustness of the code and reproducibility of the results.
- Clarity of justification for methodological choices.
- Actionability and interpretation of the final uncertainty and probability maps.