Spatial Plot Sampling Design for Forest Inventory in Python

A district forester needs to estimate standing carbon across 8,400 hectares of mixed conifer before a harvest authorization deadline, with a crew of four and three weeks of field time. The number of plots is fixed by budget; the placement of those plots decides whether the resulting biomass estimate has a 5% standard error or a 20% one. Spatial plot sampling design is the discipline that turns a fixed field budget into the most informative set of plot coordinates possible — minimizing estimator variance while respecting access constraints, ecological gradients, and jurisdictional boundaries. This page implements that design as a reproducible Python pipeline, building directly on the conventions established in Ecological GIS Data Foundations in Python so that every coordinate it emits is projection-correct, topologically clean, and traceable from boundary input to field GPX export.

The goal is not “random points in a polygon.” It is a design where each plot’s inclusion probability is known, where allocation follows the variance structure of the landscape, and where the whole process re-runs deterministically from a seed for audit and replication.

Prerequisites

Confirm your environment and inputs before generating a single coordinate. Mismatches here are the single largest source of silently biased designs.

Python 3.10+ with geopandas >= 1.0, shapely >= 2.0, numpy >= 1.24, and rasterio >= 1.3 installed (Shapely 2.0 vectorized predicates are assumed throughout, and the boundary step calls union_all(), which GeoPandas added in 1.0).
Study-area boundary as a single-feature polygon or multipolygon with a valid, explicit CRS — no EPSG:4326 for area or distance work.
A projected, area-preserving or locally conformal CRS chosen for the region (e.g. UTM zone, Albers Equal Area, or a State Plane zone). See Coordinate Reference Systems for Forestry for selection criteria.
Stratification layer(s): either a categorical raster (canopy cover class, ecological zone) or a polygon layer carrying a stratum attribute.
A fixed total plot budget n and, if using Neyman allocation, a per-stratum variance estimate (a prior cruise, a pilot, or a covariate proxy).
A target field GPS format (GPX for Garmin, GeoPackage for QGIS, CSV for ArcGIS Field Maps) and its required CRS — most handheld units expect EPSG:4326.

The Sampling Principle: Why Stratify and How to Allocate

Simple random sampling places every plot with equal probability everywhere. That is unbiased, but for a heterogeneous forest it is wasteful: a dense plantation and a sparse riparian fringe get plots in proportion to their area, not their contribution to total variance. Stratification partitions the landscape into internally homogeneous zones and samples each independently, then recombines stratum estimates with area weights.

For a population split into $L$ strata, the variance of the stratified mean is

$$\operatorname{Var}(\bar{y}_{st}) = \sum_{h=1}^{L} W_{h}^{2} \frac{S_{h}^{2}}{n_{h}}, \qquad W_{h} = \frac{A_{h}}{A}$$

where $W_{h}$ is stratum $h$'s area weight (its area $A_{h}$ over the total frame area $A$), $S_{h}^{2}$ its within-stratum variance, and $n_{h}$ its plot count. Two allocation rules distribute the fixed budget $n = \sum_{h} n_{h}$:

Proportional allocation sets $n_{h} = n W_{h}$, so plots follow area. It is simple and, ignoring finite-population and rounding effects, it is never less precise than simple random sampling; the gain grows as stratum means diverge.

Neyman (optimal) allocation sets

$$n_{h} = n \cdot \frac{W_{h} S_{h}}{\sum_{k=1}^{L} W_{k} S_{k}}$$

pushing more plots into large, high-variance strata. Neyman minimizes $\operatorname{Var}(\bar{y}_{st})$ for a fixed $n$, but it requires a defensible estimate of $S_{h}$. When prior variance is unknown, proportional allocation is the honest default. The per-stratum mechanics and a worked Neyman example live in Stratified random sampling for forest plots.
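
To make the two rules concrete, here is a small worked sketch in plain Python. The three strata, their area weights, and their standard deviations are invented for illustration and do not come from a real cruise. Allocations are left fractional so the variance comparison is not blurred by rounding, and no finite-population correction is applied.

```python
import math

# Hypothetical strata: area weights W_h and within-stratum standard
# deviations S_h (say, tonnes of carbon per hectare). Illustrative only.
W = [0.5, 0.3, 0.2]
S = [10.0, 20.0, 40.0]
n = 100  # total plot budget

def stratified_variance(n_h):
    """Var(ybar_st) = sum_h W_h^2 * S_h^2 / n_h, without fpc."""
    return sum(w**2 * s**2 / k for w, s, k in zip(W, S, n_h))

# Proportional allocation: n_h = n * W_h
proportional = [n * w for w in W]

# Neyman allocation: n_h = n * W_h S_h / sum_k W_k S_k
ws = [w * s for w, s in zip(W, S)]
neyman = [n * x / sum(ws) for x in ws]

var_prop = stratified_variance(proportional)
var_neyman = stratified_variance(neyman)

print("proportional n_h:", [round(k, 1) for k in proportional])
print("neyman n_h:      ", [round(k, 1) for k in neyman])
print("standard errors: ", round(math.sqrt(var_prop), 2),
      round(math.sqrt(var_neyman), 2))
```

In this made-up case the high-variance stratum holds only 20% of the area but receives roughly 42% of the plots under Neyman allocation, and the variance of the stratified mean drops from 4.9 to 3.61 for the same budget.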

Step-by-Step Python Pipeline

The pipeline runs in four stages: validate and reproject the boundary, derive strata, allocate the budget, then sample coordinates with spatial constraints. Each stage is a pure function so the whole design is deterministic from inputs plus a seed.

Step 1 — Validate and reproject the boundary

Topology errors and a wrong CRS corrupt every later area calculation, so repair geometry and force the target projection up front. The inward (negative) buffer keeps plot centers off the boundary edge, where access and edge effects are problematic.

import geopandas as gpd
from shapely.validation import make_valid

def prepare_boundary(
    boundary_gdf: gpd.GeoDataFrame,
    target_crs: str,
    edge_buffer_m: float = 50.0,
) -> gpd.GeoDataFrame:
    """Repair topology, reproject, and apply an inward edge buffer."""
    if boundary_gdf.crs is None:
        raise ValueError("Boundary has no CRS; assign one before sampling.")

    # Repair invalid geometry such as self-intersections and bow-tie rings.
    if not boundary_gdf.is_valid.all():
        boundary_gdf = boundary_gdf.copy()
        boundary_gdf["geometry"] = boundary_gdf.geometry.apply(make_valid)

    boundary_gdf = boundary_gdf.to_crs(target_crs)

    # Dissolve to one frame, then buffer inward so plots stay navigable.
    extent = boundary_gdf.union_all().buffer(-edge_buffer_m)
    if extent.is_empty:
        raise ValueError("Edge buffer consumed the polygon; reduce edge_buffer_m.")

    return gpd.GeoDataFrame(geometry=[extent], crs=target_crs)

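A negative buffer that exceeds the half-width of any narrow neck will erase that part of the sampling frame. The explicit is_empty check only catches the extreme case where the buffer consumes the entire polygon; a partially erased neck still passes, so compare the buffered area and part count against the original before trusting the frame.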
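
The is_empty guard fires only when the whole frame disappears. One possible finer check, sketched below, reports how much area the inward buffer removed and how many disconnected parts remain. The helper name buffer_loss and the test geometry (two 1 km blocks joined by a 40 m corridor) are hypothetical and only illustrate the idea.

```python
from shapely.geometry import box
from shapely.ops import unary_union

def buffer_loss(frame, edge_buffer_m):
    """Return (fraction of area lost, number of remaining parts)."""
    shrunk = frame.buffer(-edge_buffer_m)
    lost = 1.0 - shrunk.area / frame.area
    if shrunk.is_empty:
        parts = 0
    elif shrunk.geom_type == "Polygon":
        parts = 1
    else:  # MultiPolygon: the buffer severed the frame
        parts = len(shrunk.geoms)
    return lost, parts

# Hypothetical frame: two 1 km blocks joined by a 40 m wide corridor.
west = box(0, 0, 1000, 1000)
east = box(1500, 0, 2500, 1000)
corridor = box(1000, 480, 1500, 520)
frame = unary_union([west, corridor, east])

lost_10, parts_10 = buffer_loss(frame, 10.0)  # corridor survives
lost_30, parts_30 = buffer_loss(frame, 30.0)  # corridor half-width is 20 m
print(parts_10, parts_30)
```

A part count that rises after buffering, or a lost fraction well above what the perimeter alone would explain, is a sign that a neck was severed and the buffer or the frame needs revisiting.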

Step 2 — Derive strata and their areas

Strata can arrive as polygons or be derived from a categorical raster (canopy cover class, NDVI bins, ecological zone). When deriving from a raster, polygonize within the prepared extent so stratum boundaries clip exactly to the sampling frame. Establishing consistent extents here also prevents misalignment when these same zones later feed Vegetation Index Calculation in Python or any raster–vector overlay step.

import numpy as np
import rasterio
from rasterio.features import shapes
from shapely.geometry import shape

def strata_from_raster(
    raster_path: str,
    extent: gpd.GeoDataFrame,
) -> gpd.GeoDataFrame:
    """Polygonize a categorical stratum raster, clipped to the sampling extent."""
    with rasterio.open(raster_path) as src:
        band = src.read(1)
        mask = band != src.nodata if src.nodata is not None else None
        geoms = [
            {"geometry": shape(geom), "stratum": int(val)}
            for geom, val in shapes(band, mask=mask, transform=src.transform)
        ]
        raster_crs = src.crs

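    # Build directly from the records; from_features() expects GeoJSON-style
    # features with a "properties" key, which these dicts do not have.
    polys = gpd.GeoDataFrame(geoms, crs=raster_crs).to_crs(extent.crs)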
    # Dissolve fragments of the same class, then clip to the frame.
    polys = polys.dissolve(by="stratum", as_index=False)
    clipped = gpd.clip(polys, extent)
    clipped["area_m2"] = clipped.geometry.area
    return clipped[clipped["area_m2"] > 0].reset_index(drop=True)
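
When strata arrive as a polygon layer rather than a raster, the same dissolve, clip, and area steps apply. The sketch below assumes a layer whose stratum attribute is named by the caller; the function name strata_from_polygons, the cover field, and the toy geometries are illustrative and not part of the pipeline above.

```python
import geopandas as gpd
from shapely.geometry import box

def strata_from_polygons(polys, extent, stratum_field):
    """Dissolve a polygon layer by its stratum attribute, clipped to the frame."""
    geom_col = polys.geometry.name
    polys = polys.to_crs(extent.crs)[[stratum_field, geom_col]]
    polys = polys.rename(columns={stratum_field: "stratum"})
    # Merge fragments of the same class, then clip to the sampling frame.
    polys = polys.dissolve(by="stratum", as_index=False)
    clipped = gpd.clip(polys, extent).copy()
    clipped["area_m2"] = clipped.geometry.area
    return clipped[clipped["area_m2"] > 0].reset_index(drop=True)

# Toy inputs in a metric CRS: a 100 m square frame and three cover polygons.
extent = gpd.GeoDataFrame(geometry=[box(0, 0, 100, 100)], crs="EPSG:32610")
layer = gpd.GeoDataFrame(
    {"cover": ["dense", "dense", "sparse"]},
    geometry=[box(0, 0, 50, 50), box(0, 50, 50, 100), box(50, 0, 200, 100)],
    crs="EPSG:32610",
)
strata = strata_from_polygons(layer, extent, stratum_field="cover")
print(strata[["stratum", "area_m2"]])
```

The output carries the same stratum and area_m2 columns as the raster path, so it can feed the allocation step unchanged.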

Step 3 — Allocate the plot budget

Allocation converts stratum areas (and optional variances) into integer plot counts that sum exactly to the budget. The largest-remainder method distributes rounding drift fairly instead of dumping it on the last stratum.

def allocate_plots(
    strata: gpd.GeoDataFrame,
    total_plots: int,
    variances: np.ndarray | None = None,
) -> np.ndarray:
    """Proportional (variances=None) or Neyman allocation with largest-remainder rounding."""
    area = strata["area_m2"].to_numpy(dtype=float)
    weight = area / area.sum()

    if variances is None:
        share = weight                                  # proportional
    else:
        std = np.sqrt(np.asarray(variances, dtype=float))
        share = (weight * std) / (weight * std).sum()   # Neyman / optimal

    raw = share * total_plots
    floor = np.floor(raw).astype(int)
    remainder = total_plots - floor.sum()
    # Hand the leftover plots to the strata with the largest fractional parts.
    # A stable sort breaks ties by stratum order, keeping the result deterministic.
    order = np.argsort(-(raw - floor), kind="stable")
    floor[order[:remainder]] += 1
    return floor
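
A short plain-Python sketch shows why naive rounding is not enough. Three equal strata are a deliberately awkward, made-up case: rounding each share independently can undershoot or overshoot the budget, while largest-remainder rounding always returns the exact total. The helper below mirrors the logic of allocate_plots without NumPy and is only an illustration.

```python
import math

def largest_remainder(shares, total):
    """Integer counts that sum exactly to total, given fractional shares."""
    raw = [s * total for s in shares]
    counts = [math.floor(r) for r in raw]
    leftover = total - sum(counts)
    # sorted() is stable, so tied remainders go to the earlier stratum.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

thirds = [1 / 3, 1 / 3, 1 / 3]  # hypothetical equal-area strata

naive_7 = [round(s * 7) for s in thirds]  # each 2.33 rounds down
naive_8 = [round(s * 8) for s in thirds]  # each 2.67 rounds up
exact_7 = largest_remainder(thirds, 7)
exact_8 = largest_remainder(thirds, 8)

print(sum(naive_7), sum(naive_8))  # totals drift away from 7 and 8
print(exact_7, exact_8)            # totals match the budget
```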

Step 4 — Sample constrained coordinates

Rejection sampling within each stratum’s bounding box guarantees points land inside both the stratum and the buffered frame. A minimum inter-plot spacing prevents pseudo-replication where two plots share a tree neighborhood.

from shapely.geometry import Point
from shapely.strtree import STRtree

def sample_points(
    strata: gpd.GeoDataFrame,
    extent: gpd.GeoDataFrame,
    allocation: np.ndarray,
    min_spacing_m: float = 0.0,
    seed: int = 42,
    max_tries_factor: int = 400,
) -> gpd.GeoDataFrame:
    """Rejection-sample plot centers per stratum under a minimum-spacing rule."""
    rng = np.random.default_rng(seed)
    frame = extent.geometry.iloc[0]
    records: list[dict] = []
    accepted: list[Point] = []

    for idx, n_plots in enumerate(allocation):
        if n_plots == 0:
            continue
        geom = strata.geometry.iloc[idx]
        stratum_id = strata["stratum"].iloc[idx] if "stratum" in strata else idx
        minx, miny, maxx, maxy = geom.bounds
        placed, tries, cap = 0, 0, n_plots * max_tries_factor

        while placed < n_plots and tries < cap:
            tries += 1
            pt = Point(rng.uniform(minx, maxx), rng.uniform(miny, maxy))
            if not (geom.contains(pt) and frame.contains(pt)):
                continue
            if min_spacing_m > 0 and accepted:
                tree = STRtree(accepted)
                near = tree.query(pt.buffer(min_spacing_m))
                if any(accepted[j].distance(pt) < min_spacing_m for j in near):
                    continue
            accepted.append(pt)
            records.append({"plot_id": len(records) + 1,
                            "stratum": stratum_id, "geometry": pt})
            placed += 1

        if placed < n_plots:
            raise RuntimeError(
                f"Stratum {stratum_id}: placed {placed}/{n_plots} — "
                "min_spacing_m too large or stratum too small."
            )

    return gpd.GeoDataFrame(records, crs=extent.crs)

Composed, the four steps form the complete design:

extent  = prepare_boundary(boundary, "EPSG:32610", edge_buffer_m=30.0)
strata  = strata_from_raster("canopy_class.tif", extent)
alloc   = allocate_plots(strata, total_plots=120)          # proportional
plots   = sample_points(strata, extent, alloc, min_spacing_m=80.0, seed=7)

Export and Field-Deployment Configuration

Field hardware is opinionated about formats and CRS. Reproject to the GPS datum only at export — keep all geometry math in the projected CRS — and write a small, explicit configuration alongside the points so a crew lead can reproduce the run.

# sampling_design.yaml — provenance for one design run
project: mixed_conifer_carbon_2026
target_crs: "EPSG:32610"     # WGS 84 / UTM zone 10N — area & distance math
export_crs: "EPSG:4326"      # WGS84 — handheld GPS expectation
edge_buffer_m: 30.0
min_spacing_m: 80.0
total_plots: 120
allocation: proportional      # or: neyman
seed: 7
exports:
  - { driver: GPX,   path: plots.gpx,   waypoint_field: plot_id }
  - { driver: GPKG,  path: plots.gpkg }
  - { driver: CSV,   path: plots.csv,   lon_lat: true }

def export_design(plots: gpd.GeoDataFrame, export_crs: str = "EPSG:4326") -> gpd.GeoDataFrame:
    """Reproject to the GPS datum and write field-ready files."""
    wgs84 = plots.to_crs(export_crs)
    wgs84.to_file("plots.gpkg", driver="GPKG")

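    # GPX waypoints are labelled by a text 'name' field; map plot_id onto it
    # and cast to str so the integer ids are written as labels.
    gpx = wgs84.rename(columns={"plot_id": "name"})[["name", "geometry"]]
    gpx = gpx.assign(name=gpx["name"].astype(str))
    gpx.to_file("plots.gpx", driver="GPX")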

    out = wgs84.copy()
    out["lon"] = out.geometry.x
    out["lat"] = out.geometry.y
    out.drop(columns="geometry").to_csv("plots.csv", index=False)
    return wgs84

Document horizontal accuracy alongside the coordinates: a sub-meter design exported to a 3 m-accuracy recreational GPS can shift a plot across a microhabitat boundary, so the field tolerance — not the design precision — governs how plot centers are interpreted on the ground.

Validation and Verification

A sampling design is only defensible if you can prove its plots satisfy the constraints. Run these assertions before any crew mobilizes.

def verify_design(
    plots: gpd.GeoDataFrame,
    extent: gpd.GeoDataFrame,
    allocation: np.ndarray,
    min_spacing_m: float,
) -> None:
    # 1. Count matches the budget exactly.
    assert len(plots) == int(allocation.sum()), "Plot count != allocation."

    # 2. Every plot lies inside the buffered frame.
    frame = extent.geometry.iloc[0]
    assert plots.geometry.within(frame).all(), "Plot outside sampling frame."

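    # 3. Per-stratum counts match the allocation. sample_points emits plots in
    #    strata row order and skips zero-allocation strata, so the order in
    #    which stratum labels first appear lines up with the nonzero
    #    allocations (assumes one row per stratum label).
    counts = plots.groupby("stratum", sort=False).size().to_numpy()
    expected = np.asarray(allocation)[np.asarray(allocation) > 0]
    assert np.array_equal(counts, expected), "Per-stratum count != allocation."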

    # 4. Minimum spacing is honored.
    if min_spacing_m > 0:
        xy = np.column_stack([plots.geometry.x, plots.geometry.y])
        d = np.hypot(*(xy[:, None, :] - xy[None, :, :]).transpose(2, 0, 1))
        np.fill_diagonal(d, np.inf)
        assert d.min() >= min_spacing_m - 1e-6, "Spacing constraint violated."

For a quick spatial sanity check, plot the strata, the buffered frame, and the points together — clustering, gaps, or points hugging the edge are immediately visible and usually trace back to a too-large buffer or an under-populated stratum.

Derived stand metrics inherit the design’s spatial correctness. Once crews return measurements, standard quantities follow:

Basal area per hectare: sum of individual tree basal areas $\pi (DBH/2)^{2}$ scaled by the inverse plot area in hectares.
Stand Density Index: $SDI = N (QMD / 25.4)^{1.605}$, with $N$ stems per hectare and $QMD$ the quadratic mean diameter in centimetres; the 25.4 cm reference diameter follows the Reineke formulation used by most North American inventory programs.

Aligning the plot design with a recognized national standard (e.g. the USDA FIA fixed-radius/subplot layout) keeps these outputs interoperable across regional monitoring networks.
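
As a hedged illustration of the two formulas, the sketch below computes both metrics for a single plot. The 0.04 ha plot size and the tree list are assumptions chosen so the arithmetic is easy to check by hand; DBH is taken in centimetres and converted to metres for basal area.

```python
import math

def basal_area_per_ha(dbh_cm, plot_area_ha):
    """Basal area in m^2/ha from diameters at breast height in centimetres."""
    # Radius in metres is DBH_cm / 2 / 100.
    ba_m2 = sum(math.pi * (d / 200.0) ** 2 for d in dbh_cm)
    return ba_m2 / plot_area_ha

def quadratic_mean_diameter(dbh_cm):
    """QMD in centimetres: square root of the mean squared diameter."""
    return math.sqrt(sum(d ** 2 for d in dbh_cm) / len(dbh_cm))

def stand_density_index(dbh_cm, plot_area_ha):
    """SDI = N * (QMD / 25.4) ** 1.605 with N in stems per hectare."""
    n_per_ha = len(dbh_cm) / plot_area_ha
    return n_per_ha * (quadratic_mean_diameter(dbh_cm) / 25.4) ** 1.605

# Hypothetical plot: 16 trees, all 25.4 cm DBH, on an assumed 0.04 ha plot.
trees = [25.4] * 16
ba = basal_area_per_ha(trees, 0.04)
sdi = stand_density_index(trees, 0.04)
print(round(ba, 2), round(sdi, 1))
```

Because every tree sits exactly at the reference diameter, SDI collapses to the stem density of 400 per hectare, which makes the example a convenient unit check.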

Failure Modes and Gotchas

Area in degrees. Calling .area on an EPSG:4326 frame returns square degrees, so allocation weights are meaningless. Always reproject to a metric CRS first.
Edge buffer eats narrow necks. A negative buffer larger than a corridor’s half-width deletes that area from the frame; verify the buffered extent is non-empty and still covers every stratum.
Rejection sampling stalls. A small stratum combined with a large min_spacing_m can make placement impossible; the per-stratum try cap (n_plots * max_tries_factor) converts an infinite loop into an explicit, actionable error.
Rounding drift. Naively rounding n * W_h rarely sums to n; largest-remainder allocation keeps the total exact and the bias minimal.
Strata not aligned to the frame. If stratum polygons extend beyond the buffered extent, area weights over-count unsampleable land — clip strata to the frame before allocating.
Non-deterministic runs. Any unseeded numpy or random call makes the design impossible to reproduce for audit; thread a single seed through every stochastic step.
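
One way to thread a single seed through several stochastic steps, sketched here as an alternative to the single sequential generator used in sample_points, is to spawn one child stream per stratum from a NumPy SeedSequence. The helper name stratum_rngs is hypothetical. Each stratum then draws from its own reproducible stream, so changing the allocation of one stratum does not shift the random draws of another.

```python
import numpy as np

def stratum_rngs(seed, n_strata):
    """One independent, reproducible Generator per stratum from a single seed."""
    children = np.random.SeedSequence(seed).spawn(n_strata)
    return [np.random.default_rng(child) for child in children]

# Two runs from the same seed yield identical per-stratum draws.
run_a = [rng.uniform(0.0, 1.0, size=2) for rng in stratum_rngs(7, 3)]
run_b = [rng.uniform(0.0, 1.0, size=2) for rng in stratum_rngs(7, 3)]

same_across_runs = all(np.array_equal(a, b) for a, b in zip(run_a, run_b))
distinct_streams = not np.array_equal(run_a[0], run_a[1])
print(same_across_runs, distinct_streams)
```

The same pattern extends to tiled or parallel sampling: spawn one child per tile and pass it to the worker instead of sharing a generator across processes.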

Performance and Scale

For landscape-scale designs (hundreds of thousands of hectares, thousands of plots) the bottleneck is the point-in-polygon test inside rejection sampling.

Spatial index the strata. Build an STRtree over stratum geometries and query candidate points in bulk rather than testing one polygon at a time.
Vectorize containment. Shapely 2.0’s contains accepts arrays — generate a block of candidate points per stratum and test them in one vectorized call, keeping the survivors.
Tile large frames. Partition the boundary into a tile grid, allocate per tile-stratum intersection, and sample tiles in parallel with multiprocessing or dask.bag; concatenate the resulting GeoDataFrames.
Cache the polygonized strata. Raster polygonization is the slowest step — write the result to GeoPackage once and reload it across allocation experiments instead of re-polygonizing.
Prefer GeoPackage over Shapefile for intermediate and final outputs: no 10-character field-name truncation, no 2 GB limit, and a single portable file.
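
The vectorized-containment idea can be sketched as follows, using Shapely 2.0's contains_xy to test a whole block of candidates at once. The function name sample_block, the block size, and the L-shaped test stratum are assumptions for illustration. This sketch does not enforce minimum spacing; that filter would still need to run on the surviving points.

```python
import numpy as np
import shapely
from shapely.geometry import Polygon

def sample_block(geom, n_points, rng, block_size=1024, max_blocks=1000):
    """Draw uniform points inside geom by testing candidate blocks in bulk."""
    minx, miny, maxx, maxy = geom.bounds
    kept, n_kept = [], 0
    for _ in range(max_blocks):
        x = rng.uniform(minx, maxx, block_size)
        y = rng.uniform(miny, maxy, block_size)
        inside = shapely.contains_xy(geom, x, y)  # one vectorized call
        kept.append(np.column_stack([x[inside], y[inside]]))
        n_kept += int(inside.sum())
        if n_kept >= n_points:
            return np.vstack(kept)[:n_points]
    raise RuntimeError("Could not place enough points; check the geometry.")

# Hypothetical L-shaped stratum covering 64% of its bounding box.
stratum = Polygon([(0, 0), (100, 0), (100, 40), (40, 40), (40, 100), (0, 100)])
pts = sample_block(stratum, 500, np.random.default_rng(7))
print(pts.shape)
```

Candidates are independent and uniform over the bounding box, so truncating the survivors to the first n_points keeps the sample uniform within the stratum.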

Frequently Asked Questions

How many plots do I actually need per stratum?

Solve the stratified variance equation for the precision you require: for a target standard error, $n_{h}$ scales with $W_{h} S_{h}$ under Neyman allocation. In practice, never drop below 2 plots per stratum — a single plot yields no within-stratum variance estimate and breaks error propagation. If the budget forces a stratum below 2, merge it with an ecologically similar neighbor.
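
Under Neyman allocation the stratified variance simplifies to $(\sum_{h} W_{h} S_{h})^{2} / n$, so the budget for a target standard error follows directly. The sketch below applies that relation and flags strata whose share falls under the two-plot floor. The weights, standard deviations, and target are invented, the function names are hypothetical, and no finite-population correction is applied.

```python
import math

def plots_for_target_se(weights, stds, target_se):
    """Smallest n for which Neyman allocation reaches target_se (no fpc)."""
    ws_total = sum(w * s for w, s in zip(weights, stds))
    return math.ceil(ws_total ** 2 / target_se ** 2)

def strata_below_minimum(weights, stds, n, min_per_stratum=2):
    """Indices of strata whose fractional Neyman share is under the floor."""
    ws = [w * s for w, s in zip(weights, stds)]
    shares = [n * x / sum(ws) for x in ws]
    return [i for i, k in enumerate(shares) if k < min_per_stratum]

# Hypothetical strata; the third is a small riparian fringe.
weights = [0.60, 0.38, 0.02]
stds = [10.0, 20.0, 15.0]

n_required = plots_for_target_se(weights, stds, target_se=2.0)
too_thin = strata_below_minimum(weights, stds, n_required)
print(n_required, too_thin)
```

Any stratum index returned by strata_below_minimum is a candidate for merging with an ecologically similar neighbor before the allocation is finalized.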

Proportional or Neyman allocation — which should I choose?

Use Neyman when you have a credible per-stratum variance estimate (a pilot cruise, a prior inventory, or a strong covariate like canopy height) and the variances differ substantially between strata. Use proportional allocation when variance estimates are weak or unavailable — it still beats simple random sampling and never allocates a misleadingly small count to a high-variance stratum on bad information.

Why a negative (inward) buffer instead of sampling the whole polygon?

Plots near a boundary are hard to access, may straddle a different ownership or treatment, and suffer ecological edge effects that bias stand metrics. An inward buffer of one or two plot radii keeps every plot center surrounded by the population it is meant to represent.

Can I enforce a minimum distance between plots without gridding the landscape?

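Yes. The rejection sampler accepts a candidate only if it is at least min_spacing_m from every prior plot, giving you spacing control while preserving randomness. A regular grid trades that randomness for guaranteed spacing and is a separate, systematic design. Spacing-constrained random sampling keeps inclusion probabilities close to uniform within a stratum as long as min_spacing_m is small relative to the stratum's extent; a large spacing in a small stratum pushes plots toward a quasi-regular pattern and weakens that approximation.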

How do I make the design reproducible for an audit?

Thread a single integer seed through every stochastic function, pin library versions, and serialize the run configuration (CRS, buffer, spacing, allocation rule, seed) to YAML next to the outputs. Re-running with the same inputs, seed, and pinned library versions reproduces identical coordinates.

My handheld GPS shows plots a few metres off — is the design wrong?

Almost certainly not. The design is exact in the projected CRS; the offset is the GPS unit’s horizontal accuracy plus datum transformation at export. Record the unit’s stated accuracy in the field protocol and treat the plot center as a target within that tolerance rather than a surveyed monument.

Coordinate Reference Systems for Forestry — choose and validate the projected CRS your design depends on.
Raster–Vector Overlay Techniques — extract covariates at plot locations and clip strata to the frame.
Vegetation Index Calculation in Python — derive NDVI/EVI strata that feed the allocation step.
Stratified random sampling for forest plots — the per-stratum allocation mechanics in depth.
