Stratified Random Sampling for Forest Plots with GeoPandas and Neyman Allocation

Stratified random sampling for forest plots distributes a fixed number of inventory plots across heterogeneous strata — canopy cover classes, elevation bands, or soil moisture gradients — while preserving statistical representativeness and operational feasibility. This page covers the narrow engineering task: turning validated strata polygons into a reproducible set of plot centres in Python, with explicit area-weighted or variance-weighted allocation and automated spatial validation that catches silent failures before field crews mobilise. It is a specific implementation within Spatial Plot Sampling Design, which in turn sits inside the broader Ecological GIS Data Foundations in Python workflow. If your strata boundaries are still unprojected or use a geographic CRS, resolve that first via Coordinate Reference Systems for Forestry — every allocation calculation below assumes metre-based areas.

When to Use Stratified Random Sampling

Stratified designs outperform simple random or systematic placement whenever the target variable (basal area, canopy height, biomass) varies more between strata than within them. The cost is added pipeline complexity: you must define strata, compute their areas, and allocate plots before generating coordinates. Use the decision table below to confirm the approach fits your inventory.

Design	Best when	Variance behaviour	Pipeline cost
Simple random	Landscape is homogeneous; no clear gradients	High variance on heterogeneous stands	Low
Systematic grid	Continuous coverage mapping; spatial autocorrelation modelling	Periodic bias risk if pattern aligns with terrain	Low
Stratified random (proportional)	Distinct ecological classes; want representativeness by area	Lower than simple random; unbiased per stratum	Medium
Stratified random (Neyman)	Strata differ in internal variability; fixed plot budget	Minimised total variance for a given `N`	Medium–high

Within a stratified design, the allocation rule is the key choice. Proportional allocation scales plot counts to stratum area and is the safe default when per-stratum variance is unknown. Neyman allocation assigns more plots to strata that are both large and internally variable, minimising the variance of the overall estimate for a fixed total plot count N. Reach for Neyman when you have a prior variance proxy — NDVI variance or a LiDAR canopy height model standard deviation per stratum — and a tight field budget.

The diagram below shows how the same budget of N = 30 plots is redistributed across three strata when the rule changes. Proportional allocation tracks area alone; Neyman pulls plots toward the high-variance stratum even though its area is smaller, because that is where additional samples cut the most variance.
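
As a numeric sketch of that redistribution, the snippet below allocates N = 30 plots across three strata under both rules and compares the resulting variance of the stratified mean. The areas and standard deviations are invented for illustration, the helper names are hypothetical, and the variance formula ignores the finite population correction.

```python
from math import floor

def largest_remainder(total_n, weights):
    """Split total_n in proportion to weights, rounding by largest remainder."""
    total_w = sum(weights)
    raw = [total_n * w / total_w for w in weights]
    counts = [floor(r) for r in raw]
    leftover = total_n - sum(counts)
    # Give leftover plots to the strata with the largest fractional parts
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

def stratified_variance(areas, std_devs, counts):
    """Variance of the stratified mean: sum of W_h^2 * S_h^2 / n_h (no fpc)."""
    total_area = sum(areas)
    return sum((a / total_area) ** 2 * s ** 2 / n
               for a, s, n in zip(areas, std_devs, counts))

# Hypothetical strata: areas in hectares, std devs of the target variable
areas = [500.0, 300.0, 200.0]
std_devs = [2.0, 4.0, 10.0]

proportional = largest_remainder(30, areas)
neyman = largest_remainder(30, [a * s for a, s in zip(areas, std_devs)])

print(proportional)  # [15, 9, 6]
print(neyman)        # [7, 9, 14]
print(stratified_variance(areas, std_devs, proportional))  # about 0.893
print(stratified_variance(areas, std_devs, neyman))        # about 0.589
```

With these made-up inputs the smallest stratum receives the most plots under Neyman because its standard deviation dominates the weight, and the estimated variance drops by roughly a third for the same budget.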

Strata Preparation and CRS Enforcement

Every downstream area calculation assumes a metre-based, equal-area projection. Computing areas while the strata are still in a geographic CRS (e.g. EPSG:4326) emits a UserWarning that the geometry is in a geographic CRS and returns values in squared degrees, silently corrupting allocation weights. Project to an equal-area system first: EPSG:6933 (WGS 84 / NSIDC EASE-Grid 2.0 Global, a cylindrical equal-area projection) for continental extents, or NAD83 / Conus Albers (EPSG:5070) for studies in the conterminous United States. Validate topology immediately afterwards, since raster-to-vector conversion and legacy digitisation routinely emit self-intersections.

import geopandas as gpd
import numpy as np
from shapely.validation import make_valid

def prepare_strata(strata_gdf: gpd.GeoDataFrame, target_epsg: int = 6933) -> gpd.GeoDataFrame:
    """
    Enforce equal-area projection, clean topology, and validate strata boundaries.
    """
    if strata_gdf.crs is None:
        raise ValueError("Input GeoDataFrame must have an assigned CRS.")

    # Explicit projection to equal-area system
    strata_gdf = strata_gdf.to_crs(epsg=target_epsg)

    # Repair invalid geometries, then zero-buffer to keep only the polygonal parts
    strata_gdf["geometry"] = strata_gdf.geometry.apply(make_valid)
    strata_gdf["geometry"] = strata_gdf.geometry.buffer(0)

    # Remove degenerate geometries (zero area after clipping/cleaning)
    strata_gdf = strata_gdf[strata_gdf.geometry.area > 0.0].copy()
    strata_gdf["strata_area_m2"] = strata_gdf.geometry.area

    if strata_gdf.empty:
        raise RuntimeError("All strata geometries collapsed to zero area after projection/cleaning.")

    return strata_gdf

Applying make_valid followed by a zero-width buffer(0) resolves self-intersections and ring-orientation issues without shifting real boundaries. For authoritative guidance on projection handling, consult the GeoPandas Coordinate Reference Systems documentation.
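
The effect of that repair sequence can be seen on a single geometry. This is a minimal sketch using Shapely directly, with a hypothetical self-intersecting "bowtie" polygon standing in for a badly digitised stratum; an unrepaired ring like this can report a misleading area.

```python
from shapely.geometry import Polygon
from shapely.validation import make_valid

# Hypothetical "bowtie" polygon whose ring crosses itself at (1, 1)
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
print(bowtie.is_valid)  # False

# Repair first, then zero-width buffer to keep only polygonal parts
repaired = make_valid(bowtie).buffer(0)
print(repaired.is_valid)
print(repaired.area)
```

The repaired geometry covers the two triangles on either side of the crossing point, so its area reflects the ground actually enclosed rather than whatever the broken ring happened to report.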

Minimal Reproducible Example

The allocation core is the Neyman formula. For stratum $h$, the plot count is

$$n_h = N \cdot \frac{A_h S_h}{\sum_i A_i S_i}$$

where $N$ is the total plot budget, $A_{h}$ is stratum area, and $S_{h}$ is the standard deviation of the target variable. Setting every $S_{h}$ equal collapses this to proportional allocation. The function below adds the two guards real inventories need: a floor of one plot per stratum, and a remainder loop so the counts sum exactly to N.
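
The collapse to proportional allocation can be checked by substituting a common standard deviation $S$ for every $S_h$. This short derivation uses only the symbols defined above:

$$n_h = N \cdot \frac{A_h S}{\sum_i A_i S} = N \cdot \frac{A_h S}{S \sum_i A_i} = N \cdot \frac{A_h}{\sum_i A_i}$$

The common factor $S$ cancels, so each stratum's share depends on its area alone.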

def allocate_plots_neyman(
    total_n: int,
    areas: np.ndarray,
    std_devs: np.ndarray,
) -> np.ndarray:
    """
    Compute Neyman allocation with floor constraints and remainder distribution.
    Pass a constant std_devs array to fall back to proportional allocation.
    """
    areas = np.asarray(areas, dtype=float)
    std_devs = np.asarray(std_devs, dtype=float)

    if total_n <= 0:
        raise ValueError("Total plot count must be positive.")
    if areas.shape != std_devs.shape:
        raise ValueError("areas and std_devs must have the same length.")
    if np.any(areas <= 0):
        raise ValueError("All strata areas must be strictly positive.")
    if np.any(std_devs < 0):
        raise ValueError("Standard deviations must be non-negative.")
    if total_n < len(areas):
        raise ValueError("total_n must be at least the number of strata to give each one plot.")

    # Weighted allocation numerator (A_h * S_h)
    weights = areas * std_devs
    total_weight = np.sum(weights)

    if total_weight == 0:
        raise ValueError("Sum of weighted strata is zero. Check standard deviation inputs.")

    raw_alloc = (total_n * weights) / total_weight

    # Floor constraint: guarantee at least one plot in every stratum
    alloc = np.maximum(np.floor(raw_alloc).astype(int), 1)

    # Reconcile the sum back to total_n, one plot at a time
    remainder = total_n - int(alloc.sum())
    while remainder > 0:
        # Add a plot where the count falls furthest below the raw allocation
        alloc[np.argmax(raw_alloc - alloc)] += 1
        remainder -= 1
    while remainder < 0:
        # Floor-minimum overshot: trim the most over-allocated stratum that
        # still holds more than one plot, so the floor is never violated
        surplus = np.where(alloc > 1, alloc - raw_alloc, -np.inf)
        alloc[np.argmax(surplus)] -= 1
        remainder += 1

    return alloc

With allocations in hand, generate plot centres by rejection sampling inside an interior buffer of each polygon. The buffer keeps a circular plot of radius plot_radius entirely within its stratum, preventing partial plots that straddle a class boundary.

from shapely.geometry import Point

def generate_stratified_points(
    strata_gdf: gpd.GeoDataFrame,
    allocations: np.ndarray,
    plot_radius: float,
    seed: int = 42,
) -> gpd.GeoDataFrame:
    """
    Generate random plot centres within strata, enforcing edge buffers.
    `allocations[i]` is positional and must align with the row order of strata_gdf.
    """
    if len(allocations) != len(strata_gdf):
        raise ValueError("allocations length must match the number of strata rows.")

    rng = np.random.default_rng(seed)
    points_data = []

    for i, (label, row) in enumerate(strata_gdf.iterrows()):
        n_plots = int(allocations[i])
        if n_plots <= 0:
            continue
        geom = row.geometry

        # Interior buffer prevents plots from crossing the stratum edge
        valid_geom = geom.buffer(-plot_radius)
        if valid_geom.is_empty or valid_geom.area == 0:
            raise ValueError(f"Stratum {label} too small for plot radius {plot_radius}m.")

        minx, miny, maxx, maxy = valid_geom.bounds
        generated, attempts = 0, 0
        max_attempts = n_plots * 500

        while generated < n_plots and attempts < max_attempts:
            pt = Point(rng.uniform(minx, maxx), rng.uniform(miny, maxy))
            attempts += 1
            if valid_geom.contains(pt):
                points_data.append({
                    "strata_id": label,
                    "plot_radius_m": plot_radius,
                    "geometry": pt,
                })
                generated += 1

        if generated < n_plots:
            raise RuntimeError(
                f"Failed to generate {n_plots} points in stratum {label} "
                f"after {max_attempts} attempts."
            )

    return gpd.GeoDataFrame(points_data, crs=strata_gdf.crs)
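
The interior-buffer step used in the function above can be checked in isolation. The sketch below uses Shapely only, with two hypothetical rectangular strata and a 7.32 m plot radius, to show what the inward buffer leaves behind for a compact polygon and for a sliver.

```python
from shapely.geometry import box

plot_radius = 7.32  # metres; hypothetical fixed-radius plot

# A 100 m x 100 m stratum keeps a usable core after the inward buffer
square = box(0, 0, 100, 100)
core = square.buffer(-plot_radius)
print(core.is_empty)  # False
print(core.area)      # area of the region where plot centres may fall

# A 10 m wide sliver is narrower than one plot diameter, so nothing remains
sliver = box(0, 0, 1000, 10)
print(sliver.buffer(-plot_radius).is_empty)  # True
```

Running a check like this over every stratum before allocation shows which classes cannot host a full plot, before the sampling loop raises on them.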

Seeding np.random.default_rng makes the coordinate set reproducible across re-runs and field campaigns — essential for auditability. See the NumPy Random Generator documentation for the generator API. If operational constraints require a minimum inter-plot distance, thin the result with a scipy.spatial.KDTree query after generation.
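
The reproducibility claim is easy to verify. This sketch draws candidate coordinates inside a hypothetical 1 km bounding box (the helper name and extent are illustrative only) and confirms that the same seed reproduces the same draws.

```python
import numpy as np

def draw_candidates(seed, n=5):
    """Draw n candidate (x, y) pairs inside a hypothetical 1 km bounding box."""
    rng = np.random.default_rng(seed)
    return np.column_stack([rng.uniform(0, 1000, n), rng.uniform(0, 1000, n)])

first = draw_candidates(seed=42)
second = draw_candidates(seed=42)
other = draw_candidates(seed=7)

print(np.array_equal(first, second))  # True
print(np.array_equal(first, other))   # False
```

Recording the NumPy version alongside the seed is a cautious habit, since Generator streams are not guaranteed to stay identical across NumPy releases.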

Parameter Reference

Parameter	Type	Default	Recommended range	Ecological rationale
`target_epsg`	`int`	`6933`	`5070` (CONUS) or another regional equal-area CRS	Must be equal-area so stratum areas, and therefore allocation weights, are unbiased. UTM zones are conformal, not equal-area.
`total_n`	`int`	—	30–60 minimum, scaled to extent	Sets the field budget; too few plots inflate per-stratum variance, and a budget below the number of strata cannot satisfy the one-plot floor.
`std_devs` ( $S_{h}$ )	`np.ndarray`	constant ⇒ proportional	NDVI or CHM std per stratum	Drives Neyman weighting toward internally variable strata; use a remote-sensing proxy when no prior inventory exists.
`plot_radius`	`float` (metres)	—	7.32 m (1/24-acre) to 12.62 m (1/20-ha)	Defines the interior buffer; match it to your fixed-radius plot protocol (e.g. USDA FIA subplot is 7.32 m).
`seed`	`int`	`42`	any fixed integer	Fixing the seed makes the coordinate set reproducible and auditable.
`max_attempts`	derived (`n_plots * 500`)	—	raise for sliver strata	Caps rejection sampling; long, thin strata need more attempts before the cap trips.
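
One way to size `max_attempts` is from the acceptance rate of rejection sampling, which is roughly the buffered stratum's area divided by its bounding-box area. The helper below is a hypothetical sketch, not part of the pipeline above, and the safety factor of 20 is an arbitrary assumption.

```python
import math

def suggest_max_attempts(n_plots, geom_area, bounds, safety=20.0):
    """
    Estimate an attempt cap for rejection sampling (hypothetical helper).
    geom_area: area of the interior-buffered stratum in square metres
    bounds: (minx, miny, maxx, maxy) of that geometry
    """
    minx, miny, maxx, maxy = bounds
    bbox_area = (maxx - minx) * (maxy - miny)
    if geom_area <= 0 or bbox_area <= 0:
        raise ValueError("Geometry and bounding box must have positive area.")
    usable = min(geom_area, bbox_area)
    # Expected attempts = plots wanted / acceptance rate
    expected = n_plots * bbox_area / usable
    return math.ceil(expected * safety)

# Compact stratum filling 80% of its bounding box
print(suggest_max_attempts(10, 8000.0, (0, 0, 100, 100)))    # 250
# Diagonal sliver filling 0.5% of its bounding box
print(suggest_max_attempts(10, 5000.0, (0, 0, 1000, 1000)))  # 40000
```

In the second case the expected number of attempts is about 2,000 for ten plots, so the default cap of `n_plots * 500` (5,000 here) leaves only a modest margin.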

Expected Output and Verification

A correct run returns a GeoDataFrame of Point geometries whose row count equals total_n, carrying the source strata_id for each plot and sharing the strata CRS. Treat the four assertions below as a gate: run them before exporting coordinates to KML or GeoPackage for field collection.

def validate_sampling_pipeline(
    strata_gdf: gpd.GeoDataFrame,
    plots_gdf: gpd.GeoDataFrame,
    expected_total: int,
) -> dict:
    """Run automated spatial and statistical validation checks."""
    report = {"status": "PASS", "errors": []}

    # 1. CRS alignment between layers
    if strata_gdf.crs != plots_gdf.crs:
        report["status"] = "FAIL"
        report["errors"].append("CRS mismatch between strata and plots.")

    # 2. Every plot falls inside a valid stratum
    contained = plots_gdf.geometry.apply(lambda p: strata_gdf.geometry.contains(p).any())
    if not contained.all():
        report["status"] = "FAIL"
        report["errors"].append(f"{(~contained).sum()} plots fall outside valid strata boundaries.")

    # 3. Allocation sum equals the requested budget
    if len(plots_gdf) != expected_total:
        report["status"] = "FAIL"
        report["errors"].append(f"Plot count {len(plots_gdf)} != expected {expected_total}.")

    # 4. No invalid plot geometries
    invalid_mask = ~plots_gdf.geometry.is_valid
    if invalid_mask.any():
        report["status"] = "FAIL"
        report["errors"].append(f"{invalid_mask.sum()} invalid plot geometries detected.")

    return report

A passing report ({"status": "PASS", "errors": []}) confirms CRS consistency, 100% plot containment, an exact allocation sum, and clean geometry. If the containment check fails because of broken strata topology, inspect the strata is_valid flags and re-run prepare_strata; buffer(0) is a polygon repair and should not be applied to the plot points themselves. See the Shapely Geometry Validation manual for the full predicate set.
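
To make the report act as a hard gate rather than a log entry, the export step can refuse to run on anything but a clean pass. A minimal sketch, assuming a report dictionary shaped like the one returned by validate_sampling_pipeline; `enforce_gate` is a hypothetical helper name.

```python
def enforce_gate(report):
    """Raise if the validation report is not a clean PASS (hypothetical helper)."""
    if report.get("status") != "PASS" or report.get("errors"):
        details = "; ".join(report.get("errors", [])) or "no details recorded"
        raise RuntimeError(f"Sampling validation failed: {details}")
    return True

# Reports shaped like the output of validate_sampling_pipeline
passing = {"status": "PASS", "errors": []}
failing = {"status": "FAIL", "errors": ["Plot count 29 != expected 30."]}

print(enforce_gate(passing))  # True
try:
    enforce_gate(failing)
except RuntimeError as exc:
    print(exc)  # Sampling validation failed: Plot count 29 != expected 30.
```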

Common Pitfalls

Geographic CRS leaks into the area term. If prepare_strata is skipped, geometry.area returns squared degrees and every allocation weight is wrong. Assert not strata_gdf.crs.is_geographic before allocating.
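Floor constraint overshoots the budget. With many small strata, forcing one plot each can push the sum above total_n. The reconciliation step has to trim the excess from strata that hold more than one plot, and it cannot succeed when total_n is smaller than the number of strata. Never silently drop it, and check that the returned counts sum to total_n.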
Index labels vs. positional alignment. allocations is consumed positionally; a non-default or shuffled GeoDataFrame index will misassign counts unless you iterate by position (as generate_stratified_points does) or reset_index(drop=True) first.
Plot radius larger than a sliver stratum. A negative buffer(-plot_radius) can empty long, thin strata entirely, raising ValueError. Either merge slivers upstream or relax the radius for those classes.
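
The first pitfall can be turned into a pre-allocation guard. The sketch below relies only on the `crs.is_geographic` attribute mentioned above; `assert_projected` is a hypothetical helper, and the stand-in objects exist purely so the example runs without GeoPandas installed.

```python
from types import SimpleNamespace

def assert_projected(gdf):
    """Fail fast if the layer has no CRS or a geographic one (hypothetical helper)."""
    crs = gdf.crs
    if crs is None:
        raise ValueError("Layer has no CRS assigned.")
    if crs.is_geographic:
        raise ValueError("Layer is in a geographic CRS; project to an equal-area CRS first.")
    return True

# Stand-ins that mimic only the attributes used above
projected_layer = SimpleNamespace(crs=SimpleNamespace(is_geographic=False))
geographic_layer = SimpleNamespace(crs=SimpleNamespace(is_geographic=True))

print(assert_projected(projected_layer))  # True
try:
    assert_projected(geographic_layer)
except ValueError as exc:
    print(exc)
```

This guard only rules out degrees as the unit; it does not confirm that the projected CRS is equal-area, so the choice of EPSG code still needs a manual check.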

Frequently Asked Questions

When should I use proportional allocation instead of Neyman?

Use proportional allocation when you have no reliable per-stratum variance estimate. Neyman only beats proportional when the std_devs you supply genuinely reflect within-stratum heterogeneity; feeding it noisy or guessed values can over-concentrate plots in the wrong strata. Pass a constant std_devs array to allocate_plots_neyman to recover proportional behaviour.

What is a good variance proxy when no prior inventory exists?

Extract a per-stratum standard deviation from a remote-sensing layer that correlates with the target variable: NDVI variance for canopy density, or a LiDAR canopy height model standard deviation for structural variability. Compute the proxy over the same extent and grid as the strata so each value aligns spatially with its polygon; Vegetation Index Calculation in Python covers deriving the NDVI layer itself.
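
As a rough sketch of that extraction, assume the strata have already been rasterised to an integer label grid that shares the proxy raster's shape and alignment (that rasterisation step is not shown here). The function name and the tiny arrays are illustrative only.

```python
import numpy as np

def per_stratum_std(values, labels):
    """
    Sample standard deviation of raster values per stratum label.
    values: 2-D float array (e.g. CHM heights or NDVI), NaN marks nodata
    labels: 2-D int array of stratum IDs on the same grid
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    if values.shape != labels.shape:
        raise ValueError("values and labels must share the same grid shape.")
    valid = ~np.isnan(values)
    result = {}
    for stratum in np.unique(labels[valid]):
        cells = values[valid & (labels == stratum)]
        # A sample standard deviation needs at least two cells
        result[int(stratum)] = float(np.std(cells, ddof=1)) if cells.size > 1 else 0.0
    return result

# Hypothetical canopy heights (m) and stratum IDs on a 2 x 4 grid
heights = np.array([[10.0, 12.0, 30.0, 50.0],
                    [11.0, 13.0, np.nan, 40.0]])
strata_ids = np.array([[1, 1, 2, 2],
                       [1, 1, 2, 2]])

print(per_stratum_std(heights, strata_ids))
```

The resulting values, ordered to match the strata rows, can be passed as `std_devs` to allocate_plots_neyman.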

How do I enforce a minimum distance between plots?

Rejection sampling guarantees containment but not spacing. After generation, build a scipy.spatial.KDTree from the point coordinates, call query_pairs with your minimum distance, and drop or regenerate one member of each returned pair. Re-run validate_sampling_pipeline afterwards, since thinning changes the plot count and can leave a stratum below its allocation.
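
A minimal sketch of the thinning step, assuming projected coordinates in metres. `thin_by_distance` is a hypothetical helper, and the greedy rule of dropping the later point of each pair is one simple choice among several.

```python
import numpy as np
from scipy.spatial import KDTree

def thin_by_distance(coords, min_dist):
    """
    Greedy thinning: drop the later point of each too-close pair.
    coords: (n, 2) array-like of projected x, y in metres
    Returns the indices of the points to keep, in ascending order.
    """
    coords = np.asarray(coords, dtype=float)
    tree = KDTree(coords)
    dropped = set()
    # query_pairs returns index pairs (i, j), i < j, at most min_dist apart
    for i, j in sorted(tree.query_pairs(r=min_dist)):
        if i not in dropped and j not in dropped:
            dropped.add(j)
    return [k for k in range(len(coords)) if k not in dropped]

# Hypothetical plot centres: two pairs sit 5 m apart
coords = [(0, 0), (5, 0), (100, 0), (103, 4), (300, 300)]
keep = thin_by_distance(coords, min_dist=20.0)
print(keep)  # [0, 2, 4]
```

Dropped plots reduce the total below the allocation, so replacements should be regenerated within the same stratum before re-running the validation.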

Why EPSG:6933 rather than UTM?

EPSG:6933 is a global equal-area projection, convenient for continental or multi-zone inventories where a single UTM zone cannot cover the extent without distortion. For a study confined to one region, NAD83 / Conus Albers (EPSG:5070) or another regional equal-area projection gives lower local distortion. A local UTM zone is conformal rather than equal-area; its area distortion inside a single zone is small, but confirm it is acceptable for your extent before using it to derive allocation weights.

How do I export the plots for field crews?

Write the validated GeoDataFrame with plots_gdf.to_file("plots.gpkg", driver="GPKG") for QGIS, or reproject to EPSG:4326 and export KML for handheld GPS units. Document the horizontal accuracy and CRS in the file metadata, since sub-metre shifts can move a plot across a microhabitat boundary.

Spatial Plot Sampling Design — parent workflow covering validation, allocation, deployment, and metric derivation.
Coordinate Reference Systems for Forestry — equal-area projection selection that underpins every allocation weight here.
Raster-Vector Overlay Techniques — extract environmental covariates at generated plot locations.
Vegetation Index Calculation in Python — derive the NDVI variance proxy used for Neyman weighting.
Ecological GIS Data Foundations in Python — the full data-foundations workflow this page belongs to.
