Forest Gap & Understory Analysis: A Python GIS Pipeline for Ecological Monitoring

Quantifying canopy discontinuities and the light regimes beneath them is one of the hardest problems in structural forestry: a stand can read as fully closed in aerial imagery while harbouring dozens of small openings that drive regeneration, fuel accumulation, and wildlife movement. This workflow isolates those gap boundaries, computes fragmentation metrics, and estimates photosynthetically active radiation (PAR) reaching the forest floor — entirely from airborne LiDAR derivatives, with a reproducible, code-driven pipeline rather than manual digitization. It sits within the broader Canopy Height Modeling & Terrain Extraction framework, which ensures vertical vegetation structure is properly decoupled from topographic relief before any ecological metric is calculated. The intended reader is a forester, conservation analyst, or Python GIS developer who needs gap statistics that survive peer review and align with field plots.

Figure: end-to-end flow. The normalized point cloud is gridded into aligned DTM and CHM rasters (with the parent CHM pipeline feeding the rasterization); the CHM then drives gap detection, vectorization, and the Beer–Lambert PAR estimate. In the diagram, dashed arrows mark the DTM alignment reference and per-gap context feeds.

Prerequisites checklist

Confirm the following before running the pipeline. Each item is a hard dependency for ecologically valid gap statistics.

Python 3.11+ with pdal 2.6+, rasterio 1.3+, geopandas 0.14+, scipy 1.11+, scikit-image 0.22+, and numpy 1.26+ installed (use conda/mamba for the PDAL binary stack).
A classified, height-normalized point cloud — produced by LiDAR Point Cloud Preprocessing, with ground returns tagged Class 2 and a HeightAboveGround dimension present.
A Canopy Height Model and Digital Terrain Model on an identical grid, generated via Canopy Height Model Creation and Digital Terrain Model Generation.
A metric projected CRS (UTM or State Plane) on every raster — geographic coordinates (EPSG:4326) distort structuring-element radii and invalidate area thresholds.
Pixel resolution between 0.5 m and 2.0 m, chosen so the smallest ecologically meaningful gap (10–50 m² in temperate and boreal systems) spans at least 20–30 pixels.
A site-specific gap definition agreed with the field team: the height threshold (commonly 2–5 m) and minimum area below which an opening is not counted.
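
The resolution guideline in the checklist can be checked with simple arithmetic before any data is processed. The sketch below is illustrative only: the helper names `pixels_per_gap` and `coarsest_resolution` are hypothetical, and the 20-pixel floor is taken from the lower end of the range quoted above.

```python
def pixels_per_gap(min_gap_area_m2: float, resolution_m: float) -> float:
    """Number of raster cells that cover a gap of the given area."""
    return min_gap_area_m2 / (resolution_m ** 2)


def coarsest_resolution(min_gap_area_m2: float, min_pixels: int = 20) -> float:
    """Largest cell size (m) at which the gap still spans `min_pixels` cells."""
    return (min_gap_area_m2 / min_pixels) ** 0.5


# How many cells does a 10 m² gap occupy at common resolutions?
for res in (0.5, 1.0, 2.0):
    print(f"{res} m cells: {pixels_per_gap(10.0, res):.1f} pixels in a 10 m² gap")

print(f"Coarsest cell size for 20 pixels: {coarsest_resolution(10.0):.2f} m")
```

Under these assumptions a 10 m² gap only meets the 20-pixel floor at cell sizes of roughly 0.7 m or finer, so the coarse end of the 0.5–2.0 m range is appropriate only when the minimum gap area is correspondingly larger.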

Concept background: what counts as a gap

A canopy gap is an opening where vegetation height falls below a species- or site-specific threshold, but a naive CHM < threshold mask conflates true openings with sensor noise, crown shadows, and isolated low pixels. The pipeline therefore treats gap detection as a structural problem on the CHM surface and a spatial-statistics problem on the resulting polygons.

Two quantities anchor the analysis. Fragmentation is summarised per gap by the perimeter-to-area ratio, where a compact circular gap minimises perimeter for a given area:

PA = \frac{P}{A}, \qquad \mathrm{shape\_index} = \frac{P}{2\sqrt{\pi A}}

Here $P$ is the gap perimeter and $A$ its area; the denominator $2\sqrt{\pi A}$ is the circumference of a circle with the same area. A shape_index near 1 therefore indicates a near-circular opening; higher values signal elongated or convoluted edges that expose more interior forest to edge microclimate. Understory light is then approximated with the Beer–Lambert attenuation law, which relates transmitted PAR to leaf area index $L$ along the beam path:

PAR_{understory} = PAR_{above} e^{- k L}

where $k$ is the canopy extinction coefficient (typically 0.4–0.6 for broadleaf canopies) and $L$ is estimated from local canopy cover. Within a gap, $L \to 0$ and transmittance approaches 1; at the gap edge, low solar angles drive partial shading that this point model does not capture, which is why edge validation against hemispherical photography matters.
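
As a worked illustration of the attenuation law, assume $k = 0.5$, $L = 3$, and an above-canopy PAR of 1500 µmol m⁻² s⁻¹. All three values are illustrative assumptions, not measurements from any site:

```latex
\begin{aligned}
\frac{PAR_{understory}}{PAR_{above}} &= e^{-kL} = e^{-0.5 \times 3} = e^{-1.5} \approx 0.223 \\
PAR_{understory} &\approx 0.223 \times 1500 \approx 335\ \mu\text{mol m}^{-2}\,\text{s}^{-1}
\end{aligned}
```

Under these assumptions roughly a fifth of the incoming PAR reaches the forest floor beneath closed canopy, against a transmittance near 1 inside a gap.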

Step-by-step Python pipeline

The pipeline runs in four numbered stages: rasterize the normalized cloud, threshold and clean the CHM into a gap mask, vectorize and measure each gap, then estimate understory PAR. Each step is independently runnable and writes a georeferenced intermediate so failures are easy to localise.

Step 1 — Rasterize the normalized cloud into an aligned CHM and DTM

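The normalized point cloud is gridded into a DTM (the geometric reference plane) and a CHM (the vertical canopy profile). Write both rasters in a single PDAL pass at the same resolution, with output_type="min" on Z for the terrain surface and output_type="max" on HeightAboveGround for the canopy surface, so the two share an exact grid and can be combined without resampling. The per-cell minimum of Z only approximates bare earth where a cell contains at least one ground return; the dedicated Digital Terrain Model Generation workflow is more robust and avoids artificial terracing on ridgelines that would otherwise leak into the CHM.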

import json
import subprocess

def rasterize_chm_dtm(las_path: str, dtm_path: str, chm_path: str, resolution: float = 1.0) -> None:
    """Grid a height-normalized LAS/LAZ into aligned DTM (min) and CHM (max) rasters."""
    pipeline = {
        "pipeline": [
            las_path,
            {
                "type": "writers.gdal",
                "filename": dtm_path,
                "dimension": "Z",
                "output_type": "min",
                "resolution": resolution,
                "nodata": -9999,
            },
            {
                "type": "writers.gdal",
                "filename": chm_path,
                "dimension": "HeightAboveGround",
                "output_type": "max",
                "resolution": resolution,
                "nodata": -9999,
            },
        ]
    }
    subprocess.run(["pdal", "pipeline", "--stdin"], input=json.dumps(pipeline),
                   text=True, check=True)

Step 2 — Threshold and clean the CHM into a gap mask

Thresholding alone leaves salt-and-pepper noise and narrow canopy bridges. Morphological opening removes isolated sub-threshold pixels; closing bridges spurious one-pixel canopy slivers that do not represent real crowns. The dedicated walkthrough on identifying canopy gaps using morphological filters covers the structuring-element calibration in depth.

Why two morphological passes are needed: opening (erosion then dilation) strips isolated speckle pixels without shrinking the true gap, while closing (dilation then erosion) bridges the one-pixel canopy sliver so the split opening dissolves into a single connected polygon.

import numpy as np
import rasterio
from scipy import ndimage
from skimage.morphology import disk

def chm_to_gap_mask(chm_path: str, mask_path: str,
                    height_thresh: float = 3.0, struct_radius_m: float = 4.0) -> None:
    """Threshold a CHM and clean it morphologically into a binary gap raster."""
    with rasterio.open(chm_path) as src:
        chm = src.read(1).astype(np.float32)
        meta = src.meta.copy()
        res = src.res[0]

    selem = disk(max(1, int(round(struct_radius_m / res))))
    below = (chm >= 0) & (chm < height_thresh)          # valid sub-canopy pixels
    opened = ndimage.binary_opening(below, structure=selem)   # drop speckle
    cleaned = ndimage.binary_closing(opened, structure=selem)  # bridge slivers

    meta.update(dtype="uint8", count=1, nodata=0)
    with rasterio.open(mask_path, "w", **meta) as dst:
        dst.write(cleaned.astype(np.uint8), 1)
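
The effect of the two passes can be seen on a synthetic mask. This is a minimal sketch, not part of the pipeline: the 20 × 20 array, the 8 × 8 gap, and the 3 × 3 square structuring element (used here instead of `disk` to keep the arithmetic easy to follow) are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import ndimage

# Synthetic sub-threshold mask: one 8x8 gap split by a one-pixel canopy sliver,
# plus a single isolated speckle pixel.
below = np.zeros((20, 20), dtype=bool)
below[5:13, 5:13] = True   # the true gap
below[5:13, 9] = False     # one-pixel-wide canopy sliver through the gap
below[1, 1] = True         # isolated speckle

selem = np.ones((3, 3), dtype=bool)  # illustrative structuring element

opened = ndimage.binary_opening(below, structure=selem)    # removes the speckle
cleaned = ndimage.binary_closing(opened, structure=selem)  # bridges the sliver

_, n_before = ndimage.label(below)
_, n_after = ndimage.label(cleaned)
print(f"components before: {n_before}, after: {n_after}")
print(f"gap pixels after cleaning: {int(cleaned.sum())}")
```

Because opening runs first, any gap fragment narrower than the structuring element is removed before closing has a chance to bridge it, which is one reason the radius needs calibrating against the smallest gap of interest.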

Step 3 — Vectorize gaps and compute fragmentation metrics

Connected-component vectorization assigns each opening a polygon, after which geopandas yields per-gap area, perimeter, the perimeter-to-area ratio, and a normalized shape index. These statistics drive corridor-viability modeling and silvicultural prioritisation.

import numpy as np
import geopandas as gpd
import rasterio
from rasterio.features import shapes
from shapely.geometry import shape

def vectorize_gaps(mask_path: str, min_area_m2: float = 15.0) -> gpd.GeoDataFrame:
    """Convert a binary gap raster to a GeoDataFrame with area, perimeter and shape metrics."""
    with rasterio.open(mask_path) as src:
        data = src.read(1)
        transform = src.transform
        crs = src.crs

    geoms = [
        {"geometry": shape(geom), "value": val}
        for geom, val in shapes(data, mask=(data == 1), transform=transform)
        if val == 1
    ]
    gdf = gpd.GeoDataFrame(geoms, crs=crs)
    gdf["area_m2"] = gdf.geometry.area
    gdf["perimeter_m"] = gdf.geometry.length
    gdf = gdf[gdf["area_m2"] >= min_area_m2].copy()
    gdf["pa_ratio"] = gdf["perimeter_m"] / gdf["area_m2"]
    gdf["shape_index"] = gdf["perimeter_m"] / (2.0 * np.sqrt(np.pi * gdf["area_m2"]))
    return gdf.reset_index(drop=True)
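
As a sanity check on the shape index, the same formula can be evaluated for idealised shapes whose perimeter and area are known analytically. The sketch below uses only the standard library; the `shape_index` helper and the three test shapes are illustrative assumptions.

```python
import math


def shape_index(perimeter_m: float, area_m2: float) -> float:
    """Perimeter divided by the circumference of an equal-area circle."""
    return perimeter_m / (2.0 * math.sqrt(math.pi * area_m2))


r = 5.0
circle = shape_index(2 * math.pi * r, math.pi * r ** 2)   # ideal circular gap
square = shape_index(4 * 10.0, 10.0 ** 2)                 # 10 m x 10 m gap
strip = shape_index(2 * (50.0 + 2.0), 50.0 * 2.0)         # 50 m x 2 m skid trail

print(f"circle: {circle:.3f}, square: {square:.3f}, strip: {strip:.3f}")
```

Polygons vectorized from a raster have stair-stepped edges, so their perimeter is inflated relative to the smooth outline; a rasterised disc tends towards a shape index of about 4/π ≈ 1.27 rather than 1. Comparisons between gaps are therefore most meaningful at a fixed resolution.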

Step 4 — Estimate understory PAR from canopy cover

The final stage maps the Beer–Lambert relationship onto the CHM-derived canopy cover to produce a per-pixel transmittance surface. Solar geometry can be layered on with pvlib to split direct-beam and diffuse irradiance, which matters for gap-edge microclimate.

import numpy as np
import rasterio

def estimate_understory_par(chm_path: str, par_path: str,
                            cover_height: float = 3.0, k: float = 0.5,
                            lai_max: float = 5.0) -> None:
    """Approximate fractional PAR transmittance with the Beer-Lambert law."""
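    with rasterio.open(chm_path) as src:
        chm = src.read(1).astype(np.float32)
        meta = src.meta.copy()
        nodata = src.nodata

    # Cells without a valid height (nodata sentinel or NaN) must stay nodata in the
    # output; otherwise clipping turns them into LAI 0 and transmittance 1.
    invalid = ~np.isfinite(chm)
    if nodata is not None:
        invalid |= chm == nodata

    # Proxy LAI from normalized canopy height; gaps -> ~0 LAI -> transmittance ~1.
    cover_frac = np.clip(chm / cover_height, 0.0, 1.0)
    lai = cover_frac * lai_max
    transmittance = np.exp(-k * lai)
    transmittance[invalid] = -9999.0

    meta.update(dtype="float32", count=1, nodata=-9999)
    with rasterio.open(par_path, "w", **meta) as dst:
        dst.write(transmittance.astype(np.float32), 1)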
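
To see how the height-based LAI proxy behaves, the same three lines can be run on a handful of heights. The four sample heights are illustrative assumptions; the parameter values are the function defaults.

```python
import numpy as np

# Sample canopy heights in metres: bare gap, low shrub, threshold height, tall tree.
chm = np.array([0.0, 1.5, 3.0, 20.0], dtype=np.float32)
cover_height, k, lai_max = 3.0, 0.5, 5.0

cover_frac = np.clip(chm / cover_height, 0.0, 1.0)  # saturates at cover_height
lai = cover_frac * lai_max
transmittance = np.exp(-k * lai)

for h, t in zip(chm, transmittance):
    print(f"height {h:5.1f} m: transmittance {t:.3f}")
```

The proxy saturates at `cover_height`, so a 3 m shrub layer and a 20 m tree receive the same transmittance. That is a deliberate simplification of this approach rather than a physical result, and it is one reason the output is better read as a relative surface until it has been calibrated.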

PDAL configuration reference

When the rasterization step runs as part of a tiled batch, drive it from a standalone PDAL JSON pipeline so the parameters are version-controlled alongside the outputs. The configuration below grids the canopy surface only; pair it with a matching writers.gdal block that uses "output_type": "min" on Z for the DTM.

{
  "pipeline": [
    "normalized_tile.laz",
    {
      "type": "filters.range",
      "limits": "Classification![7:7]"
    },
    {
      "type": "writers.gdal",
      "filename": "chm_tile.tif",
      "dimension": "HeightAboveGround",
      "output_type": "max",
      "resolution": 1.0,
      "gdaldriver": "GTiff",
      "nodata": -9999,
      "data_type": "float32",
      "window_size": 3
    }
  ]
}

Key parameters: filters.range with Classification![7:7] drops low-noise points (class 7) before gridding; "output_type": "max" captures the tallest return per cell for the canopy surface; "window_size": 3 enables inverse-distance void filling across small data gaps so empty cells do not register as false canopy openings; and "data_type": "float32" preserves sub-metre height precision.
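
For a tiled batch, one way to keep the DTM and CHM on the same grid is to generate both writer blocks from shared settings. The sketch below only builds the JSON text and does not call PDAL. The helper name `build_tile_pipeline` and the file names are hypothetical, and the explicit `bounds` string reflects my understanding of the writers.gdal option of that name, so check it against the documentation for your PDAL version before relying on it.

```python
import json


def build_tile_pipeline(tile_path: str, dtm_path: str, chm_path: str,
                        bounds: tuple, resolution: float = 1.0) -> str:
    """Return PDAL pipeline JSON that writes a DTM and CHM with shared grid settings."""
    minx, miny, maxx, maxy = bounds
    shared = {
        "type": "writers.gdal",
        "resolution": resolution,
        # Assumed writers.gdal option: pins both rasters to the same extent.
        "bounds": f"([{minx}, {maxx}], [{miny}, {maxy}])",
        "gdaldriver": "GTiff",
        "nodata": -9999,
        "data_type": "float32",
        "window_size": 3,
    }
    pipeline = {
        "pipeline": [
            tile_path,
            {"type": "filters.range", "limits": "Classification![7:7]"},
            {**shared, "filename": dtm_path, "dimension": "Z", "output_type": "min"},
            {**shared, "filename": chm_path, "dimension": "HeightAboveGround",
             "output_type": "max"},
        ]
    }
    return json.dumps(pipeline, indent=2)


print(build_tile_pipeline("normalized_tile.laz", "dtm_tile.tif", "chm_tile.tif",
                          bounds=(500000, 4100000, 501000, 4101000)))
```

Generating both blocks from one dictionary means a change to resolution or extent cannot be applied to one raster and forgotten on the other.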

Validation & verification

Gap outputs are only trustworthy once they pass alignment, distribution, and field-agreement checks. Run these assertions before any downstream ecological model consumes the layers.

import numpy as np
import rasterio

def validate_alignment(chm_path: str, dtm_path: str) -> None:
    """Fail fast if the CHM and DTM are not on an identical, projected grid."""
    with rasterio.open(chm_path) as a, rasterio.open(dtm_path) as b:
        assert a.crs == b.crs, "CRS mismatch between CHM and DTM"
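        assert a.crs is not None and a.crs.is_projected, "Rasters must be in a metric projected CRS"
        assert a.transform == b.transform, "Grid origin/resolution differ"
        assert a.shape == b.shape, "Raster dimensions differ"
        chm = a.read(1)
        nodata = a.nodata  # read while the dataset is still open
    # Drop nodata cells before computing statistics (skip if no nodata value is set).
    valid = chm[chm != nodata] if nodata is not None else chm
    # Canopy heights should be physically plausible for the biome.
    assert np.nanpercentile(valid, 99) < 80.0, "Implausible canopy heights (>80 m)"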

Beyond the code checks: confirm CHM height distributions track field-measured tree heights, verify that detected gaps coincide with known disturbance events (wind-throw, harvest blocks), and calibrate the extinction coefficient $k$ against hemispherical photography or quantum-sensor transects so the PAR surface reflects local phenology.
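
Calibrating $k$ amounts to inverting the Beer–Lambert relation on paired field observations. The sketch below fits $\ln T = -kL$ by least squares through the origin using only the standard library. The function name is hypothetical and the observations are synthetic values generated from $k = 0.5$, so this shows the arithmetic rather than a real calibration.

```python
import math


def fit_extinction_coefficient(lai, transmittance):
    """Least-squares k for ln(T) = -k * L, with the fit forced through the origin."""
    num = sum(L * -math.log(T) for L, T in zip(lai, transmittance))
    den = sum(L * L for L in lai)
    return num / den


# Synthetic paired observations (assumed): LAI per plot and measured transmittance.
lai_obs = [1.0, 2.0, 4.0]
t_obs = [math.exp(-0.5 * L) for L in lai_obs]

k_hat = fit_extinction_coefficient(lai_obs, t_obs)
print(f"fitted k = {k_hat:.3f}")
```

With real hemispherical-photo or quantum-sensor data the residuals will not be zero, and inspecting them by season is a simple way to see whether a single $k$ is adequate across phenological stages.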

Failure modes & gotchas

CRS mismatch or geographic coordinates: structuring-element radii and area thresholds are meaningless in degrees; reproject every input to a metric CRS and assert crs.is_projected before processing.
NaN / nodata propagation through morphology: scipy.ndimage has no concept of nodata, and a sentinel such as -9999 (or a void that was zero-filled before thresholding) falls below any height threshold. Exclude nodata cells explicitly when building the boolean mask, as the chm >= 0 test in Step 2 does, or false gaps bloom across void areas.
Void-fill artifacts read as gaps: aggressive interpolation during DTM/CHM generation can flatten real openings or invent canopy; verify with window_size tuning and inspect filled regions against raw return density.
Threshold too high for the biome: because the mask keeps every pixel below the threshold, a value tuned for closed forest floods open woodland and savanna with spurious gaps; lower it and tighten the minimum-area filter for sparse canopies.
Memory overflow on regional mosaics: loading a county-scale CHM into a single array exhausts RAM; use windowed reads or dask arrays with an overlap halo of at least 4 × struct_radius_m / resolution pixels, since opening followed by closing applies four elementary passes that each reach one structuring-element radius.
Tile-edge boundary artifacts: gaps split across tile seams are double-counted or truncated; process with overlapping windows and dissolve duplicate polygons after the merge.
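
The nodata pitfall in the list above is easy to reproduce on a toy array: zero-filling the sentinel turns voids into sub-threshold pixels, while excluding it keeps them out of the mask. The 2 × 2 array and the -9999 sentinel are illustrative assumptions.

```python
import numpy as np

NODATA = -9999.0
chm = np.array([[NODATA, 1.0],
                [10.0, 25.0]], dtype=np.float32)
height_thresh = 3.0

# Unsafe: zero-fill negatives, then threshold. The void cell becomes a "gap".
zero_filled = np.where(chm < 0, 0.0, chm)
naive = zero_filled < height_thresh

# Safer: exclude nodata explicitly, then threshold only the valid cells.
valid = chm != NODATA
safe = valid & (chm < height_thresh)

print(f"naive gap pixels: {int(naive.sum())}, nodata-aware gap pixels: {int(safe.sum())}")
```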

Performance & scale notes

For landscape-scale surveys, never materialise the full raster. Process in standard 1 km² tiles with rasterio windowed reads or dask.array blocks, carrying an overlap halo so morphological neighbourhoods and connected components are never truncated at a tile boundary. Vectorize per tile, then merge with pandas.concat and dissolve on a shared gap id to stitch openings that straddle seams. Parallelise across tiles with concurrent.futures.ProcessPoolExecutor; gap detection is embarrassingly parallel once tiling and overlap are fixed. Cache the intermediate gap mask as a compressed GeoTIFF so the expensive morphological pass is not repeated when only the PAR parameters change.
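
The tiling scheme can be sketched without any raster library. The helpers below are hypothetical: `halo_pixels` assumes the opening-then-closing sequence from Step 2 (four elementary passes, each reaching one structuring-element radius), and `tile_windows` yields pixel windows padded by that halo and clipped to the raster extent.

```python
import math


def halo_pixels(struct_radius_m: float, resolution_m: float) -> int:
    """Halo width in pixels for opening followed by closing (4 elementary passes)."""
    return 4 * math.ceil(struct_radius_m / resolution_m)


def tile_windows(height: int, width: int, tile: int, halo: int):
    """Yield (row_off, col_off, n_rows, n_cols) windows with a halo clipped to the raster."""
    for r0 in range(0, height, tile):
        for c0 in range(0, width, tile):
            top = max(r0 - halo, 0)
            left = max(c0 - halo, 0)
            bottom = min(r0 + tile + halo, height)
            right = min(c0 + tile + halo, width)
            yield top, left, bottom - top, right - left


halo = halo_pixels(struct_radius_m=4.0, resolution_m=1.0)
windows = list(tile_windows(height=2000, width=2000, tile=1000, halo=halo))
for w in windows:
    print(w)
```

Each tuple maps onto `rasterio.windows.Window(col_off, row_off, width, height)`; note the column-first argument order. The halo only guarantees that the morphological result matches a whole-raster run, so gaps larger than the halo still need stitching after vectorization.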

Frequently Asked Questions

What height threshold should I use to define a canopy gap?

There is no universal value: 2–3 m suits closed temperate and boreal stands, while open woodland and savanna need 1.0–1.5 m paired with a stricter minimum-area filter to exclude understory. Agree the threshold with your field team and document it in the output metadata so results are comparable across acquisitions.

How do I keep the CHM and DTM perfectly aligned?

Grid both from the same normalized point cloud in one PDAL pass at identical resolution, using output_type="min" for the DTM and output_type="max" for the CHM. The validate_alignment assertion above catches any CRS, transform, or shape drift before it corrupts the gap mask.

Why are tiny one-pixel gaps appearing all over my mask?

Those are speckle from sensor noise and crown shadow. Morphological opening with a structuring element sized to the dominant crown radius removes them, and the minimum-area filter in the vectorization step discards anything below your ecological gap definition.

Can I estimate understory light without field calibration?

The Beer–Lambert step gives a relative transmittance surface that is useful for ranking openings, but the absolute PAR values depend on the extinction coefficient $k$ and leaf phenology. Calibrate $k$ against hemispherical photography or quantum-sensor transects before treating the output as quantitative.

How do I scale this to a whole forest district?

Tile the survey into 1 km² blocks with an overlap halo, run detection per tile in parallel, and dissolve duplicate polygons after merging. This keeps memory bounded and makes the morphological pass embarrassingly parallel across cores or nodes.

Do gaps need to be vectorized, or can I keep them as rasters?

Keep the binary raster for fast area summaries, but vectorize when you need per-gap geometry — perimeter, shape index, nearest-neighbour distances, or intersection with habitat and disturbance layers. The vectorize_gaps function produces a topologically clean GeoDataFrame ready for those joins.
