Canopy Height Model Creation: A Python Workflow for Forestry & Ecological Applications

A Canopy Height Model (CHM) is the raster that turns a LiDAR survey into a measurement of how tall the forest actually stands above the ground beneath it. Picture a 2,000-hectare mixed-conifer inventory where a carbon project demands per-stand aboveground biomass with documented uncertainty: every height estimate, every gap delineation, and every allometric input traces back to a CHM that was built correctly. Generating one is deceptively simple in principle — subtract terrain elevation from surface elevation — but in practice it is where coordinate-system errors, interpolation overshoot, and silent NaN propagation quietly corrupt downstream ecology. This page implements a verified Python pipeline that takes a Digital Surface Model and a Digital Terrain Model to a clean, height-above-ground raster, and sits within the broader Canopy Height Modeling & Terrain Extraction workflow that governs cross-project consistency and regulatory reporting.

Prerequisites Checklist

Confirm each item before running the pipeline — most CHM defects are traceable to a skipped check here:

Python 3.10+ with rasterio >= 1.3, numpy >= 1.24, and scipy >= 1.10 installed (GDAL backing rasterio must be 3.4+)
A Digital Surface Model (DSM) — first-return or highest-hit raster covering the area of interest
A validated Digital Terrain Model (DTM) from classified ground returns (see Digital Terrain Model Generation)
Both rasters share a metric, projected CRS (UTM or State Plane) — never geographic degrees
Vertical datums match (e.g. both orthometric NAVD88, or both ellipsoidal) — a mixed datum injects a constant height bias
Input point clouds have already passed LiDAR Point Cloud Preprocessing (outlier removal, ground classification)
A target cell size chosen for the ecological scale (0.5 m for crown-level work, 1 m for stand metrics)

Concept Background: Height Above Ground as Raster Difference

The CHM is defined pixel-wise as the difference between the surface and terrain elevations on a shared grid:

$$H_{canopy}(x, y) = \max\left(0,\ Z_{DSM}(x, y) - Z_{DTM}(x, y)\right)$$

The $\max(0, \cdot)$ clamp matters: bilinear interpolation of the terrain surface can locally exceed the surface model near sharp breaks (cliff edges, building footprints that survived filtering), producing physically impossible negative heights. Those negatives, if left in, bias every zonal statistic computed downstream.
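
As a small numerical illustration of why the clamp matters, the sketch below compares a zonal mean with and without the zero floor. The four elevations are made-up values chosen for the example, not survey data.

```python
import numpy as np

# Hypothetical elevations in metres for four cells; two cells show
# interpolation overshoot where the terrain sits slightly above the surface.
z_dsm = np.array([112.0, 99.6, 108.0, 99.4])
z_dtm = np.array([100.0, 100.0, 100.0, 100.0])

raw = z_dsm - z_dtm              # contains physically impossible negatives
clamped = np.maximum(0.0, raw)   # H_canopy = max(0, Z_DSM - Z_DTM)

raw_mean = float(raw.mean())
clamped_mean = float(clamped.mean())
print(f"mean without clamp: {raw_mean:.2f} m, with clamp: {clamped_mean:.2f} m")
```

Even two slightly negative cells pull the unclamped mean below the clamped one, which is the kind of bias a per-stand statistic would inherit.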

The operation is only valid when both operands describe the same location at the same resolution. Two rasters that “look aligned” in a GIS viewer frequently differ in affine transform, origin offset, or CRS — and NumPy will happily subtract two arrays of equal shape regardless of whether their pixels correspond to the same ground coordinate. The pipeline below therefore resamples both surfaces onto one explicitly constructed target grid before any arithmetic, so geometric correspondence is guaranteed rather than assumed.
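
To see how two equal-shape grids can disagree about location, the following sketch computes the world coordinate of the same pixel index on two north-up grids. The helper name pixel_center and the origin values are hypothetical, and the grids are assumed to have no rotation.

```python
def pixel_center(origin_x, origin_y, cell_size, row, col):
    """World coordinate of a pixel centre on a north-up grid (no rotation)."""
    x = origin_x + (col + 0.5) * cell_size
    y = origin_y - (row + 0.5) * cell_size
    return x, y

# Hypothetical grids: same shape and cell size, origins differ by half a metre.
dsm_grid = {"origin_x": 500000.0, "origin_y": 4200000.0, "cell_size": 1.0}
dtm_grid = {"origin_x": 500000.5, "origin_y": 4200000.0, "cell_size": 1.0}

dsm_xy = pixel_center(**dsm_grid, row=0, col=0)
dtm_xy = pixel_center(**dtm_grid, row=0, col=0)
offset_m = dtm_xy[0] - dsm_xy[0]
print(f"pixel (0, 0) centres differ by {offset_m:.1f} m in x")
```

An array subtraction would pair these two pixels without complaint, even though they sit half a cell apart on the ground.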

Step-by-Step Python Pipeline

The following steps build a CHM end to end. Each block is runnable in isolation given the inputs from the previous step.

Step 1 — Establish a common target grid from the DTM

Treat the DTM’s extent and CRS as authoritative, and derive a target affine transform at the desired cell size. This single grid becomes the destination for both rasters, which is what makes the later subtraction shape- and geometry-safe.

import rasterio
from rasterio.warp import calculate_default_transform

def build_target_grid(dtm_path: str, cell_size: float = 0.5):
    """Return (crs, transform, width, height) for a common output grid."""
    with rasterio.open(dtm_path) as dtm_src:
        target_crs = dtm_src.crs
        transform, width, height = calculate_default_transform(
            dtm_src.crs, target_crs,
            dtm_src.width, dtm_src.height,
            *dtm_src.bounds, resolution=(cell_size, cell_size),
        )
    return target_crs, transform, width, height
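
To make the grid arithmetic concrete, this sketch derives the dimensions and origin of a north-up grid from bounds and cell size by hand. The helper name grid_from_bounds and the bounds are hypothetical, and calculate_default_transform may round the extent slightly differently, so treat it as an illustration of the idea rather than a replacement for Step 1.

```python
import math

def grid_from_bounds(left, bottom, right, top, cell_size):
    """Return (width, height, origin) of a north-up grid covering the bounds."""
    width = math.ceil((right - left) / cell_size)
    height = math.ceil((top - bottom) / cell_size)
    # The origin is the upper-left corner; rows advance southward from it.
    return width, height, (left, top)

# Hypothetical 250 m x 400 m extent in a projected CRS, gridded at 0.5 m.
width, height, origin = grid_from_bounds(
    500000.0, 4199000.0, 500250.0, 4199400.0, cell_size=0.5
)
print(width, height, origin)
```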

Step 2 — Reproject both surfaces onto the common grid

Reproject the DSM and DTM onto the identical destination grid. Using Resampling.bilinear smooths interpolation artifacts on continuous elevation surfaces; reserve nearest-neighbour for categorical rasters only.

import numpy as np
import rasterio
from rasterio.enums import Resampling
from rasterio.warp import reproject

def resample_to_grid(src_path, target_crs, transform, width, height):
    """Reproject a single-band raster onto the target grid, returning a float32 array."""
    out = np.full((height, width), np.nan, dtype=np.float32)
    with rasterio.open(src_path) as src:
        reproject(
            source=rasterio.band(src, 1),
            destination=out,
            src_transform=src.transform, src_crs=src.crs,
            dst_transform=transform, dst_crs=target_crs,
            # Declare the source sentinel (e.g. -9999) and fill the destination
            # with NaN, so nodata cells never enter the subtraction as numbers.
            src_nodata=src.nodata, dst_nodata=np.nan,
            resampling=Resampling.bilinear,
        )
    return out

Step 3 — Difference the surfaces with explicit NaN and negative handling

Compute the height above ground wherever both surfaces carry finite values, floor negative differences at zero, and reserve NaN for cells where either input is missing.

import numpy as np

def difference_surfaces(dsm_data, dtm_data):
    """Pixel-wise DSM - DTM with NaN-safe masking and a zero floor."""
    # A cell is valid only where both surfaces carry a finite elevation.
    valid = np.isfinite(dsm_data) & np.isfinite(dtm_data)
    # Floor negative differences (interpolation overshoot) at zero; keep NaN
    # only where an input was missing.
    chm = np.where(valid, np.clip(dsm_data - dtm_data, 0.0, None), np.nan)
    return chm.astype(np.float32)

Step 4 — Write the CHM with preserved geospatial metadata

Persist the result as a tiled, compressed GeoTIFF that carries the target CRS, transform, and a NaN nodata flag so downstream tools mask correctly.

def create_chm(dsm_path: str, dtm_path: str, chm_output: str, cell_size: float = 0.5):
    """Derive a CHM by resampling both surfaces to a shared grid and differencing."""
    target_crs, transform, width, height = build_target_grid(dtm_path, cell_size)
    dtm_data = resample_to_grid(dtm_path, target_crs, transform, width, height)
    dsm_data = resample_to_grid(dsm_path, target_crs, transform, width, height)
    chm = difference_surfaces(dsm_data, dtm_data)

    profile = {
        "driver": "GTiff", "dtype": "float32", "count": 1,
        "height": height, "width": width,
        "transform": transform, "crs": target_crs, "nodata": np.nan,
        "tiled": True, "compress": "deflate", "predictor": 3,
    }
    with rasterio.open(chm_output, "w", **profile) as dst:
        dst.write(chm, 1)
    return chm_output

Step 5 — Smooth high-frequency noise (optional but recommended)

Raw CHMs carry sensor speckle and understory clutter. A Gaussian filter keyed to expected crown size improves interpretability without merging adjacent crowns. Convert the smoothing scale from metres to pixels with sigma_px = sigma_m / cell_size. For crowns about 4 m across (2 m radius) at 0.5 m resolution, half the crown radius is 1 m, so keep sigma at or below about 2 pixels; the default of 1.5 pixels (0.75 m) sits inside that limit.

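import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_chm(chm: np.ndarray, sigma: float = 1.5, nodata: float = np.nan) -> np.ndarray:
    """Gaussian-smooth a CHM while preserving nodata regions (sigma in pixels)."""
    valid = np.isfinite(chm)
    filled = np.where(valid, chm, 0.0).astype(np.float32)
    # Normalised convolution: divide by the smoothed validity mask so the
    # zero-filled nodata cells do not drag down heights along data edges.
    weight = gaussian_filter(valid.astype(np.float32), sigma=sigma)
    smoothed = gaussian_filter(filled, sigma=sigma) / np.maximum(weight, 1e-6)
    smoothed[~valid] = nodata
    return smoothed.astype(np.float32)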

The smoothed surface is the standard input for threshold-based canopy cover calculation and for gap delineation in Forest Gap & Understory Analysis.
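
The metres-to-pixels conversion used for the smoothing scale can be wrapped in a small helper. This is a minimal sketch; the name sigma_in_pixels and the 0.75 m smoothing scale are assumptions made for the example.

```python
def sigma_in_pixels(sigma_m: float, cell_size: float) -> float:
    """Convert a Gaussian sigma from metres to pixels: sigma_px = sigma_m / cell_size."""
    if sigma_m <= 0 or cell_size <= 0:
        raise ValueError("sigma_m and cell_size must both be positive")
    return sigma_m / cell_size

# Hypothetical smoothing scale of 0.75 m on the two cell sizes used on this page.
sigma_px_fine = sigma_in_pixels(0.75, cell_size=0.5)
sigma_px_coarse = sigma_in_pixels(0.75, cell_size=1.0)
print(sigma_px_fine, sigma_px_coarse)
```

Passing the converted value as sigma keeps the physical smoothing scale constant when the cell size changes.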

PDAL Configuration: Rasterizing Surfaces Directly From Returns

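When you do not yet have DSM and DTM rasters, PDAL can produce both from a single classified point cloud in one pass. The DSM uses the maximum Z per cell (highest hit); the DTM uses the minimum Z of ground-classified returns. The input must carry absolute elevations in Z: if the tile has already been height-normalised, the max-Z raster is itself a height-above-ground surface and the DTM subtraction no longer applies. The pipeline: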

{
  "pipeline": [
    "normalized_tile.laz",
    {
      "type": "writers.gdal",
      "filename": "dsm.tif",
      "resolution": 0.5,
      "output_type": "max",
      "gdaldriver": "GTiff",
      "nodata": -9999
    },
    {
      "type": "filters.range",
      "limits": "Classification[2:2]"
    },
    {
      "type": "writers.gdal",
      "filename": "dtm.tif",
      "resolution": 0.5,
      "output_type": "min",
      "window_size": 8,
      "gdaldriver": "GTiff",
      "nodata": -9999
    }
  ]
}

Key parameters: resolution must equal the cell_size you pass to create_chm; output_type="max" captures canopy tops for the DSM while "min" over ground-only returns (isolated by filters.range on Classification 2) builds the terrain; window_size controls how far PDAL fills gaps in sparse ground coverage by inverse-distance interpolation. Keep nodata consistent so masking survives the handoff into rasterio.
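
One way to keep resolution and nodata identical across both writers is to generate the pipeline JSON from a single set of arguments. The sketch below only assembles the JSON shown above using the standard library; the function name build_surface_pipeline is hypothetical, and it does not invoke PDAL itself.

```python
import json

def build_surface_pipeline(input_path, cell_size, nodata=-9999):
    """Assemble the DSM/DTM pipeline as a dict with one shared resolution."""
    def writer(filename, output_type, **extra):
        # Both writers take resolution and nodata from the same arguments.
        return {"type": "writers.gdal", "filename": filename,
                "resolution": cell_size, "output_type": output_type,
                "gdaldriver": "GTiff", "nodata": nodata, **extra}

    return {"pipeline": [
        input_path,
        writer("dsm.tif", "max"),
        {"type": "filters.range", "limits": "Classification[2:2]"},
        writer("dtm.tif", "min", window_size=8),
    ]}

pipeline = build_surface_pipeline("normalized_tile.laz", cell_size=0.5)
pipeline_json = json.dumps(pipeline, indent=2)
print(pipeline_json)
```

The resulting text can be saved to a file and run with PDAL, and the same cell_size value can then be passed to create_chm.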

Validation & Verification

A CHM is not finished until it has been checked geometrically and ecologically.

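Geometric alignment: assert that the written CHM shares the DTM’s grid before trusting any statistic. The check applies when cell_size equals the DTM’s native resolution, since a CHM resampled to a different cell size legitimately has a different transform and shape. The function compares any two rasters, so it also works for checking a DSM against a DTM: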

def assert_aligned(chm_path: str, dtm_path: str):
    with rasterio.open(chm_path) as chm, rasterio.open(dtm_path) as dtm:
        assert chm.crs == dtm.crs, "CRS mismatch between CHM and DTM"
        assert chm.transform.almost_equals(dtm.transform), "affine transform mismatch"
        assert (chm.width, chm.height) == (dtm.width, dtm.height), "shape mismatch"
    print("CHM is aligned to the DTM grid.")

Ecological ground-truth: extract CHM values at surveyed tree locations and compare predicted against observed height, reporting RMSE and mean bias. Field plots, UAV photogrammetry, or terrestrial laser scanning all serve as reference:

def validate_against_field(chm_path: str, tree_points):
    """tree_points: iterable of (x, y, measured_height_m). Returns RMSE and mean bias."""
    with rasterio.open(chm_path) as src:
        samples = [(h, next(iter(src.sample([(x, y)])))[0]) for x, y, h in tree_points]
    obs = np.array([s[0] for s in samples], dtype=np.float64)
    pred = np.array([s[1] for s in samples], dtype=np.float64)
    mask = np.isfinite(pred)
    rmse = float(np.sqrt(np.mean((pred[mask] - obs[mask]) ** 2)))
    bias = float(np.mean(pred[mask] - obs[mask]))
    return {"rmse_m": rmse, "bias_m": bias, "n": int(mask.sum())}

A well-built CHM over closed-canopy forest typically achieves an RMSE of 0.5–1.5 m against dominant-tree field heights. A large, near-constant bias of either sign points to a vertical-datum mismatch, while a negative bias that grows with canopy density suggests the DTM has been pulled up into the canopy by misclassified vegetation returns.
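
The validation function above reports RMSE and bias only. Where a regression of predicted on observed height is also wanted, a sketch along these lines could be used. The function name height_agreement and the tree heights are hypothetical; the synthetic CHM values are constructed to read exactly 1 m low so the result is easy to interpret.

```python
import numpy as np

def height_agreement(observed, predicted):
    """Least-squares line of predicted on observed height, plus RMSE and bias."""
    obs = np.asarray(observed, dtype=np.float64)
    pred = np.asarray(predicted, dtype=np.float64)
    keep = np.isfinite(obs) & np.isfinite(pred)   # drop trees on nodata cells
    obs, pred = obs[keep], pred[keep]
    slope, intercept = np.polyfit(obs, pred, deg=1)
    residual = pred - obs
    return {
        "slope": float(slope),
        "intercept_m": float(intercept),
        "rmse_m": float(np.sqrt(np.mean(residual ** 2))),
        "bias_m": float(np.mean(residual)),
        "n": int(keep.sum()),
    }

# Hypothetical field heights (m); the CHM reads 1 m low and one tree hit nodata.
observed = [18.0, 22.5, 25.0, 30.0, 27.5]
predicted = [17.0, 21.5, 24.0, float("nan"), 26.5]
stats = height_agreement(observed, predicted)
print(stats)
```

A slope near 1 with a non-zero intercept indicates a constant offset, whereas a slope well below 1 would indicate error that grows with tree height.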

Failure Modes & Gotchas

CRS or datum mismatch: mixed vertical datums produce a uniform height offset that validation reads as bias, while geographic-degree inputs make the cell size and smoothing scale meaningless in metres. Enforce a projected CRS and identical datums up front.
Silent nodata propagation: subtracting arrays where one carries a -9999 sentinel instead of NaN yields enormous spurious heights. Normalise nodata to NaN before differencing.
Negative-height artifacts: interpolation overshoot near terrain breaks creates sub-zero pixels — the np.clip(..., 0.0, None) floor in Step 3 removes them.
Resolution drift: a DSM at 0.3 m and a DTM at 1 m will resample to mismatched detail; build both on one grid (Steps 1–2) rather than differencing native rasters.
Over-smoothing: a Gaussian sigma larger than half the mean crown radius merges adjacent crowns, flattening the very structure canopy-cover and gap metrics depend on.
Memory overflow on regional surveys: loading a multi-gigapixel mosaic into one array exhausts RAM — process by tile (see below).
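
The nodata item in the list above can be sketched as follows. This is a minimal illustration assuming a -9999 sentinel; the helper name nodata_to_nan is hypothetical and the elevations are made up.

```python
import numpy as np

def nodata_to_nan(data, nodata):
    """Return a float32 copy with the nodata sentinel replaced by NaN."""
    out = np.asarray(data, dtype=np.float32).copy()
    if nodata is not None and not np.isnan(nodata):
        out[out == np.float32(nodata)] = np.nan
    return out

# Hypothetical 2 x 2 tiles, each with one nodata cell.
dsm = nodata_to_nan([[105.0, -9999.0], [110.0, 103.0]], nodata=-9999)
dtm = nodata_to_nan([[100.0, 100.0], [-9999.0, 101.0]], nodata=-9999)

# Difference only where both inputs are finite; floor negatives at zero.
both = np.isfinite(dsm) & np.isfinite(dtm)
chm = np.where(both, np.maximum(dsm - dtm, 0.0), np.nan)
print(chm)
```

Without the conversion, the lower-left cell would evaluate to 110 - (-9999), a canopy height of more than ten kilometres.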

Performance & Scale Notes

For surveys beyond a few hundred hectares, never materialise the full mosaic in memory. Two complementary strategies keep the pipeline within budget:

Windowed processing: iterate over rasterio block windows, differencing one tile at a time and writing into a pre-allocated output dataset. Add a small overlap buffer when smoothing so the Gaussian kernel does not introduce seams at tile edges.
Tile-parallel execution: because each tile is independent, distribute them across cores with concurrent.futures.ProcessPoolExecutor or across machines with Dask; a 1 km² tiling scheme (matching common LiDAR delivery units) parallelises cleanly and bounds peak memory per worker.

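import numpy as np
import rasterio
from rasterio.windows import Window

def _read_as_nan(src, win):
    """Read band 1 for a window, converting the dataset's nodata to NaN."""
    band = src.read(1, window=win, masked=True).astype(np.float32)
    return band.filled(np.nan)

def chm_windowed(dsm_path, dtm_path, out_path, block=2048):
    """Difference DSM and DTM block-by-block (assumes the two share a grid)."""
    with rasterio.open(dtm_path) as dtm, rasterio.open(dsm_path) as dsm:
        # Explicit block sizes keep the tiled GeoTIFF valid even when the DTM
        # profile describes a striped layout.
        profile = dtm.profile | {"driver": "GTiff", "dtype": "float32",
                                 "nodata": np.nan, "tiled": True,
                                 "blockxsize": 256, "blockysize": 256,
                                 "compress": "deflate"}
        with rasterio.open(out_path, "w", **profile) as dst:
            for row in range(0, dtm.height, block):
                for col in range(0, dtm.width, block):
                    win = Window(col, row,
                                 min(block, dtm.width - col),
                                 min(block, dtm.height - row))
                    chm = difference_surfaces(_read_as_nan(dsm, win),
                                              _read_as_nan(dtm, win))
                    dst.write(chm, 1, window=win)
    return out_path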
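
The overlap buffer mentioned for windowed smoothing can be sketched as a generator that pairs each core tile with a slightly larger read window. The name buffered_windows and the raster dimensions are hypothetical; windows are plain (col_off, row_off, width, height) tuples so the sketch does not depend on rasterio.

```python
def buffered_windows(width, height, block, overlap):
    """Yield (core, read) windows as (col_off, row_off, width, height) tuples.

    Core tiles partition the raster; each read window is its core grown by
    `overlap` pixels on every side and clipped to the raster bounds.
    """
    for row in range(0, height, block):
        for col in range(0, width, block):
            core_w = min(block, width - col)
            core_h = min(block, height - row)
            r0, c0 = max(0, row - overlap), max(0, col - overlap)
            r1 = min(height, row + core_h + overlap)
            c1 = min(width, col + core_w + overlap)
            yield (col, row, core_w, core_h), (c0, r0, c1 - c0, r1 - r0)

# Hypothetical 5000 x 3000 pixel raster, 2048-pixel tiles, 8-pixel halo.
tiles = list(buffered_windows(width=5000, height=3000, block=2048, overlap=8))
print(len(tiles), tiles[0], tiles[-1])
```

Each tile is read and smoothed over its read window, then only the core portion is written. Because scipy's gaussian_filter truncates the kernel at four standard deviations by default, an overlap of about 4 × sigma pixels is a reasonable starting point.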

Frequently Asked Questions

Should I build the DSM and DTM separately or rasterize both from one point cloud?

Both are valid. If you already have quality-checked DSM and DTM rasters, feed them straight into create_chm. If you only have a classified point cloud that still carries absolute elevations, the PDAL writers.gdal pipeline above produces both surfaces in one pass. Keep resolution identical across the two writers so both surfaces start on the same cell size.

Why clamp negative canopy heights to zero instead of treating them as nodata?

Negative values inside the forest interior are interpolation noise, not missing data, so flooring them to zero preserves the cell as valid ground-level canopy. Reserve NaN for cells where the DSM or DTM genuinely had no return. If you mask negatives as nodata instead, you punch holes in otherwise-covered ground and distort canopy-cover percentages, because ground-level cells drop out of the valid-cell denominator.

What cell size should I use for a CHM?

Match it to the ecological question. Individual-tree crown delineation and precision forestry want 0.5 m or finer; stand-level metrics, biomass mapping, and most regional inventories are well served by 1 m. Going finer than the native point density supports only manufactures detail — check the point spacing from LiDAR Point Cloud Preprocessing before choosing.
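
As a rough check on whether a cell size is supported by the data, nominal point spacing can be estimated from pulse density. The relation below (spacing of about one over the square root of density) is a common rule of thumb for evenly distributed points, not something taken from this page, and the function name and the 8 points per square metre figure are assumptions for the example.

```python
import math

def nominal_point_spacing(points_per_m2: float) -> float:
    """Approximate mean spacing in metres for evenly distributed points."""
    if points_per_m2 <= 0:
        raise ValueError("density must be positive")
    return 1.0 / math.sqrt(points_per_m2)

# Hypothetical survey density of 8 points per square metre.
spacing = nominal_point_spacing(8.0)
print(f"nominal spacing: {spacing:.2f} m")
```

A cell size well below the nominal spacing leaves many cells without a return, so the raster is mostly interpolation rather than measurement.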

How do I know whether a height bias comes from the DSM or the DTM?

Run the field validation and inspect the sign and magnitude. A uniform bias across all trees, of either sign, usually means a vertical-datum offset (ellipsoidal vs orthometric) applied to one surface. A negative bias that grows with canopy density points to the DTM being lifted by misclassified vegetation returns; revisit ground classification thresholds in the terrain pipeline.

Can I skip resampling if my DSM and DTM already share a CRS?

Only if they also share the same affine transform, origin, and shape. Sharing a CRS is necessary but not sufficient: a one-pixel origin offset still misaligns the subtraction. The cheap insurance is to run assert_aligned on the DSM and DTM paths. If it passes, you can difference directly with chm_windowed; if it fails, route through the resampling path.

LiDAR Point Cloud Preprocessing — clean, classified returns that feed CHM generation
Digital Terrain Model Generation — the bare-earth surface subtracted to derive height
Calculating Canopy Cover From CHM in Python — threshold a CHM into fractional cover
Forest Gap & Understory Analysis — delineate gaps and vertical structure from the CHM
