STAC monitoring pipeline

Run a monitoring simulation anywhere in the world using STAC

This example demonstrates how to retrieve satellite image time-series from a cloud archive, given a pair of coordinates anywhere in the world, and use them for a monitoring simulation. Here, “simulation” does not refer to simulated data; the example uses real data to detect actual land dynamics. The term highlights the retrospective nature of the exercise, distinguishing it from real-time monitoring systems. For this example, data is retrieved from the Microsoft Planetary Computer archive, which is indexed in a Spatio-Temporal Asset Catalogue (STAC), making it queryable in an interoperable way. While querying and preparing data extends slightly beyond the nrt package’s scope, we aim to provide a comprehensive view of the entire pipeline. Specifically, this example demonstrates how to:

Query Landsat Collection 2 data for any location using the Planetary Computer STAC catalogue
Load a datacube as an in-memory xarray dataset using odc-stac and dask
Prepare the Landsat data for further analysis (mask clouds and shadows, apply scaling and offset, compute vegetation indices)
Generate a simple rule-based forest mask
Simulate a near real-time monitoring scenario

# sphinx_gallery_thumbnail_path = '_static/iqr_results_bolivia.png'

Define the study area

Here we prepare a bounding box of 20 by 20 km, centered on a forest area near the city of Concepción in Bolivia (The area is further described in [1]). Center coordinates are in longitude, latitude which is the Coordinate Reference System we generally get from most online services (google maps or simply typing “name of the place coordinates” in a search engine). To expand these coordinates to a bounding box of 20 km side, we transform the center point to a metric equidistant custom CRS (local UTM, if known could be used as well) and expand the point. Because the STAC API, later used to query the data, expects a bounding box with coordinates in EPSG:4326, we convert the bbox back to that CRS.

from shapely.geometry import Point
from shapely.ops import transform
from pyproj import CRS, Transformer

# Define centroid as a shapely point (in EPSG:4326)
centroid = Point(-61.725, -16.162)
# Custom equidistant CRS centered on the centroid
local_crs = CRS(proj='aeqd', datum='WGS84', lat_0=centroid.y, lon_0=centroid.x)
transformer = Transformer.from_crs(CRS.from_epsg(4326), local_crs, always_xy=True)
# Transform centroid to custom CRS, expand to bbox and transform back to EPSG:4326
centroid_proj = transform(transformer.transform, centroid)
bbox_proj = centroid_proj.buffer(10000).bounds
bbox = transformer.transform_bounds(*bbox_proj, direction='INVERSE')

Query and load the data

Here we query 5 years of Landsat data that intersect the previously defined study area. pystac-client is used to handle the query, while we use odc-stac in combination with dask to efficiently load the data as an xarray.Dataset. The local Dask cluster allows loading data in parallel, which can speed up the overall process. However, the main bottleneck here is network speed, and parallelization may not be very advantageous on a slow network, particularly for this example, which does not perform any data warping.

The resulting Dataset has the following dimensions (y: 671, x: 672, time: 67) and contains four data variables (red, nir08, swir22 and qa_pixel).

import datetime

import xarray as xr
import numpy as np
from pystac_client import Client as psClient # psClient to not be confused with dask's Client
from odc.stac import stac_load, configure_rio
from dask.distributed import Client, LocalCluster
import planetary_computer as pc


# Set up a local dask cluster and configure stac-odc to efficiently deal with cloud hosted data
# while taking advantage of the dask cluster
# Different types of configuration will be re
cluster = LocalCluster(n_workers=5, threads_per_worker=2)
client = Client(cluster)
configure_rio(cloud_defaults=True, client=client)


# Open catalogue connection and query data
date_range = [datetime.datetime(2017,1,1), datetime.datetime(2021,12,31)]
catalog = psClient.open('https://planetarycomputer.microsoft.com/api/stac/v1',
                        modifier=pc.sign_inplace)
# Query all Landsat 8 and 9 that instersect the spatio-temporal extent and have a
# scene level cloud cover < 50%
query = catalog.search(collections=["landsat-c2-l2"],
                       bbox=bbox,
                       datetime=date_range,
                       query={"eo:cloud_cover": {"lt": 50},
                              "platform": {"in": ["landsat-8", "landsat-9"]}})

# Load the required bands as a lazy (dask based) Dataset
ds = stac_load(query.items(),
               bands=['red', 'nir08', 'swir22', 'qa_pixel'],
               groupby='solar_day',
               chunks={'time': 1},
               bbox=bbox,
               patch_url=pc.sign,
               fail_on_error=False)

# Trigger computation to bring the data in memory (depending on network speed, this step may
# take up to a few minutes)
ds = ds.compute()
ds

<xarray.Dataset>
Dimensions:      (y: 671, x: 672, time: 67)
Coordinates:
  * y            (y) float64 -1.777e+06 -1.777e+06 ... -1.797e+06 -1.797e+06
  * x            (x) float64 6.262e+05 6.263e+05 ... 6.464e+05 6.464e+05
    spatial_ref  int32 32620
  * time         (time) datetime64[ns] 2017-01-28T14:10:25.141063 ... 2021-11...
Data variables:
    red          (time, y, x) uint16 7922 7918 7960 7946 ... 8766 8802 8588 8453
    nir08        (time, y, x) uint16 19040 18943 19307 ... 19187 20724 22073
    swir22       (time, y, x) uint16 9048 9042 9069 9116 ... 10815 10658 10374
    qa_pixel     (time, y, x) uint16 21824 21824 21824 ... 21824 21824 21824

Data Preparation

Data preparation involves three primary steps:

Masking the Data: Use the qa_pixel layer to mask pixels contaminated by clouds and shadows. Observations classified as “invalid” due to cloud or shadow coverage are converted to np.nan to signify missing data.
Applying Scale and Offset: Adjust the raw satellite data by applying necessary scaling factors and offsets to convert pixel values into calibrated surface reflectance values.
Computing Vegetation Indices: Compute the Normalized Difference Vegetation Index (NDVI) which will be used later on in the monitoring process

Cloud Masking

Each Landsat scene includes a qa_pixel layer that provides bit-encoded Quality Assurance flags. This method of encoding uses bits to represent different conditions such as cloud, cloud shadow, and snow, allowing a vast range of scenarios to be compactly represented without the need for an extensive table of correspondences. This bit encoding approach, while efficient, makes interpreting mask values less intuitive compared to simpler methods like the Sentinel-2 SCL mask, which uses mutually exclusive integers to classify conditions.

According to the Landsat Collection 2 Product Guide on page 13, conditions such as fill, dilated cloud, cirrus, cloud, and cloud shadow are encoded into bits 0, 1, 2, 3, and 4, respectively. In temperate regions, it might also be necessary to mask snow, although it is generally not a concern for areas without snow coverage. The bit mask for these conditions is 0001 1111, which is more conveniently represented in hexadecimal as 0x1F. Pixels in the qa_pixel layer that have any of these bits set will return non-zero values when processed with np.bitwise_and and the mask 0x1F.

Scale and offset

Still according to the product guide (p.12), a scale and an offset must be applied to each band to obtain surface reflectance.

Warning

Note that the presence of the offset makes this step absolutely necessary, even when working with ratios or normalized indices such as the NDVI. In the past many satellite datasets only had a scaling factor but no offset, making it possible to work directly on raw data.

# Compute binary mask
mask = np.bitwise_and(ds.qa_pixel.values, 0x1F) == 0
# Convert invalid pixels to np.nan and drop qa_pixel
ds_clean = ds.where(mask).drop_vars('qa_pixel')
# Apply scaling and offset
ds_clean = ds_clean * 0.0000275 - 0.2
# Compute NDVI
ds_clean['ndvi'] = (ds_clean.nir08 - ds_clean.red) / (ds_clean.nir08 + ds_clean.red)

# Split the dataset into history and monitoring period
history = ds_clean.sel(time=slice(datetime.datetime(2017,1,1),
                                  datetime.datetime(2019,12,31))) # 3 years
monitor = ds_clean.sel(time=slice(datetime.datetime(2020, 1, 1), None))

Forest mask

It is generally recommanded, for more targetted monitoring, to provide a forest mask during instantiation of one of the nrt.monitor classes. There are many ways to obtain a forest mask, such as using global products or from a custom local classification. Here we use a rule based approach proposed by Zhu et al. (2012) as part of an article on continuous forest disturbances mapping [2]. The reasoning of that rule based algorithm is that forests have a high NDVI (high “greenesss”) and are relatively “dark” in the short wave infrared part of the light sprectrum. The long term NDVI and SWIR values both correspond to the intersect of a temporal regression with a single annual harmonic component. The following rule can then be used to distinguish forested from non-forested land.

IF beta0_NDVI > 0.6 AND beta0_SWIR2 < 0.1:
    pixel = Forest
ELSE:
    pixel = non-Forest

Intersect thresholds of 0.6 and 0.1 are those used in Zhu et al. (2012) and were defined for termporate forests in the United States of America. Here, we are in semi-deciduous context and forests are likely “greener” overall. We therefore adjusted the thresholds to exclude as many non-forested pixels as possible from the forest mask.

To quickly compute these long term NDVI and SWIR intersects in a non-verbose way, we can (mis)use the private ._fit method present in all nrt.monitor classes.

from nrt.monitor.iqr import IQR
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from matplotlib_scalebar.scalebar import ScaleBar

# Prepare matrix of regressors
X = IQR.build_design_matrix(dataarray=history.ndvi,
                            trend=False, harmonic_order=1)
# Instantiate an arbitrary class from nrt.monitor
iqr = IQR()
# Use _fit to get beta arrays
beta_ndvi, _ = iqr._fit(X, history.ndvi)
beta_swir, _ = iqr._fit(X, history.swir22)
# Apply logical rule (first 'layer' of beta arrays is the intersect)
# Thresholds adjusted to be more restrictive
forest = np.logical_and(beta_ndvi[0] > 0.7, beta_swir[0] < 0.09).astype(np.uint8)

# Visualize
forest_cmap = ListedColormap([(0, 0, 0, 0), (0, 1, 0, 0.4)])
scalebar = ScaleBar(30)
plt.figure(figsize=(8, 8))
plt.imshow(history.ndvi[21], cmap='gray', interpolation='none',
           vmin=0.2, vmax=1.3)
plt.imshow(forest, cmap=forest_cmap, interpolation='none')
ax = plt.gca()
ax.add_artist(scalebar)
ax.set_xticks([])
ax.set_yticks([])

Monitoring simulation

In this section, we simulate near real-time monitoring by using three years of historical data for model fitting, followed by monitoring over a two-year period. Here are key details about the chosen monitoring algorithm and parameters:

Monitoring Algorithm: We chose the Inter-Quantile Range (IQR) algorithm. This simple method counts consecutive anomalies to confirm a disturbance, making it effective for detecting abrupt changes like deforestation in environments with high natural variability. It is sensitive enough to pick up real disturbances without being falsely triggered by occasional extreme conditions, such as a particularly dry dry-season that temporarilly causes large regional drops in greeness.
Baseline Model: We used a robust fitting method (RIRLS) to establish the baseline model. This technique reduces the influence of outliers, such as an extreme drought year in the historical data, ensuring the model is stable.

# Instantiate IQR class
model = IQR(mask=forest,
            harmonic_order=1,
            trend=False,
            sensitivity=1.4,
            boundary=4)
# Fit temporal model over stable history period
model.fit(dataarray=history.ndvi, method='RIRLS', maxiter=5)

# Run .monitor on each observation of the monitoring period, one at the time
for date in monitor.time.values.astype('M8[s]').astype(datetime.datetime):
    ds_sub = monitor.sel(indexers={'time': date}, method='nearest')
    model.monitor(array=ds_sub.ndvi.values, date=date)

# Visualize
mask_cmap = ListedColormap([(0, 0, 0, 0),
                            (0, 1, 0, 0.1),
                            (1, 0, 0, 0.1),
                            (1, 0, 1, 0.4)])
scalebar = ScaleBar(30)
plt.figure(figsize=(8, 8))
plt.imshow(ds_clean.ndvi[60], cmap='gray', interpolation='none',
           vmin=0.2, vmax=1.3)
plt.imshow(model.mask, cmap=mask_cmap, interpolation='none')
ax = plt.gca()
ax.add_artist(scalebar)
ax.set_xticks([])
ax.set_yticks([])

The resulting mask reveals clear signs of agricultural expansion in the east and south-west of the study area, as well as more subtle traces of selective logging in the north. Patterns in the north-west corner are likely errors due to unexpected natural variability.

References

Gallery generated by Sphinx-Gallery