Hyperspectral image processing at scale
Client: Planblue GmbH
Services used: Hyperspectral imaging, data pipelines, radiometric correction, water-column correction
Stack:
Python
Rasterio
Xarray
Scikit-learn
Dask
Kedro
Zarr
Docker
AWS
Airflow
Terraform

I led the development and deployment of an end-to-end pipeline for underwater hyperspectral imagery at Planblue, cutting time-to-data from weeks to minutes. The pipeline automates radiometric calibration, geo-rectification and normalization of hyperspectral data cubes. This solution accelerates marine ecosystem surveys and analyses at scale.

Underwater hyperspectral imaging
Figure 1: Underwater hyperspectral imaging platform

Acquiring and processing underwater hyperspectral data for ecosystem surveys is a time-consuming task, often requiring weeks to months before the imagery is ready for downstream analysis. Furthermore, sensor characteristics and environmental conditions introduce variability in optical quality, complicating data reliability. For Planblue GmbH, I developed and deployed fully automated processing pipelines that ingest raw hyperspectral data, apply extensive data quality checks, and deliver analysis-ready datasets, reducing the time-to-data from weeks to minutes.

Methodology

After field acquisition, the raw hyperspectral data cubes and ancillary sensor data, such as irradiance measurements, RGB imagery, and navigation data, were exposed through a FastAPI backend.

The processing pipelines were built with the Kedro framework. Kedro provides utility classes and functions for defining processing nodes, the data flow between them, and datasets for accessing data locally or in the cloud (e.g. Kedro's S3 datasets).
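Conceptually, such a pipeline is a set of named nodes wired together by their declared inputs and outputs. The stand-in below illustrates that idea in plain Python; it is not Kedro's actual API, and the node functions are toy placeholders.

```python
# Minimal stand-in for the node/pipeline concept (not Kedro's real API):
# each node declares a function plus named inputs and an output, and the
# runner resolves the data flow through a shared catalog dict.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Node:
    func: Callable
    inputs: list[str]
    output: str

def run_pipeline(nodes: list[Node], catalog: dict) -> dict:
    """Execute nodes in order, reading/writing datasets by name."""
    for node in nodes:
        args = [catalog[name] for name in node.inputs]
        catalog[node.output] = node.func(*args)
    return catalog

# Two toy processing steps wired together by dataset names.
nodes = [
    Node(lambda raw: raw * 0.5, ["raw_cube"], "calibrated_cube"),
    Node(lambda cal: cal + 1.0, ["calibrated_cube"], "rectified_cube"),
]
catalog = run_pipeline(nodes, {"raw_cube": 10.0})  # -> rectified_cube == 6.0
```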

The Xarray and Rio-Xarray libraries were used to load and transform the hyperspectral and ancillary data. To efficiently process large datasets and improve pipeline throughput, the hyperspectral data cubes were split into smaller batches for parallel processing using Dask.
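The batching idea can be sketched as follows, with NumPy and a thread pool standing in for Dask's scheduler; the cube shape, batch count, and per-batch transform are illustrative.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def correct_batch(batch: np.ndarray) -> np.ndarray:
    """Placeholder per-batch transform (e.g. a radiometric correction)."""
    return batch.astype(np.float32) / 255.0

# Toy cube with shape (bands, rows, cols): split along rows into
# batches that fit in memory, process in parallel, then reassemble.
cube = np.random.randint(0, 256, size=(50, 400, 300), dtype=np.uint16)
batches = np.array_split(cube, 8, axis=1)

with ThreadPoolExecutor() as pool:
    corrected = np.concatenate(list(pool.map(correct_batch, batches)), axis=1)
```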

Intermediary and final data products were stored as Zarr files on Amazon's S3 buckets. Zarr efficiently handles the storage of geospatial data and metadata, and enables quick retrieval of individual hyperspectral bands and regions-of-interest without loading the entire data cube into memory.

Pipelines were containerized using Docker and deployed as separate microservices that handle radiometric correction, geo-rectification, and water-column correction in sequence. Pipeline orchestration was managed with Airflow.

Results

I developed a scalable processing pipeline that ingests raw hyperspectral data cubes and performs the necessary transformations to deliver high-quality, standardized data products ready for downstream analysis.

Hyperspectral processing pipeline
Figure 2: Kedro pipeline for processing of hyperspectral data cubes

The pipeline performs several steps. First, it applies radiometric calibration to correct for dark current and the sensor's quantum efficiency (Figure 2). Second, the geo-rectification processor maps the hyperspectral pixels to world coordinates, using geolocation estimates from the navigation sensors and the photogrammetry process. Lastly, the hyperspectral data is corrected for the attenuation effects of the water column and normalized to the downwelling solar irradiance.
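In simplified form, the calibration and normalization steps amount to a per-pixel arithmetic sketch like the one below; the symbols and coefficient values are illustrative, not the sensor's actual calibration, and the geo-rectification and water-column terms are omitted.

```python
import numpy as np

def calibrate(dn, dark, qe, t_exp):
    """Radiometric calibration: subtract the dark-current signal, then
    divide by the per-band quantum-efficiency gain and exposure time."""
    return (dn - dark) / (qe * t_exp)

def normalize(radiance, irradiance):
    """Normalize radiance by the downwelling solar irradiance to obtain
    a dimensionless reflectance-like quantity."""
    return radiance / irradiance

# Toy digital numbers for two pixels of a single band.
dn = np.array([120.0, 220.0])
radiance = calibrate(dn, dark=20.0, qe=0.5, t_exp=2.0)  # -> [100., 200.]
reflectance = normalize(radiance, irradiance=400.0)     # -> [0.25, 0.5]
```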

Pipeline throughput was significantly improved by leveraging Dask, Xarray and Zarr for batched, parallel processing of the data. Storing data products as Zarr enables easy querying and analysis directly from development environments such as JupyterLab.

Conclusion

The processing time for raw hyperspectral data was reduced from weeks to minutes. These pipelines enable efficient processing of large hyperspectral datasets that do not fit into memory, streamlining the path to near real-time analysis.
