Introduction

The notebook demonstrates the use of aireo_lib , a python library created as a part of the AIREO (Artificial Intelligence Ready Earth Observation training datasets) project. The project aims to make EO datasets easily accessible for the ML (Machine Learning) community. As such, AIREO specifications (shorthand specs) which define metadata elements to be included with the training dataset are proposed, supplemented by a best-practices document which suggests how to fill those metadata elements. Finally, the library takes all into account and implements specs, best-practices and offers an easy-to-use pythonic interface bridging the gap between EO and ML community.

Therefore, this notebook is divided into two sections, one for the training dataset creator (usually from the EO community) and the other for its user (usually from the ML community). The structure of the notebook is the following:

1) For Creator

- Create a [STAC](https://stacspec.org/) catalog object using the library

- Populate metadata elements prescribed by the AIREO specs

- Generate a STAC metadata directory using the library

- Check AIREO compliance level and metadata completeness


2) For User

- Create a training dataset object as defined in the library using only the STAC metadata

- Get example instances from the object and other dataset variables like the number of instances, etc.

- Use library's functions to plot the data

- Investigate statistics using the library

About the training dataset

SpaceNet7 is a collection of satellite image time series covering 100 different locations in a diverse range of environments. It also includes a set of polygons for each image marking out the building footprints. The image data was collected by Planet satellites at an interval of roughly one month for each location for a period of two years. This is not compeltely consistent across different locations. The data was then manually annotated by a team at SpaceNet to produce the building footprint polygons. The full data from 60 areas was released as a public dataset.

We have simplified the dataset slightly for the purpose of creating a pilot AIREO dataset. We have converted all building polygons to boolean raster masks, and have stored these and the images in netcdf files as time series of raster data, one for each location. We have also standardized image resolution across all locations. This notebook only makes use of 5 AOIs out of 60 for demonstration purposes.

This dataset can be used to train ML models for the task of automatic building detection in satellite images, and for the tracking of building development over time.

AIREO STAC Catalog basics

The AIREO specs propose a hierarchical structure for STAC metadata. It is a two level structure where the dataset is represented by a collection of AOIs (Area Of Interests), hence, the dataset and AOI being the two levels.

  1. At the dataset level we have a dataset catalog whose metadata elements are the core elements proposed in the AIREO spec. In addition to it, the common metadata elements across each AOI are also at the dataset level, which we shall call root level henceforth. Here, for each data variable there is a separate json which is a STAC Item by definition and is named using the field_schema metadata element. Additionally, there is also a datasheet file in markdown format at the root level which contains human readable information about the key elements of the dataset.

  2. Each AOI has a separate folder within the root level. And in each AOI folder there is a STAC collection representing that AOI and additional json files for each data variable. The additional json files here too, are STAC Items and follow a similar naming convention to the ones at the root level. The assets for each AOI, i.e. the files containing actual data are also in the folder.

The diagram below summarises this hierarchical structure:

Root level (dataset)
│
│   DatasetCatalog.json
│   datasheet.md
│   references_output1.json
│   features_input1.json
│   ...
│
│
└───AOI 1
│      1.json (AOI Collection)
│      feature_input1.json
│      reference_output1.json
│      <reference_asset>
│      <feature_asseet>
│   
│   
└───AOI 2
│      ...
│   
│
└───AOI 3
│      ...
│   
...

Creating a STAC catalog with aireo_lib

The aireo_lib library makes it easier to generate the STAC metadata directory as defined above. Some of the useful functionalities in the library are:

Follow the code and comments below to understand the steps needed to generate STAC metadata with the library.

stac_generated folder creation

Prior to this step, you will need to create a folder called 'biomass_stac_generated' in your environment and leave it empty

Checking AIREO compiance level
Checking metadata completeness

Defining AOI class

Dataset user

The user of the dataset can access most of what is offered by the dataset using just its STAC catalog. All he/she needs to do is create a dataset object by passing to it the path to the STAC catalog at the root level. The library automatically reads in all the metadata and loads the assets into the dataset object. Some of the functionalities that a dataset object offers through aireo_lib are:

Parsing TDS

Plotting functions in aireo_lib

Statistics functions in aireo_lib