Introduction

This notebook demonstrates the use of aireo_lib, a Python library created as part of the AIREO (Artificial Intelligence Ready Earth Observation training datasets) project. The project aims to make EO datasets easily accessible to the ML (Machine Learning) community. To that end, it proposes the AIREO specifications (specs for short), which define the metadata elements to be included with a training dataset, supplemented by a best-practices document that suggests how to fill in those elements. The library implements the specs and best practices and offers an easy-to-use Pythonic interface that bridges the gap between the EO and ML communities.

Accordingly, this notebook is divided into two sections: one for the creator of the training dataset (usually from the EO community) and one for its user (usually from the ML community). The notebook is structured as follows:

1) For Creator

- Create a [STAC](https://stacspec.org/) catalog object using the library

- Populate metadata elements prescribed by the AIREO specs

- Generate a STAC metadata directory using the library

- Check AIREO compliance level and metadata completeness


2) For User

- Create a training dataset object as defined in the library using only the STAC metadata

- Get example instances from the object and other dataset variables like the number of instances, etc.

- Use the library's functions to plot the data

- Investigate statistics using the library

About the training dataset

The AI4Arctic ASIP Sea Ice Dataset contains 461 files in network Common Data Form (netCDF). The files combine Sentinel-1 SAR and AMSR2 microwave radiometer imagery with corresponding ice charts of the Arctic area from the Danish Meteorological Institute.

The data amounts to over 300 GB in netCDF format and is accompanied by a well-written manual. Sentinel-1 data has 90 m resolution with 40 m x 40 m pixel spacing, and the AMSR2 data is resampled to the Sentinel-1 pixels. Each satellite image carries a timestamp, and the data was acquired between 2018 and 2019. The dataset also contains an Excel sheet listing all image IDs along with the percentage of water and ice present in each image, and a shapefile with the bounding boxes of all S1 scenes.

TDS Description

AIREO STAC Catalog basics

The AIREO specs propose a hierarchical structure for STAC metadata. It is a two-level structure in which the dataset is represented by a collection of AOIs (Areas of Interest); the dataset and the AOI are the two levels.

  1. At the dataset level there is a dataset catalog whose metadata elements are the core elements proposed in the AIREO spec. The metadata elements common to all AOIs are also kept at the dataset level, which we shall call the root level henceforth. Here, each data variable has a separate JSON file, which is a STAC Item by definition and is named using the field_schema metadata element. The root level also holds a datasheet file in markdown format, which contains human-readable information about the key elements of the dataset.

  2. Each AOI has a separate folder within the root level. Each AOI folder contains a STAC Collection representing that AOI and an additional JSON file for each data variable. These JSON files are also STAC Items and follow a naming convention similar to the ones at the root level. The assets for each AOI, i.e. the files containing the actual data, are also in the folder.

The diagram below summarises this hierarchical structure:

Root level (dataset)
│
│   DatasetCatalog.json
│   datasheet.md
│   references_output1.json
│   features_input1.json
│   ...
│
│
└───AOI 1
│      1.json (AOI Collection)
│      feature_input1.json
│      reference_output1.json
│      <reference_asset>
│      <feature_asset>
│   
│   
└───AOI 2
│      ...
│   
│
└───AOI 3
│      ...
│   
...
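
The layout in the diagram can be mocked up with the Python standard library alone, which helps when checking what a generated directory should contain. The sketch below does not use aireo_lib and does not write valid STAC: the function name `build_skeleton`, the placeholder file contents, the variable names and the choice to name each AOI folder after its id are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def build_skeleton(root, aoi_ids, variables):
    """Write placeholder files mirroring the two-level layout (hypothetical helper)."""
    root.mkdir(parents=True, exist_ok=True)

    # Root level: dataset catalog, datasheet and one JSON file per data variable.
    (root / "DatasetCatalog.json").write_text(json.dumps({"id": "dataset"}))
    (root / "datasheet.md").write_text("# Datasheet\n")
    for var in variables:
        (root / f"{var}.json").write_text(json.dumps({"id": var}))

    # AOI level: one folder per AOI with its collection and per-variable JSON files.
    # Asset files (the actual data) are left out of this mock-up.
    for aoi in aoi_ids:
        aoi_dir = root / str(aoi)
        aoi_dir.mkdir(exist_ok=True)
        (aoi_dir / f"{aoi}.json").write_text(json.dumps({"id": str(aoi)}))
        for var in variables:
            (aoi_dir / f"{var}.json").write_text(json.dumps({"id": var}))

    # Return every file written, relative to the root, in sorted order.
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    files = build_skeleton(
        Path(tmp) / "stac", [1, 2], ["features_input1", "references_output1"]
    )
    for name in files:
        print(name)
```

Running it prints one relative path per file, so the mock-up can be compared line by line with the diagram.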

Creating a STAC catalog with aireo_lib

The aireo_lib library makes it easier to generate the STAC metadata directory as defined above. Some of the useful functionalities in the library are:

Follow the code and comments below to understand the steps needed to generate STAC metadata with the library.

stac_generated folder

Prior to this step, you will need to create an empty folder called 'biomass_stac_generated' in your environment.
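
One way to prepare the folder from within the notebook is sketched below using only the standard library. It assumes the folder should live in the current working directory, and the variable names are illustrative. The check for leftover files is an addition for convenience, not a requirement stated by the library.

```python
from pathlib import Path

# Folder name taken from the text above; created relative to the working directory.
stac_dir = Path("biomass_stac_generated")
stac_dir.mkdir(exist_ok=True)

# Warn if the folder already holds files from an earlier run.
leftovers = sorted(p.name for p in stac_dir.iterdir())
if leftovers:
    print(f"Warning: {stac_dir} is not empty: {leftovers}")
```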

Checking AIREO compliance level
Checking metadata completeness

Defining AOI class

TDS user

The user of the dataset can access most of what the dataset offers using just its STAC catalog. All they need to do is create a dataset object by passing it the path to the STAC catalog at the root level. The library automatically reads in all the metadata and loads the assets into the dataset object. Some of the functionalities that a dataset object offers through aireo_lib are:

Parsing TDS
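
Independently of aireo_lib, the STAC directory can be inspected with the standard library to see which AOIs a catalog points to. The sketch below is not the library's API. It assumes the root catalog lists each AOI collection as a link with `rel` set to `child` and a relative `href`, which is the usual STAC convention. The helper name `child_paths` and the toy catalog are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def child_paths(catalog_file):
    """Return the paths that a STAC catalog's 'child' links point to."""
    catalog = json.loads(catalog_file.read_text())
    return [
        (catalog_file.parent / link["href"]).resolve()
        for link in catalog.get("links", [])
        if link.get("rel") == "child"
    ]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Toy directory with one AOI collection and a root catalog linking to it.
    (root / "1").mkdir()
    (root / "1" / "1.json").write_text(json.dumps({"type": "Collection", "id": "1"}))
    (root / "DatasetCatalog.json").write_text(
        json.dumps(
            {
                "type": "Catalog",
                "id": "toy",
                "links": [{"rel": "child", "href": "./1/1.json"}],
            }
        )
    )

    # Follow each child link and read the id of the collection it points to.
    aoi_ids = [
        json.loads(p.read_text())["id"]
        for p in child_paths(root / "DatasetCatalog.json")
    ]
    print(aoi_ids)
```

Pointing `child_paths` at a real `DatasetCatalog.json` would list the AOI collection files, provided the generated catalog follows the link convention assumed here.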

Plotting

Statistics