Introduction

This notebook demonstrates the use of aireo_lib, a Python library created as part of the AIREO (Artificial Intelligence Ready Earth Observation training datasets) project. The project aims to make EO datasets easily accessible to the ML (Machine Learning) community. To that end, it proposes the AIREO specifications (specs for short), which define the metadata elements to be included with a training dataset, and supplements them with a best-practices document that suggests how to fill in those metadata elements. The library implements the specs and best practices and offers an easy-to-use Pythonic interface that bridges the gap between the EO and ML communities.

Therefore, this notebook is divided into two sections, one for the training dataset creator (usually from the EO community) and the other for its user (usually from the ML community). The structure of the notebook is the following:

1) For Creator

- Create a [STAC](https://stacspec.org/) catalog object using the library

- Populate metadata elements prescribed by the AIREO specs

- Generate a STAC metadata directory using the library

- Check AIREO compliance level and metadata completeness


2) For User

- Create a training dataset object as defined in the library using only the STAC metadata

- Get example instances from the object and other dataset variables like the number of instances, etc.

- Use the library's functions to plot the data

- Investigate statistics using the library

About the training dataset

The CAP (Common Agricultural Policy) dataset contains crop fields in Austria with different crop types. We divided all of Austria into rectangular AOIs (Areas of Interest) and downloaded Sentinel-2 images at 10 m resolution for each of them; the images are saved as GeoTIFF files. The field shapes and crop types have been converted to raster masks over these Sentinel-2 images, with each crop denoted by a different number; the masks are also stored as GeoTIFFs. Only the top 10 crop types, which together occupy almost 85% of the area, are used. This should be enough to get started; more information can be found in the metadata elements below.
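As a rough illustration of how numeric crop masks relate to area shares, the sketch below counts pixels per class in a toy mask and keeps the largest classes. It is not the procedure used to build the dataset: the function names, the crop ids, and the convention that 0 means "no field" are all assumptions made for this example.

```python
from collections import Counter


def class_area_fractions(mask, background=0):
    """Return {class_id: share of labelled pixels} for a 2-D mask given as nested lists."""
    counts = Counter(
        value for row in mask for value in row if value != background
    )
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def top_k_classes(fractions, k):
    """Return the k largest classes and their combined area share."""
    ranked = sorted(fractions.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    return [cls for cls, _ in kept], sum(frac for _, frac in kept)


# Toy 4x4 mask; 0 is assumed to mean "no field", the other numbers are made-up crop ids.
toy_mask = [
    [1, 1, 2, 0],
    [1, 1, 2, 0],
    [1, 3, 2, 0],
    [3, 4, 2, 0],
]
fractions = class_area_fractions(toy_mask)
kept_classes, kept_share = top_k_classes(fractions, k=2)
```

With a real mask, the same counting would be applied to the raster array read from the GeoTIFF, with `k=10`.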

Dataset Creator

AIREO STAC Catalog basics

The AIREO specs propose a hierarchical structure for STAC metadata. It is a two-level structure in which the dataset is represented by a collection of AOIs (Areas of Interest); the dataset and the AOI are the two levels.

  1. At the dataset level there is a dataset catalog whose metadata elements are the core elements proposed in the AIREO spec. The metadata elements common to all AOIs are also stored at the dataset level, which we shall call the root level henceforth. Here, each data variable has a separate JSON file, which is a STAC Item by definition and is named using the field_schema metadata element. The root level also contains a datasheet file in Markdown format with human-readable information about the key elements of the dataset.

  2. Each AOI has a separate folder within the root level. Each AOI folder contains a STAC Collection representing that AOI and an additional JSON file for each data variable. These JSON files are also STAC Items and follow a naming convention similar to the ones at the root level. The assets for each AOI, i.e. the files containing the actual data, are also in the folder.

The diagram below summarises this hierarchical structure:

Root level (dataset)
│
│   DatasetCatalog.json
│   datasheet.md
│   references_output1.json
│   features_input1.json
│   ...
│
│
└───AOI 1
│      1.json (AOI Collection)
│      feature_input1.json
│      reference_output1.json
│      <reference_asset>
│      <feature_asset>
│   
│   
└───AOI 2
│      ...
│   
│
└───AOI 3
│      ...
│   
...
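To make the two-level layout concrete, the following sketch builds a throwaway directory that mimics the diagram and then lists the JSON files found in each AOI folder. It uses only the Python standard library, not aireo_lib. The AOI folder names ("1", "2") and the helper name `summarise_layout` are assumptions for illustration, and the files are empty placeholders rather than valid STAC documents.

```python
import tempfile
from pathlib import Path


def summarise_layout(root):
    """Map each AOI folder name to the sorted JSON file names it contains."""
    root = Path(root)
    return {
        aoi_dir.name: sorted(p.name for p in aoi_dir.glob("*.json"))
        for aoi_dir in sorted(root.iterdir())
        if aoi_dir.is_dir()
    }


# Build a tiny temporary directory that mimics the diagram (empty placeholder files).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("DatasetCatalog.json", "datasheet.md"):
        (root / name).touch()
    for aoi in ("1", "2"):  # assumed AOI folder names
        aoi_dir = root / aoi
        aoi_dir.mkdir()
        for name in (f"{aoi}.json", "feature_input1.json", "reference_output1.json"):
            (aoi_dir / name).touch()

    layout = summarise_layout(root)
    root_files = sorted(p.name for p in root.iterdir() if p.is_file())
```

Pointing `summarise_layout` at a generated STAC metadata directory gives a quick view of which AOI folders and per-variable JSON files are present.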

Creating a STAC catalog with aireo_lib

The aireo_lib library makes it easier to generate the STAC metadata directory as defined above. Some of the useful functionalities in the library are:

Follow the code and comments below to understand the steps needed to generate STAC metadata with the library.

stac_generated folder creation

Before this step, create an empty folder called 'biomass_stac_generated' in your environment.
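The folder can also be created from within the notebook. A minimal sketch, assuming the folder should sit in the current working directory (the text does not say where it must be placed):

```python
from pathlib import Path

# Folder name taken from the text above; created relative to the current working directory.
stac_dir = Path("biomass_stac_generated")
stac_dir.mkdir(parents=True, exist_ok=True)

# The folder is expected to be empty before the metadata is generated.
is_empty = not any(stac_dir.iterdir())
```

If `is_empty` is `False`, the folder holds files from an earlier run and may need to be cleared first.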

Checking AIREO compliance level

Checking metadata completeness

Defining AOI class

To enable many of the library's other functionalities, the dataset creator needs to define an AOI class that specifies how the asset files are loaded and that can return an example and the length of the dataset. The library provides the blueprint in the AOIDataset class, which this class should inherit from. In the future, creation of the AOI class is planned to be automated as well.
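The general shape of such a class can be sketched without the library. The example below is only an illustration of the idea (load the data, return one example, report the length). `AOIDatasetSketch` is a hypothetical stand-in defined here, not the library's AOIDataset, whose actual method names and signatures may differ. The toy class keeps data in memory instead of reading GeoTIFF assets.

```python
from abc import ABC, abstractmethod


class AOIDatasetSketch(ABC):
    """Hypothetical stand-in for the library's AOIDataset blueprint (not the real class)."""

    @abstractmethod
    def __getitem__(self, index):
        """Return one example."""

    @abstractmethod
    def __len__(self):
        """Return the number of examples."""


class InMemoryAOI(AOIDatasetSketch):
    """Toy AOI holding (feature, reference) pairs in memory instead of reading GeoTIFFs."""

    def __init__(self, features, references):
        if len(features) != len(references):
            raise ValueError("features and references must have the same length")
        self._features = features
        self._references = references

    def __getitem__(self, index):
        # One example pairs a feature patch with its reference mask.
        return {"feature": self._features[index], "reference": self._references[index]}

    def __len__(self):
        return len(self._features)


aoi = InMemoryAOI(features=[[0.1, 0.2], [0.3, 0.4]], references=[[1, 1], [2, 0]])
first_example = aoi[0]
```

In a real AOI class, the constructor would read the feature and reference asset files for that AOI, and the class would inherit from the library's AOIDataset rather than from this stand-in.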

Dataset user

The user of the dataset can access most of what the dataset offers using just its STAC catalog. All they need to do is create a dataset object by passing it the path to the STAC catalog at the root level. The library automatically reads in all the metadata and loads the assets into the dataset object. Some of the functionalities that a dataset object offers through aireo_lib are:

Parsing the dataset by creating a dataset object

Plotting functions in aireo_lib

Statistics functions in aireo_lib

ML model

Here we demonstrate how the library interfaces with existing ML libraries. This is an oversimplified use of ResNet-50 from PyTorch, intended only to show this functionality of the library. We train the model for a few epochs on a few instances from our dataset object. The choices made here, such as the loss function, optimizer and training indexes, are for demonstration only; they are not recommendations and do not claim to be sensible.