Datasets

Concepts

The term "dataset" can mean different things depending on the context. In NINA DMS, a dataset is:

A collection of resources that are managed together and share common metadata.

A dataset is composed of one or more resources.

The dataset metadata is based on the DataCite schema and describes the dataset as a whole at an abstract level, while each resource can have its own metadata describing that specific resource.

Where possible, some of the dataset metadata can be computed or linked rather than entered manually, for example:

References to users can be used, which avoids duplicating information about people
References to other datasets can be created in the user interface
The spatial extent of the dataset can be computed automatically from the resources it contains
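
As an illustration of the last point, a dataset extent can be derived as the smallest bounding box that covers the extents of its resources. This is a minimal sketch and not the DMS implementation: the function name, the `(min_x, min_y, max_x, max_y)` tuple layout, the sample coordinates, and the assumption that all extents share one coordinate reference system are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# Assumed layout: (min_x, min_y, max_x, max_y) in one shared coordinate system.
BBox = Tuple[float, float, float, float]


def dataset_extent(resource_extents: Iterable[Optional[BBox]]) -> Optional[BBox]:
    """Return the smallest box covering every known resource extent."""
    # Resources without a spatial extent (e.g. a PDF) are represented by None.
    known = [box for box in resource_extents if box is not None]
    if not known:
        return None  # nothing to compute from
    min_xs, min_ys, max_xs, max_ys = zip(*known)
    return (min(min_xs), min(min_ys), max(max_xs), max(max_ys))


# Made-up sample extents for three resources of one dataset.
extents = [
    (10.0, 63.0, 11.0, 64.0),
    None,
    (10.5, 62.5, 12.0, 63.5),
]
print(dataset_extent(extents))  # (10.0, 62.5, 12.0, 64.0)
```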

NOTE: It's still possible to register relationships to datasets or publications that are not in the system.

Resources

A resource is a reference to something and must provide a uri that describes where the data can be found and the protocol to access it. A resource doesn't need to be accessible to be registered in the system, but it must have a meaningful uri.

Examples of resources and their uris:

A web accessible resource: https://path/to/the/file
A file: file:///path/to/the/file
An S3 object: s3://bucket-name/path/to/the/object
A database table: postgresql://host:port/dbname#table_name
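
The uris above all follow the same pattern: the scheme names the access protocol, and the rest says where the data lives. The sketch below splits a uri into those parts with Python's standard `urllib.parse`. It only illustrates the uri structure; the `describe_uri` helper, its returned keys, and the rule that a uri without a scheme is rejected are assumptions, not DMS behavior. The host and port in the PostgreSQL example are placeholders.

```python
from urllib.parse import urlsplit


def describe_uri(uri: str) -> dict:
    """Split a resource uri into protocol and location parts."""
    parts = urlsplit(uri)
    if not parts.scheme:
        # Assumption: a uri without a protocol is not considered meaningful.
        raise ValueError(f"uri has no protocol: {uri!r}")
    return {
        "protocol": parts.scheme,    # e.g. https, file, s3, postgresql
        "location": parts.netloc,    # host or bucket name; empty for file uris
        "path": parts.path,
        "fragment": parts.fragment,  # e.g. a table name after '#'
    }


for uri in [
    "https://path/to/the/file",
    "file:///path/to/the/file",
    "s3://bucket-name/path/to/the/object",
    "postgresql://db.example.org:5432/dbname#table_name",  # placeholder host and port
]:
    print(describe_uri(uri))
```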

Accessible resources support automatic metadata extraction using GDAL, which yields precise information about the resource, such as:

Spatial extent
Statistical information
Data schema

Additionally, some resources can provide previews. At the moment, only cloud-optimized formats are supported for previews:

Cloud Optimized GeoTIFF (COG)
Parquet files

Registering resources

It's possible to register a resource in two ways:

Manually, by providing the uri and other metadata
Automatically, by uploading a file through the upload mechanism, which creates a resource with a file uri

When using the upload mechanism, the file is registered as a generic resource. You can change it to a specific type later.

NOTE: Changing a resource's uri or type triggers a new metadata extraction, if possible.

NOTE: Most of these operations are performed asynchronously, so it might take some time before the metadata is available.

It's possible to update a previously uploaded resource by uploading a file with the same filename.
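
The registration rules in this section can be summarized with a small in-memory model. This is a hypothetical stand-in used only to illustrate the rules, not the DMS API: the `Registry` and `Resource` classes, the `revision` counter, the `file:///uploads` root, and the `extraction_queue` list (standing in for asynchronous metadata extraction) are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resource:
    uri: str
    resource_type: str = "generic"  # uploads start as generic resources
    revision: int = 1               # hypothetical counter, bumped on re-upload


@dataclass
class Registry:
    """Hypothetical stand-in for the DMS resource registry."""
    upload_root: str = "file:///uploads"  # assumed location, not a real DMS path
    resources: Dict[str, Resource] = field(default_factory=dict)
    extraction_queue: List[str] = field(default_factory=list)

    def upload(self, filename: str) -> Resource:
        uri = f"{self.upload_root}/{filename}"
        existing = self.resources.get(uri)
        if existing is not None:
            # Same filename: update the existing resource instead of adding one.
            existing.revision += 1
            return existing
        resource = Resource(uri=uri)
        self.resources[uri] = resource
        return resource

    def change_type(self, uri: str, new_type: str) -> None:
        self.resources[uri].resource_type = new_type
        # A type change requests extraction; the work itself would run later.
        self.extraction_queue.append(uri)


registry = Registry()
first = registry.upload("elevation.tif")
second = registry.upload("elevation.tif")  # re-upload with the same filename
registry.change_type(first.uri, "raster")
print(first is second, first.revision, registry.extraction_queue)
```

Because extraction is only queued in this model, the metadata would not be available immediately after `change_type` returns, which mirrors the asynchronous behavior described above.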

Resource Types

Generic Resource

A generic resource is a resource without a specific type. It's just a reference to a uri.

Examples:

A PDF document stored somewhere in P: file:///path/to/the/document.pdf
A web page: https://example.com/some/page.html
A GitHub repository

Map Resource

A map resource is a reference to a map product. There are two main types of map resources:

Generic map resource: the URL of a published map (for example, an ArcGIS Online map)
A NINA map: the configuration of a NINA map (JSON document)

In both cases a preview of the map will be shown.

Raster Resource

A raster resource is a reference to a raster file. If the file is HTTP-accessible, the DMS uses GDAL to extract the metadata and some approximate statistics about the raster.

Examples of raster resources:

A Cloud Optimized GeoTIFF (COG): https://path/to/the/file.cog.tif
A GeoTIFF file stored in P: file:///path/to/the/file.tif
Any other GDAL supported raster format

NOTE: Only COG files provide previews. Make sure to generate the COG with appropriate overviews to get good performance.
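
To give an idea of how georeferencing metadata translates into a spatial extent, the sketch below applies GDAL's six-coefficient affine geotransform to the four corners of a raster. The document does not describe how the DMS computes extents internally, so treat this as an illustration of the GDAL convention only. The function name and the sample geotransform values are made up.

```python
from typing import Sequence, Tuple


def raster_extent(
    gt: Sequence[float], width: int, height: int
) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) for a raster.

    gt follows the GDAL geotransform convention:
      x = gt[0] + col * gt[1] + row * gt[2]
      y = gt[3] + col * gt[4] + row * gt[5]
    """
    # Pixel/line coordinates of the four outer corners of the raster.
    corners = [(0, 0), (width, 0), (0, height), (width, height)]
    xs = [gt[0] + col * gt[1] + row * gt[2] for col, row in corners]
    ys = [gt[3] + col * gt[4] + row * gt[5] for col, row in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Made-up north-up raster: 10 m pixels, 1000 columns by 500 rows.
gt = (500000.0, 10.0, 0.0, 7000000.0, 0.0, -10.0)
print(raster_extent(gt, width=1000, height=500))
# (500000.0, 6995000.0, 510000.0, 7000000.0)
```

Taking the minimum and maximum over all four corners keeps the result correct for rotated rasters as well, where `gt[2]` and `gt[4]` are non-zero.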

Tabular Resource

A tabular resource is a reference to a tabular or vector data file. If the file is HTTP-accessible, the DMS uses GDAL to extract the metadata.

Examples of tabular resources:

A Parquet file: s3://bucket-name/path/to/the/file.parquet
A GeoPackage file stored in P: file:///path/to/the/file.gpkg
Any other GDAL supported vector format

NOTE: Avoid Shapefiles: although a Shapefile represents a single resource, it is composed of multiple files.

NOTE: Only Parquet files provide an interactive preview.

Partitioned Resource

Partitioned resources are collections of files that share the same schema and represent a single logical dataset, but are composed of multiple physical files.

Examples of partitioned resources:

A partitioned Parquet dataset stored in S3: s3://bucket-name/path/year=2023/month=01/data.parquet
A Hive-partitioned dataset

NOTE: support for this resource type is not implemented yet.
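
To illustrate the Hive-style layout used in the example uri above, the sketch below reads the `key=value` directory segments of a path into a dictionary. Since partitioned resources are not implemented yet, this says nothing about how the DMS will handle them. The `partition_values` helper is hypothetical.

```python
from typing import Dict
from urllib.parse import urlsplit


def partition_values(uri: str) -> Dict[str, str]:
    """Collect key=value path segments (Hive-style partitioning) from a uri."""
    values: Dict[str, str] = {}
    for segment in urlsplit(uri).path.split("/"):
        key, separator, value = segment.partition("=")
        if separator and key:
            # Values stay strings: the path alone does not carry type information.
            values[key] = value
    return values


print(partition_values("s3://bucket-name/path/year=2023/month=01/data.parquet"))
# {'year': '2023', 'month': '01'}
```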