Skip to content

Datasets

Concepts

A dataset can have multiple meanings depending on the context, in NINA DMS a dataset is:

A collection of resources that are managed together and share common metadata.

A dataset is composed of one or more resources.

The dataset metadata is based on the DataCite schema and describes at abstract level the dataset as a whole, while each resource can have its own metadata that describes the specific resource.

Where possible some of the dataset metadata can be computed, for example:

  • reference to users can be used, this avoids duplicating people information
  • references to other datasets can be created in the user interface
  • it's possible to automatically compute the spatial extent of the dataset based on the resources it contains

NOTE: it's still possible to enter register relationships to dataset/publications that are not in the system

Resources

A resource is a reference to something and must provide a uri that describes where the data can be found and the protocol to access it. A resource doesn't need to be accessible to be registered in the system, but it must have a meaningful uri.

Examples of resources with uri are:

  • A web accessible resource: https://path/to/the/file
  • A file: file:///path/to/the/file
  • An S3 object: s3://bucket-name/path/to/the/object
  • A database table: postgresql://host:port/dbname#table_name

Accessible resources can provide automatic metadata extraction using GDAL, this will allow to extract precise information about the resource like:

  • Spatial extent
  • Statistical information
  • Data schema

Additionally some resources can provide previews, at this moment only Cloud Optimized formats are supported for previews:

  • Cloud Optimized GeoTIFF (COG)
  • Parquet files

Registering resources

It's possible to registere a resource in two ways:

  • Manually, by providing the uri and other metadata
  • Automatically, by uploading a file using the upload mechanism, this will create a resource with a file uri.

When using the upload mechainism the file will be registered as a generic resource, you can change it later to a specific type.

NOTE: Changes to the resource uri or change of a resource type will trigger a metadata extraction if possible.

NOTE: Most of these operations are performed asynchronously, so it might take some time before the metadata is available.

It's possible to update a resource previously uploaded by uploading a same resource with the same filename

Resource Types

Generic Resource

A generic resource is a resource that doesn't have a specific type, it's just a reference to a uri.

examples:

  • A PDF document stored somewhere in P: file:///path/to/the/document.pdf
  • A web page: https://example.com/some/page.html
  • A githhub repository

Map Resource

A map resource is a reference to a map product, there are two main types of map resources:

  • Generic map resource: the URL of a published map (Example: an ArcGIS online map)
  • A NINA map: the configuration of a NINA map (JSON document)

In both cases a preview of the map will be shown.

Raster Resource

A raster resource is a reference to a raster file, if the file is http-accessible the DMS will use GDAL to extact the metadata and some approximate statistics about the raster.

Examples of raster resources:

  • A Cloud Optimized GeoTIFF (COG): https://path/to/the/file.cog.tif
  • A GeoTIFF file stored in P: file:///path/to/the/file.tif
  • Any other GDAL supported raster format

NOTE: only COG files will provide previews. Make sure to generate the COG with the appropriate overviews to have good performance.

Tabular Resource

A tabular resource is a reference to a tabular or vector data file, if the file is http-accessible the DMS will use GDAL to extact the metadata.

Examples of tabular resources:

  • A Parquet file: s3://bucket-name/path/to/the/file.parquet
  • A GeoPackage file stored in P: file:///path/to/the/file.gpkg
  • Any other GDAL supported vector format

NOTE: Avoid Shapefiles as they represent a single resource but they are composed of multiple files. NOTE: only Parquet files will provide an interactive preview.

Partitioned Resource

Partitioned resources are collections of files that share the same schema and represent a single logical dataset, but are composed of multiple physical files.

Examples of partitioned resources:

  • A partitioned Parquet dataset stored in S3: s3://bucket-name/path/year=2023/month=01/data.parquet
  • A hive partitioned dataset

NOTE: support for this resource type is not implemented yet.