Preface to the data management handbook
This Data Management Handbook originates from an initiative started by MET Norway and has been funded by the S-ENDA project.
1. Introduction
The purpose of the Data Management Handbook (DMH) is threefold:
- to provide an overview of the principles for data management to be employed;
- to help personnel identify their roles and responsibilities for good data management; and
- to provide personnel with practical guidelines for carrying out good data management.
Data management is the term used to describe the handling of data in a systematic and cost-effective manner. The data management regime should be continuously evolving, to reflect the evolving nature of data collection. Therefore this DMH is a living document that will be revised and updated from time to time in order to maintain its relevance.
The DMH is a strategic governing document and should be used as part of the quality framework the organisation is using.
The primary focus of this DMH is on the management of dynamic geodata. Dynamic geodata is weather, environment and climate-related data that changes in space and time and is thus descriptive of processes in nature. Examples are weather observations, weather forecasts, pollution (environmental toxins) in water, air and sea, information on the drift of cod eggs and salmon lice, water flow in rivers, driving conditions on the roads and the distribution of sea ice. Dynamic geodata provides important constraints for many decision-making processes and activities in society.
This introduction (Introduction) sets out the background and principles for the data management regime. Chapters 2-5 describe the implementation of the main building blocks: structuring and documenting data (???), data services (???), portals and documentation aimed at users (???), and governance issues (???). Each chapter starts with a brief statement of its purpose, followed by a description of what is implemented at present, as well as the planned developments for the short term (<2 years) and expected developments for the longer term (2-5 years).
Practical guidelines for carrying out good data management are addressed in Chapter 6.
The intended audience for this DMH is any personnel involved in the process of making data available for the end user. This process can be viewed as a value chain that moves from the producer of the data to the data consumer (i.e., the end user). The process is further described in The data value chain (???).
The handbook can be used in three ways:
- Read Introduction to learn about the background and principles of data management;
- Read Chapters 2-5 to learn about how data management is currently implemented and how it is expected to evolve in the next few years; and
- Read ??? to find how-to’s and practical guidelines.
1.1 The principles of data management for dynamic geodata
Principles of standardised data documentation, publication, sharing and preservation have been formalised in the FAIR Guiding Principles for scientific data management and stewardship [RD3] through a process facilitated by FORCE11. FAIR stands for findability, accessibility, interoperability and reusability.
By following the FAIR principles it is easier to obtain a common approach to data management, or a unified data management model. One of the main motivations for implementing unified data management is to better serve the users of the data. Primarily, this can be approached by making user needs and requirements the guide for determining what data we provide and how. For example, it will be described below how the specification of datasets should be determined. By implementing the data management practices described here, it is expected that users will experience:
- Ease of discovering, viewing and accessing datasets;
- Standardised ways of accessing data, including downloading or streaming data, with reduced need for special solutions on the user side;
- Reduced storage needs;
- Simple and standard access to remote datasets and catalogues, using their own data visualisation and analysis tools;
- Ability to compare and combine data from internal and external sources;
- Ability to apply common data transformations, such as spatial, temporal and variable subsetting and reprojection, before downloading anything;
- Possibility to build specialised metadata catalogues and data portals targeting a specific user community.
1.1.1 External data management requirements and forcing mechanisms
Any organisation that strives to implement a FAIR data management model has to relate to external forcing mechanisms concerning data management at several levels. At the national level, the organisation must comply with national regulations as decided by the government. Some of these are indications of expected behaviour (e.g., OECD recommendations) and some are implemented through a legal framework. The Norwegian government has over time promoted free and open sharing of public data. Mechanisms for how to do this are governed by the Geodataloven (implemented as Geonorge), which is a national implementation of the European INSPIRE directive (to be amended in 2019). INSPIRE defines a federated multinational Spatial Data Infrastructure (SDI) for the European Union, similar to NSDI in the USA or UNSDI under the United Nations. The goal is to provide standardised access to data and the necessary tools to be able to work with the data in a unified manner. In short, these legal frameworks require standardised documentation (at discovery and use level; these concepts are described later) and access (through specified protocols) to the data identified.
Other external requirements and forcing mechanisms that are organisation-specific are provided in ???.
1.1.2 The data value chain
The process of getting the data from the data producer to the consumer can be viewed as a value chain. An example of a data value chain is presented in ???. Typically, data from a wide variety of providers are used in the value chain. Traditionally, the data used have been transmitted on request from one data centre to another, and used in the specific processing chains that requested the data. The focus on reuse of data in various contexts has been missing.

Datasets and metadata are what travels through the value chain. At the end of the data management value chain are the data consumers.
1.1.3 Dataset
A dataset is a collection of data. In the context of the data management model, the storage mode of the dataset is irrelevant, since access mechanisms can be decoupled from the storage layer as experienced by a data consumer. Typically, a dataset represents a number of variables in time and space. A more detailed definition is provided in the Glossary of Terms. In order to best serve the data through web services, the following principles are useful for guiding the dataset definition:
- A dataset can be a collection of variables stored in, for example, a relational database or as flat files;
- A dataset is defined as a number of spatial and/or temporal variables;
- A dataset should be defined by the information content and not the production method;
- A good dataset does not mix feature types, i.e., trajectories and gridded data should not be present in the same dataset.
The third point implies that the output of, e.g., a numerical model may be divided into several datasets that are related. This is also important in order to efficiently serve the data through web services. For instance, model variables defined on different vertical coordinates should be separated as linked datasets, since some OGC services (e.g., WMS) are unable to handle mixed coordinates in the same dataset. One important linked dataset relation is the parent-child relationship. In the numerical model example, the parent dataset would be the model simulation. This (parent) dataset encompasses all datasets created by the model simulation, such as two NetCDF-CF files (child datasets) with different information content.
Most importantly, a dataset should be defined to meet the consumer needs. This means that the specification of a dataset should follow not only the content guidelines just listed, but also address the consumer needs for data delivery, security and preservation.
1.1.4 Metadata
Metadata is a broad concept. In our data management model the term "metadata" is used in several contexts, specifically the five categories that are briefly described in the table below.
| Type | Purpose | Description | Examples |
|---|---|---|---|
| Discovery metadata | Used to find relevant data | Discovery metadata are also called index metadata and are a digital version of the library index card. They describe who did what, where and when, how to access data and potential constraints on the data. They shall also link to further information on the data, such as site metadata. | ISO 19115, GCMD |
| Use metadata | Used to understand data found | Use metadata describes the actual content of a dataset and how it is encoded. The purpose is to enable the user to understand the data without any further communication. They describe the content of variables using standardised vocabularies, units of variables, encoding of missing values, map projections, etc. | Climate and Forecast (CF) Convention, BUFR, GRIB |
| Site metadata | Used to understand data found | Site metadata are used to describe the context of observational data. They describe the location of an observation, the instrumentation, procedures, etc. To a certain extent they overlap with discovery metadata, but also extend discovery metadata. Site metadata can be used for observation network design. Site metadata can be considered a type of use metadata. | WIGOS, OGC O&M |
| Configuration metadata | Used to tune portal services for datasets intended for data consumers (e.g., WMS) | Configuration metadata are used to improve the services offered through a portal to the user community. This can, e.g., be how to best visualise a product. | |
| System metadata | Used to understand the technical structure of the data management system and track changes in it | System metadata covers, e.g., technical details of the storage system, web services, their purpose and how they interact with other components of the data management system, available and consumed storage, number of users and other KPI elements, etc. | |
The tools and facilities used to manage the information for efficient discovery and use are further described in ???.
1.1.5 A data management model based on the FAIR principles
The data management model is built upon the following principles:
- Standardisation – compliance with established international standards;
- Interoperability – enabling machine-to-machine interfaces including standardised documentation and encoding of data;
- Integrity – ensuring that data and data access can be maintained over time, and ensuring that the consumer receives the same data at any time of request;
- Traceability – documentation of the provenance of a dataset, i.e., all actions taken to produce and maintain the dataset and the usage of the data in downstream systems;
- Modularisation – enabling replacement of one component of the system without necessitating other changes.
The model’s basic functions fall into three main categories:
- Documentation of data using discovery and use metadata. The documentation identifies who, what, when, where, and how, and shall make it easy for consumers to find and understand data. This requires application of information containers and utilisation of controlled vocabularies and ontologies where textual representation is required. It also covers the topic of data provenance which is used to describe the origin and all actions done on a dataset. Data provenance is closely linked with workflow management. Furthermore, it covers the relationship between datasets. Application of ontologies in data documentation is closely linked to the concept of linked data.
- Publication and sharing of data focuses on making data accessible to consumers internally and externally. Application of standardised approaches is vital, along with cost effective solutions that are sustainable. Direct integration of data in applications for analysis through data streaming minimises the complexity and overhead in dissemination solutions. This category also covers persistent identifiers for data.
- Preservation of data includes short and long term management of data, which secures access and availability throughout the lifespan of the data. Good solutions in this area depend on expected and actual usage of the data. Preservation of data includes the concept of data life cycle, i.e., the documented flow of data from initial storage through to obsolescence and permanent archiving (or deletion) and preserving the metadata for the same data (even after deleting).
1.2. Human roles in data management
1.2.1 Context
Data is processed and interpreted to generate knowledge (e.g., about the weather) for end users. The knowledge can be presented as information in the form of actual data, illustrations, text or other forms of communication. In this context, an illustration is a representation of data, whereas data means the numerical values needed to analyse and interpret a natural process (i.e., calibrated or with calibration information; it must be possible to understand the meaning of the numerical value from the available and machine-readable information).

Definition
Data here means the numerical values needed to analyse and interpret a natural process (i.e., calibrated or with calibration information, provenance, etc.; it must be possible to understand the meaning of the numerical value from the available and machine-readable information).
Advanced users typically consume some type of data in order to process and interpret it, and produce new knowledge, e.g., in the form of a new dataset or other information. The datasets can be organised in different levels, such as the WMO WIGOS definition for levels of data. Less advanced users apply information based on data (e.g., an illustration) to make decisions (e.g., clothing adapted to the forecast weather).
1.2.2 User definitions
We define two types of users:
- Producers: Those that create / produce data
- Consumers: Those that consume data
A consumer of one level of data is typically a producer of data at the next level. A user can both consume data and produce data, or just have one of these roles (e.g., at the start/end of the production chain).

1.2.2.1 Data consumers
We distinguish three types of data consumers: (1) advanced, (2) intermediate, and (3) simple. These are defined below.
1.2.2.1.1 Advanced consumers
Advanced consumers require information in the form of data and metadata (including provenance) to gain a full understanding of what data exists and how to use it (discovery and use metadata), and to automate the generation of derived data (new knowledge generation), verification (of information), and validation of data products.
Example questions:
- Needs all historical weather data that can be used to model / predict the weather in the future
Specific consumers:
- Researcher (e.g., for climate projections within the "Klima i Norge 2100" research project)
1.2.2.1.2 Intermediate consumers
Intermediate consumers need enough information to find data and understand if it can answer their question(s) (discovery and use metadata). Also, they often want to cross reference a dataset with another dataset or metadata for inter-comparative verification of information.
Example questions:
- Is this observation a record / weather extreme (coldest, warmest, wettest)?
- What was the amount of rain last month in a certain watershed?
Specific consumers:
- Klimavakt (MET)
- Developer (app, website, control systems, machine learning, etc.)
- Energy sector (hydro, energy prices)
- External partners
1.2.2.1.3 Simple consumers
Simple consumers do not have any prior knowledge about the data. Information in the form of text or illustrations is sufficient for their decision making. They do not need to understand either data or metadata, and they are most likely looking for answers to simple questions.
Example questions:
- Will it be raining today?
- Can the event take place, or will the weather impede it?
- When should I harvest my crops?
Specific consumers:
- Event organizer
- Journalist
- Farmer, or other people who work with the land, like tree planters
Note
An advanced consumer may discover information while acting in the role of a simple consumer. Such a user may, for some reason, be interested in tracking the data in order to use it together with other data (interoperability) or to verify the information. Therefore, it is important to have provenance metadata pointing to the basic data source(s) also at the simplest information level.
1.2.2.2 Data producers
A producer is an advanced consumer at one level of data who generates new information at a higher level. This new information could be in the form of actual data or simple information, such as an illustration or a text summary. It is essential that any information can be traced back to the source(s).
1.2.2.3 Data Management Roles
Between the data providers and data consumers are the processes that manage and deliver the datasets (cf. ???). A number of human roles may be defined with responsibilities that, together, ensure that these processes are carried out in accordance with the data management requirements of the organisation. The definition and filling of these roles depend heavily on the particular organisation, and each organisation must devise its own best solution.
1.3 Introduction to the data management at (insert organisation name here)
1.3.1 Background at (insert organisation name here)
1.3.2 External data management requirements and forcing mechanisms specific to (insert organisation name here)
1.3.3 Data Management roles at (insert organisation name here)
1.4 Summary of data management requirements
The data management regime described in this DMH follows the Arctic Data Centre model and shall ensure that:
- There are relevant metadata for all datasets, and both data and metadata are available in a form and in such a way that they can be utilised by both humans and machines
- There are sufficient metadata for each dataset for both discovery and use purposes
- Discovery metadata are indexed and can be retrieved from available services in a standard way and with standard protocols
- There are interfaces for discovery, visualisation and download, as well as portals for human access, that operate seamlessly across institutions
- The data are described in a relevant, standardised and managed vocabulary that supports machine-machine interfaces
- Datasets have attached a unique and permanent identifier that enables traceability
- Datasets have licensing that ensures free use and reuse wherever possible
- Datasets are available for download in a standard form according to the FAIR guiding principles and through standard protocols that are accepted and utilised in the user environment
- There are authentication and authorisation mechanisms that ensure access control to data with restrictions, and that are compatible with and coupled to relevant public authentication solutions (FEIDE, eduGAIN, Google, etc.)
- There is an organisation that provides for the management of each dataset throughout its lifetime (life cycle management)
- There is documentation that describes physical storage, lifetime of each dataset, degree of storage redundancy, metadata consistency methods, how dataset versioning is implemented and unique IDs to ensure traceability
- The organisation provides seamless access to data from distributed data centres through various portals
- The above and a business model at dataset level are described in a Data Management Plan (DMP)
- There are services or tools that provide the following functionalities on the datasets:
  - Transformations:
    - Subsetting
    - Slicing of gridded datasets to points, sections, profiles
    - Reprojection
    - Resampling
    - Reformatting
  - Visualisation (time series, mapping services, etc.)
  - Aggregation
  - Upload of new datasets (including enabling and configuring data access services)
2. Structuring and Documenting Data for Efficient Discovery and Use
2.1 Purpose
In order to properly find, understand and use geophysical data, standardised encoding and documentation are required, i.e., metadata.
Both discovery metadata and use metadata can be embedded in the files produced for a dataset through utilisation of self-explaining file formats. If properly done by the data producer, publication and preservation of data through services is simplified and can be automated.
2.2 Implementation
An essential prerequisite for structuring and documenting data is the specification of the dataset(s), cf. ???. The dataset is the basic building block of our data management model; all the documentation and services described in this DMH are built on datasets. The dataset specification is the first step in structuring one’s data for efficient management, and it is mandatory.
2.3 Structuring and documenting data at [insert organisation here]
2.3.1 Current practice in structuring and documenting data
Table 2. Data types available at [insert institution here], with the file formats supported. The primary file format is marked in bold
| Supported file formats | Datatype | Available metadata | Examples | Comments |
|---|---|---|---|---|
2.3.2 Planned developments in the near-term (< 2 years)
2.3.3 Expected evolution in the longer term (> 2 years)
3. Data Services
3.1 Purpose
The purpose of this chapter is to describe services that benefit from the standardisation performed in the previous step. The information structures described in the previous chapter pave the way for efficient data discovery and use through tools and automated services. Implementation of the services must be in line with the institute’s delivery architecture.
Data services include:
- Data ingestion, storing the data in the proper locations for long term preservation and sharing;
- Data cataloging, extracting the relevant information for proper discovery of the data;
- Configuration of visualisation and data publication services (e.g., OGC WMS and OPeNDAP).
3.2 Implementation
When planning and implementing data services, there are a number of external requirements that constrain choices, especially if reuse of solutions nationally and internationally is intended and wanted for the data in question. At the national level, important constraints are imposed by the national implementation of the INSPIRE directive through Norge digitalt.
3.2.1 Legacy
3.2.2 Planned developments in near-term (< 2 years)
3.2.3 Expected evolution in the longer term (> 2 years)
4. User Portals and Documentation
4.1 Purpose
The purpose of this chapter is to describe the human interfaces which data consumers would use to navigate data and the related services. A portal is an entry point for data consumers, enabling them to discover and search for datasets and services, and providing sufficient documentation and guidance to ensure that they are able to serve themselves using the interactive and machine interfaces offered.
Here, we can distinguish between a general portal for all publishable datasets from the institution and targeted portals that offer a focused selection of data, which may include external datasets. Targeted portals cater to specific user groups and may have a limited lifetime, but can also be long-term commitments.
4.2 Implementation of user portals at NINA
Table 3: User portals in use at NINA
| User portal | Description | General or targeted portal | Data consumer |
|---|---|---|---|
| Living Norway data portal | Catalogue and endpoint for data/metadata of projects associated with the Living Norway project | | Biodiversity researchers and decision makers |
4.2.1 Current implementation
4.2.2 Planned developments in near-term (< 2 years)
4.2.3 Expected evolution in the longer term (> 2 years)
5. Data Governance
5.1 Purpose
This chapter describes how we organise and steer data management activities in order to ensure that:
- The guidelines described above are implemented throughout the organisation;
- Our data management practices are in line with and contribute to the institute’s strategic aims;
- Our data management regime is subject to review, analysis and revision in a timely manner.
These higher level aspects of data management are often referred to as data governance. A useful definition is:
"Data governance … is the overall management of the availability, usability, integrity and security of data used in an enterprise. A sound data governance program includes a governing body or council, a defined set of procedures and a plan to execute those procedures."
In this chapter we address many aspects of this definition, but a full description of data governance touches on management structures that are beyond the scope of this handbook.
5.2 Data life cycle management
Data life cycle management is steered by documentation describing how data generated or used in an activity will be handled throughout the lifetime of the activity and after the activity has been completed. This is living documentation that follows the activity and specifies what kind of data will be generated or acquired, how the data will be described, where the data will be stored, whether and how the data can be shared, and how the data will be retired (archived or deleted). The purpose of life cycle management is to safeguard the data, not just during their “active” period but also for future reuse of the data, and to facilitate cost-effective data handling.
This DMH recommends the following concepts of life cycle management to be implemented for the institution:
- An institution specific Data Management Handbook (DMH) based on a common general template;
- Extended discovery metadata for data in internal production chains (these are metadata elements that provide the necessary information for life cycle management just described); and
- A Data Management Plan (DMP) document (a DMP is expected for datasets produced in external projects, but may also be useful for internal datasets, as a supplement to the extended discovery metadata).
The goal is that life cycle management information shall be readily available for every dataset managed by the institute. How these concepts are implemented is described in the subsections below.
5.2.1 Data Management Plan
A Data Management Plan (DMP) is a document that describes textually how the data life cycle management will be carried out for datasets used and produced in specific projects. Generally, these are externally financed projects for which such documentation is required by funding agencies. However, larger internal projects covering many datasets may also find it beneficial to create a specific document of this type.
Currently, agencies funding R&D (such as NFR and the EU) do not strictly require a DMP from the start of any project. However, for projects in the geosciences, data management is an issue that must be addressed, and the agencies strongly recommend a DMP solution. For example, NFR publishes guidelines for the contents of a DMP, including links to tools (templates and online services); these guidelines are recommended for any data management project or activity and will in time become a requirement according to NFR.
5.3 Data governance at NINA
5.3.1 Current implementation
5.3.1.1 Organisational Roles
5.3.1.2 Status DMH
5.3.1.3 Status Discovery metadata
5.3.1.3 Status DMP
5.3.2 Planned developments in the near-term (< 2 years)
Revise DMH annually or when needed.
5.3.3 Expected evolution in the longer term (> 2 years)
Revise DMH annually or when needed.
6. Practical Guides
This chapter includes how-to’s and other practical guidance for data producers.
6.1 Create a Data Management Plan (DMP)
The funding agency of your project will usually provide requirements, guidelines or a template for the DMP. If this is not the case, or for datasets that are not part of a project, use the template provided by your institution or the template based on the recommendations by Science Europe.
6.1.1 Using easyDMP
- Log in to easyDMP; use Dataporten if your institution supports it, otherwise pick one of the other login methods.
- Click on + Create a new plan and pick a template.
- By using the Summary button from page two onwards, you can get an overview of all the questions.
6.1.2 Publishing the plan
Currently you can use the export function in easyDMP to download an HTML or PDF version of the DMP and use it further. This might change if "Hosted DMP" gets implemented.
6.2 Submitting data as NetCDF-CF
6.2.1 Workflow
- Create a NetCDF-CF file (see Creating NetCDF-CF files)
- Add discovery metadata as global attributes (see the global attributes section)
- Add variables and use metadata following the CF conventions
- Store the NetCDF-CF file in a suitable location, and distribute it via thredds or another dap server (see, e.g., How to add NetCDF-CF data to thredds)
- Register your dataset in a searchable catalog (see How to register your data in the catalog service)
- Test that your dataset contains the necessary discovery metadata and create an MMD xml file (see Generation of MMD xml file from NetCDF-CF)
- Test the MMD xml file (see Test the MMD xml file)
- Push the MMD xml file to the discovery metadata catalog (see Push the MMD xml file to the discovery metadata catalog)
6.2.2 Creating NetCDF-CF files
By documenting and formatting your data using NetCDF following the CF conventions and the Attribute Convention for Data Discovery (ACDD), MMD files can be automatically generated from the NetCDF files. The CF conventions provide a controlled vocabulary and a definitive description of what the data in each variable represent, as well as the spatial and temporal properties of the data. The ACDD vocabulary describes attributes recommended for describing a NetCDF dataset to data discovery systems. See, e.g., the netCDF4-python docs or the xarray docs for documentation about how to create netCDF files.
The ACDD recommendations should be followed in order to properly document your netCDF-CF files. The tables below summarise the required and recommended ACDD attributes, as well as some additional attributes that are needed to properly populate a discovery metadata catalog which fulfills the requirements of international standards (e.g., GCMD/DIF, the INSPIRE and WMO profiles of ISO19115, etc.).
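As an illustration, the following minimal Python sketch shows how CF use metadata and ACDD discovery metadata can be written into the same file with the netCDF4-python library. The file name, variable and attribute values are placeholders invented for this example and must be replaced with values describing the actual dataset:

from datetime import datetime, timezone

import numpy as np
from netCDF4 import Dataset

now = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
with Dataset('example_sst.nc', 'w', format='NETCDF4') as nc:
    # ACDD discovery metadata as global attributes (placeholder values)
    nc.id = '<uuid-of-this-dataset>'
    nc.naming_authority = 'no.example'
    nc.title = 'Example sea surface temperature field'
    nc.summary = 'Illustrative dataset used in the DMH how-to.'
    nc.date_created = now
    nc.time_coverage_start = '2021-01-01T00:00:00Z'
    nc.keywords = 'GCMDSK:Earth Science > Oceans > Ocean Temperature'
    nc.keywords_vocabulary = ('GCMDSK:GCMD Science Keywords:'
                              'https://gcmd.earthdata.nasa.gov/kms/concepts/concept_scheme/sciencekeywords')
    nc.geospatial_lat_min = 50.0
    nc.geospatial_lat_max = 60.0
    nc.geospatial_lon_min = 0.0
    nc.geospatial_lon_max = 10.0
    nc.license = 'CC-BY-4.0'
    # CF use metadata on dimensions and variables
    nc.createDimension('lat', 11)
    nc.createDimension('lon', 11)
    lat = nc.createVariable('lat', 'f4', ('lat',))
    lat.standard_name = 'latitude'
    lat.units = 'degrees_north'
    lat[:] = np.linspace(50.0, 60.0, 11)
    lon = nc.createVariable('lon', 'f4', ('lon',))
    lon.standard_name = 'longitude'
    lon.units = 'degrees_east'
    lon[:] = np.linspace(0.0, 10.0, 11)
    sst = nc.createVariable('sst', 'f4', ('lat', 'lon'), fill_value=-999.0)
    sst.standard_name = 'sea_surface_temperature'
    sst.units = 'K'
    sst[:] = 280.0 + np.random.rand(11, 11)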
6.2.2.1 Notes
Keywords describe the content of your dataset following a given vocabulary. You may use any vocabularies to define your keywords, but a link to the keyword definitions should be provided in the ``keywords_vocabulary`` attribute. This attribute provides information about the vocabulary defining the keywords used in the ``keywords`` attribute. Example:
:keywords_vocabulary = "GCMDSK:GCMD Science Keywords:https://gcmd.earthdata.nasa.gov/kms/concepts/concept_scheme/sciencekeywords, GEMET:INSPIRE Themes:http://inspire.ec.europa.eu/theme, NORTHEMES:GeoNorge Themes:https://register.geonorge.no/metadata-kodelister/nasjonal-temainndeling" ;
Note that the GCMDSK, GEMET and NORTHEMES vocabularies are required for indexing in S-ENDA and Geonorge. You may find appropriate keywords at the following links:
The keywords should be provided by the ``keywords`` attribute as a comma separated list with a short name defining the vocabulary used, followed by the actual keyword, i.e., ``short_name:keyword``. Example:
:keywords = "GCMDSK:Earth Science > Atmosphere > Atmospheric radiation, GEMET:Meteorological geographical features, GEMET:Atmospheric conditions, NORTHEMES:Weather and climate" ;
See https://adc.met.no/node/96 for more information about how to define the ACDD keywords.
A data license provides information about any restrictions on the use of the dataset. To support a linked data approach, the ``license`` element should be supported by a ``license_resource`` element, providing a link to the license definition. Example:
:license = "CC-BY-4.0" ;
:license_resource = "http://spdx.org/licenses/CC-BY-4.0" ;
6.2.2.2 List of Attributes
This section provides lists of ACDD elements that are required and recommended, as well as some extra elements that are needed to fully support our data management needs. The right columns of these tables provide the MET Norway Metadata Specification (MMD) fields that map to the ACDD (and our extension to ACDD) elements. Please refer to MMD for definitions of these elements, as well as controlled vocabularies that should be used. Note that the below tables are automatically generated - check https://github.com/metno/py-mmd-tools/blob/master/py_mmd_tools/mmd_elements.yaml if anything is unclear.
In order to check your netCDF-CF files, and to create MMD xml files, you can use the nc2mmd.py script in the py-mmd-tools Python package.
The following ACDD elements are required:
| ACDD Attribute | MMD equivalent | Comment |
|---|---|---|
| id | metadata_identifier | Required, and should be UUID. No repetition allowed. |
| naming_authority | metadata_identifier | Required. We recommend using reverse-DNS naming. No repetition allowed. |
| date_created | last_metadata_update>update>datetime | Format as ISO8601. |
| title | title>title | Use ACDD extension "title_no" for Norwegian translation. |
| summary | abstract>abstract | Use ACDD extension "summary_no" for Norwegian translation. |
| time_coverage_start | temporal_extent>start_date | Comma separated list. |
| geospatial_lat_max | geographic_extent>rectangle>north | No repetition allowed. |
| geospatial_lat_min | geographic_extent>rectangle>south | No repetition allowed. |
| geospatial_lon_max | geographic_extent>rectangle>east | No repetition allowed. |
| geospatial_lon_min | geographic_extent>rectangle>west | No repetition allowed. |
| keywords | keywords>keyword | Comma separated list. |
| keywords_vocabulary | keywords>vocabulary | Comma separated list. |
The following ACDD elements are (highly) recommended:
| ACDD Attribute | Default | MMD equivalent | Comment |
|---|---|---|---|
| date_metadata_modified | | last_metadata_update>update>datetime | Format as ISO8601. Comma separated list if more than once. |
| time_coverage_end | | temporal_extent>end_date | Comma separated list. |
| geospatial_bounds | | geographic_extent>polygon | No repetition allowed. |
| processing_level | | operational_status | No repetition allowed. See the MMD docs for valid keywords. |
| license | | use_constraint>identifier | No repetition allowed. |
| creator_role | Investigator | personnel>role | Comma separated list. |
| contributor_role | Investigator | personnel>role | Comma separated list. |
| creator_name | Not available | personnel>name | Comma separated list. |
| contributor_name | Not available | personnel>name | Comma separated list. |
| creator_email | Not available | personnel>email | Comma separated list. |
| creator_institution | Not available | personnel>organisation | Comma separated list. |
| institution | | data_center>data_center_name>long_name | Comma separated list. |
| publisher_url | | data_center>data_center_url | Comma separated list. |
| project | | project>long_name | Semicolon separated list. |
| platform | | platform>long_name | Comma separated list. |
| platform_vocabulary | | platform>resource | Comma separated list. |
| instrument | | platform>instrument>long_name | Comma separated list. |
| instrument_vocabulary | | platform>instrument>resource | Comma separated list. |
| source | | activity_type | Semicolon separated list. |
| creator_name | Not available | dataset_citation>author | Comma separated list. |
| date_created | | dataset_citation>publication_date | Comma separated list. |
| title | | dataset_citation>title | |
| publisher_name | | dataset_citation>publisher | Comma separated list. |
| metadata_link | | dataset_citation>url | Comma separated list. |
| references | | dataset_citation>other | Comma separated list. |
The following elements are ACDD extensions that are needed to improve (meta)data interoperability. Please refer to the documentation of MMD for more details:
| Necessary non-ACDD Attribute | Default | MMD equivalent | Comment |
|---|---|---|---|
| spatial_representation | | spatial_representation | No repetition allowed. |
| alternate_identifier | | alternate_identifier>alternate_identifier | Alternative identifier for the dataset (but not DOI). Comma separated list. |
| alternate_identifier_type | | alternate_identifier>type | Identification of the type of identifier used. Comma separated list. |
| date_metadata_modified_type | | last_metadata_update>update>type | E.g., major or minor modification. Comma separated list. |
| date_created_type | Created | last_metadata_update>update>type | |
| title_no | | title>title | Used for Norwegian version of the title. |
| title_lang | en | title>lang | ISO language code. |
| summary_no | | abstract>abstract | Used for Norwegian version of the abstract. |
| summary_lang | en | abstract>lang | ISO language code. |
| dataset_production_status | Complete | dataset_production_status | No repetition allowed. |
| access_constraint | | access_constraint | No repetition allowed. |
| license_resource | | use_constraint>resource | No repetition allowed. |
| contributor_email | Not available | personnel>email | Comma separated list. |
| contributor_institution | | personnel>organisation | |
| contributor_organisation | | personnel>organisation | |
| institution_short_name | | data_center>data_center_name>short_name | Comma separated list. |
| related_dataset_id | | related_dataset>related_dataset | Comma separated list. |
| related_dataset_relation_type | | related_dataset>relation_type | Comma separated list. |
| iso_topic_category | | iso_topic_category | Comma separated list. |
| project_short_name | | project>short_name | Semicolon separated list. |
| quality_control | | quality_control | No repetition allowed. |
| doi | | dataset_citation>doi | |
6.2.3 How to add NetCDF-CF data to thredds
This section should contain institution specific information about how to add netcdf-cf files to thredds.
6.2.4 How to register your data in the catalog service
In order to make a dataset findable, it must be registered in a searchable catalog with appropriate metadata. The (meta)data catalog is indexed and exposed through CSW.
The following needs to be done:
- Generate an MMD xml file from your NetCDF-CF file (see Generation of MMD xml file from NetCDF-CF)
- Test your MMD xml metadata file (see Test the MMD xml file)
- Push the MMD xml file to the discovery metadata catalog (see Push the MMD xml file to the discovery metadata catalog)
6.2.4.1 Generation of MMD xml file from NetCDF-CF
Clone the py-mmd-tools repo and make a local installation with, e.g., ``pip install .``. This should bring in all needed dependencies (we recommend using a virtual environment).
Then, generate your mmd xml file as follows:
cd script
./nc2mmd.py -i <your netcdf file> -o <your xml output directory>
See ./nc2mmd.py --help for documentation and extra options.
You will find Extensible Stylesheet Language Transformations (XSLT) documents in the MMD repository. These can be used to translate the metadata documents from MMD to other vocabularies, such as ISO19115:
./bin/convert_from_mmd -i <your mmd xml file> -f iso -o <your iso output file name>
Note that the discovery metadata catalog ingestion tool will take care of translations from MMD, so you don’t need to worry about that unless you have special interest in it.
6.2.4.2 Test the MMD xml file
Install the dmci app, and run the usage example locally. This will return an error message if anything is wrong with your MMD file.
6.2.4.3 Push the MMD xml file to the discovery metadata catalog
For development and verification purposes:
curl --data-binary "@<PATH_TO_MMD_FILE>" https://dmci-*.s-enda.k8s.met.no/v1/insert
where * should be either dev or staging.
For production (the official catalog):
curl --data-binary "@<PATH_TO_MMD_FILE>" https://dmci.s-enda.k8s.met.no/v1/insert
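For scripted pipelines, the same insert request can be issued from Python instead of curl. A minimal sketch, assuming the requests library is available; the file name is a placeholder and the production URL is the one given above:

import requests

mmd_file = 'my_dataset_mmd.xml'  # placeholder: path to the MMD file to push
with open(mmd_file, 'rb') as f:
    response = requests.post('https://dmci.s-enda.k8s.met.no/v1/insert', data=f.read())
print(response.status_code, response.text)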
6.3 Searching data in the Catalog Service for the Web (CSW) interface
6.3.1 Using OpenSearch
6.3.1.1 Local test machines
The vagrant-s-enda environment found at vagrant-s-enda provides OpenSearch support through PyCSW. To test OpenSearch via the browser, start the vagrant-s-enda vm (vagrant up) and go to the following address:
http://10.10.10.10/pycsw/csw.py?mode=opensearch&service=CSW&version=2.0.2&request=GetCapabilities
6.3.1.2 Online catalog
For searching the online metadata catalog, the base url (http://10.10.10.10/) must be replaced by https://csw.s-enda.k8s.met.no/:
http://csw.s-enda.k8s.met.no/?mode=opensearch&service=CSW&version=2.0.2&request=GetRecords&elementsetname=full&typenames=csw:Record&resulttype=results
6.3.1.3 OpenSearch examples
To find all datasets in the catalog:
https://csw.s-enda.k8s.met.no/?mode=opensearch&service=CSW&version=2.0.2&request=GetRecords&elementsetname=full&typenames=csw:Record&resulttype=results
Or datasets within a given time span:
http://csw.s-enda.k8s.met.no/?mode=opensearch&service=CSW&version=2.0.2&request=GetRecords&elementsetname=full&typenames=csw:Record&resulttype=results&time=2000-01-01/2020-09-01
Or datasets within a geographical domain (defined as a box with parameters min_longitude, min_latitude, max_longitude, max_latitude):
https://csw.s-enda.k8s.met.no/?mode=opensearch&service=CSW&version=2.0.2&request=GetRecords&elementsetname=full&typenames=csw:Record&resulttype=results&bbox=0,40,10,60
Or, datasets from any of the Sentinel satellites:
https://csw.s-enda.k8s.met.no/?mode=opensearch&service=CSW&version=2.0.2&request=GetRecords&elementsetname=full&typenames=csw:Record&resulttype=results&q=sentinel
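These OpenSearch queries can also be scripted. A minimal Python sketch, assuming the requests library, that reproduces the last query above (the free-text term 'sentinel' is only an example):

import requests

base_url = 'https://csw.s-enda.k8s.met.no/'
params = {
    'mode': 'opensearch',
    'service': 'CSW',
    'version': '2.0.2',
    'request': 'GetRecords',
    'elementsetname': 'full',
    'typenames': 'csw:Record',
    'resulttype': 'results',
    'q': 'sentinel',  # free-text search term (example)
}
response = requests.get(base_url, params=params)
print(response.text)  # XML response listing the matching records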
6.3.2 Advanced geographical search
PyCSW OpenSearch only supports geographical searches that query a bounding box. For more advanced geographical searches, one must write specific XML files. For example:
- To find all datasets containing a point:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<csw:GetRecords
xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
xmlns:ogc="http://www.opengis.net/ogc"
xmlns:gml="http://www.opengis.net/gml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
service="CSW"
version="2.0.2"
resultType="results"
maxRecords="10"
outputFormat="application/xml"
outputSchema="http://www.opengis.net/cat/csw/2.0.2"
xsi:schemaLocation="http://www.opengis.net/cat/csw/2.0.2 http://schemas.opengis.net/csw/2.0.2/CSW-discovery.xsd" >
<csw:Query typeNames="csw:Record">
<csw:ElementSetName>full</csw:ElementSetName>
<csw:Constraint version="1.1.0">
<ogc:Filter>
<ogc:Contains>
<ogc:PropertyName>ows:BoundingBox</ogc:PropertyName>
<gml:Point>
<gml:pos srsDimension="2">59.0 4.0</gml:pos>
</gml:Point>
</ogc:Contains>
</ogc:Filter>
</csw:Constraint>
</csw:Query>
</csw:GetRecords>
- To find all datasets intersecting a polygon:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<csw:GetRecords
xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
xmlns:gml="http://www.opengis.net/gml"
xmlns:ogc="http://www.opengis.net/ogc"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
service="CSW"
version="2.0.2"
resultType="results"
maxRecords="10"
outputFormat="application/xml"
outputSchema="http://www.opengis.net/cat/csw/2.0.2"
xsi:schemaLocation="http://www.opengis.net/cat/csw/2.0.2 http://schemas.opengis.net/csw/2.0.2/CSW-discovery.xsd" >
<csw:Query typeNames="csw:Record">
<csw:ElementSetName>full</csw:ElementSetName>
<csw:Constraint version="1.1.0">
<ogc:Filter>
<ogc:Intersects>
<ogc:PropertyName>ows:BoundingBox</ogc:PropertyName>
<gml:Polygon>
<gml:exterior>
<gml:LinearRing>
<gml:posList>
47.00 -5.00 55.00 -5.00 55.00 20.00 47.00 20.00 47.00 -5.00
</gml:posList>
</gml:LinearRing>
</gml:exterior>
</gml:Polygon>
</ogc:Intersects>
</ogc:Filter>
</csw:Constraint>
</csw:Query>
</csw:GetRecords>
- Then, you can query the CSW endpoint with, e.g., Python (the file name below is a placeholder for one of the XML requests above):
import requests

my_xml_request = 'polygon_query.xml'  # placeholder: path to one of the XML request files above
response = requests.post('https://csw.s-enda.k8s.met.no', data=open(my_xml_request).read())
print(response.text)
6.3.3 Web portals
GeoNorge.no
TODO: describe how to search in geonorge, possibly with screenshots
6.3.4 QGIS
MET Norway’s S-ENDA CSW catalog service is available at https://csw.s-enda.k8s.met.no. This can be used from QGIS as follows:
- Select the Web > MetaSearch > MetaSearch menu item
- Select Services > New
- Type, e.g., csw.s-enda.k8s.met.no for the name
- Type https://csw.s-enda.k8s.met.no for the URL
Under the Search tab, you can then add search parameters, click Search, and get a list of available datasets.
Acknowledgements
At various stages during the writing of the first version of this handbook, we have solicited comments on the manuscript from coworkers at MET Norway (in alphabetical order): Åsmund Bakketun, Arild Burud, Lara Ferrighi, Håvard Futsæter and Nina Larsgård. Their comments and advice are gratefully acknowledged. In addition, we thank members of the top management at MET Norway, including Lars-Anders Breivik, Bård Fjukstad, Jørn Kristensen, Anne-Cecilie Riiser, Roar Skålin and Cecilie Stenersen, who have provided valuable criticism and advice.
While working on the second version of this handbook, valuable input has come from Matteo De Stefano from NINA. This input has made it possible to transform this handbook into a tool that can be adopted by institutions outside of MET Norway.
Glossary of Terms and Names
| Term | Description |
|---|---|
| Application service | TBC |
| CDM dataset | A dataset that “may be a NetCDF, HDF5, GRIB, etc. file, an OPeNDAP dataset, a collection of files, or anything else which can be accessed through the NetCDF API.” Unidata Common Data Model |
| Configuration metadata | See Configuration metadata definition in Table 2 |
| Controlled vocabulary | A carefully selected list of terms (words and phrases) controlled by some authority. They are used to tag information elements (such as datasets) so that they are easier to search for. (see Wikipedia article) A basic element in the implementation of the Semantic web. |
| Data Governance | See the definition quoted in Chapter 5, taken from TechTarget (https://searchdatamanagement.techtarget.com/definition/data-governance). An alternative definition by George Firican: “Data Governance is the discipline which provides all data management practices with the necessary foundation, strategy, and structure needed to ensure that data is managed as an asset and transformed into meaningful information.” (http://www.lightsondata.com/what-is-data-governance/ which also contains several more definitions.) |
| Data life cycle management | “Data life cycle management (DLM) is a policy-based approach to managing the flow of an information system’s data throughout its life cycle: from creation and initial storage to the time when it becomes obsolete and is deleted.” Excerpt from TechTarget article. Alias: life cycle management |
| Data Management Plan | “A data management plan (DMP) is a written document that describes the data you expect to acquire or generate during the course of a research project, how you will manage, describe, analyse, and store those data, and what mechanisms you will use at the end of your project to share and preserve your data.” Stanford Libraries |
| Data centre | A combination of a (distributed) data repository and the data availability services and information about them (e.g., a metadata catalog). A data centre may include contributions from several other data centres. |
| Data management | How datasets are handled by the organisation through the entire value chain, including receiving, storing, metadata management and data retrieval. |
| Data provenance | “The term ‘data provenance’ refers to a record trail that accounts for the origin of a piece of data (in a database, document or repository) together with an explanation of how and why it got to the present place.” (Gupta, 2009). See also Boohers (2015) |
| Data repository | A set of distributed components that will hold the data and ensure they can be queried and accessed according to agreed protocols. This component is also known as a Data Node. |
| Dataset | A dataset is a pre-defined grouping or collection of related data for an intended use. Datasets may be categorised by: Source, such as observations (in situ, remotely sensed) and numerical model projections and analyses; Processing level, such as “raw data” (values measured by an instrument), calibrated data, quality-controlled data, derived parameters (preferably with error estimates), temporally and/or spatially aggregated variables; Data type, including point data, sections and profiles, lines and polylines, polygons, gridded data, volume data, and time series (of points, grids, etc.). Data having all of the same characteristics in each category, but different independent variable ranges and/or responding to a specific need, are normally considered part of a single dataset. In the context of data preservation a dataset consists of the data records and their associated knowledge (information, tools). In practice, our datasets should conform to the Unidata CDM dataset definition, as much as possible. |
| Discovery metadata | See Discovery metadata definition in Table 2 |
| Dynamic geodata | Data describing geophysical processes which are continuously evolving over time. Typically these data are used for monitoring and prediction of the weather, sea, climate and environment. Dynamic geodata is weather, environment and climate-related data that changes in space and time and is thus descriptive of processes in nature. Examples are weather observations, weather forecasts, pollution (environmental toxins) in water, air and sea, information on the drift of cod eggs and salmon lice, water flow in rivers, driving conditions on the roads and the distribution of sea ice. Dynamic geodata provides important constraints for many decision-making processes and activities in society. |
| FAIR principles | The four foundational principles of good data management and stewardship: Findability, Accessibility, Interoperability and Reusability. Nature article [RD3], FAIR Data Principles, FAIR metrics proposal, EU H2020 Guidelines |
| Feature type | A categorisation of data according to how they are stored, for example, grid, time series, profile, etc. It has been formalised in the NetCDF/CF feature type table, which currently defines eight feature types. |
| Geodataloven | "Norwegian regulation toward good and efficient access to public geographic information for public and private purposes." See Deling av geodata – Geodataloven. |
| Geonorge | "Geonorge is the national website for map data and other location information in Norway. Users of map data can search for any such information available and access it here." See Geonorge. |
| Geographic Information System | A geographic information system (GIS) is a system designed to capture, store, manipulate, analyze, manage and present spatial or geographic data. (Clarke, K. C., 1986) GIS systems have lately evolved in distributed Spatial Data Infrastructures (SDI) |
| Glossary | Terms and their definitions, possibly with synonyms. |
| Interoperability | The ability of data or tools from non-cooperating resources to integrate or work together with minimal effort. |
| Linked data | A method of publishing structured data so that they can be interlinked and become more useful through semantic queries, i.e., through machine-machine interactions. (see Wikipedia article) |
| Ontology | A set of concepts with attributes and relationships that define a domain of knowledge. |
| OpenSearch | A collection of simple formats for the sharing of search results (OpenSearch) |
| Product | "Product" is not a uniquely defined term among the various providers of dynamical geodata, either nationally or internationally. It is often used synonymously with "dataset." For the sake of clarity, "product" is not used in this handbook. The term "dataset" is adequate for our purpose. |
| Semantic web | “The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries". W3C (see Wikipedia article) |
| Site metadata | See Site metadata definition in Table 2 |
| Spatial Data Infrastructure | "Spatial Data Infrastructure (SDI) is defined as a framework of policies, institutional arrangements, technologies, data, and people that enables the sharing and effective usage of geographic information by standardising formats and protocols for access and interoperability." (Tonchovska et al, 2012). SDI has evolved from GIS. Among the largest implementations are: NSDI in the USA, INSPIRE in Europe and UNSDI as an effort by the United Nations. For areas in the Arctic, there is arctic-sdi.org. |
| Unified data management | A common approach to data management in a grouping of separate data management enterprises. |
| Use metadata | See Use metadata definition in Table 2 |
| Web portal | A central website where all users can search, browse, access, transform, display and download datasets irrespective of the data repository in which the data are held. |
| Web service | Web services are used to communicate metadata, data and to offer processing services. Much effort has been put on standardisation of web services to ensure they are reusable in different contexts. In contrast to web applications, web services communicate with other programs, instead of interactively with users. (See TechTerms article) |
| Workflow management | Workflow management is the process of tracking data, software and other actions on data into a new form of the data. It is related to data provenance, but is usually used in the context of workflow management systems. |
| (Scientific) Workflow management systems | A scientific workflow system is a specialised form of a workflow management system designed specifically to compose and execute a series of computational or data manipulation steps, or workflow, in a scientific application. (Wikipedia) As of today, many different frameworks exist with their own proprietary languages, these might eventually get connected by using a common workflow definition language. |
List of Acronyms
This list contains acronyms used throughout the DMH. The column "General/Specific" indicates if the acronyms are used in the general part of the DMH (from the template) or if they are organisation specific.
| Acronym | Meaning | General/specific |
|---|---|---|
| ACDD | Attribute Convention for Dataset Discovery RD5 | |
| ADC | Arctic Data Centre (ADC) | |
| AeN | Arven etter Nansen (English: Nansen Legacy) | |
| BUFR | Binary Universal Form for the Representation of meteorological data. WMO standard format for binary data, particularly for non-gridded data (BUFR) | |
| CDM | Unidata Common Data Model (CDM) | |
| CF | Climate and Forecast Metadata Conventions (CF) | |
| CMS | Content Management System | |
| CSW | Catalog Service for the Web (CSW) | |
| DAP | Data access protocol (DAP) | |
| DBMS | DataBase Management System (DBMS) | |
| DIANA | Digital Analysis tool for visualisation of geodata, open source from MET Norway (DIANA) | |
| diana-WMS | WMS implementation in DIANA | |
| DIAS | Copernicus Data and Information Access Services (DIAS) | |
| DIF | Directory Interchange Format of GCMD (DIF) | |
| DLM | Data life cycle management (DLM) | |
| DM | Data Manager | |
| DMH | Data Management Handbook (this document) | |
| DMCG | Data Management Coordination Group | |
| DMP | Data Management Plan (DMP definition, easyDMP tool) | |
| DOI | Digital Object Identifier (DOI) | |
| eduGAIN | The Global Academic Interfederation Service (eduGAIN) | |
| ENVRI | European Environmental Research Infrastructures (ENVRI) | |
| ENVRI FAIR | "Making the ENV RIs data services FAIR." A proposal to the EU’s Horizon 2020 call INFRAEOSC-04 | |
| EOSC | European Open Science Cloud (EOSC) | |
| ERDDAP | NOAA Environmental Research Division Data Access Protocol (ERDDAP) | |
| ESA | European Space Agency (ESA) | |
| ESGF | Earth System Grid Federation (ESGF) | |
| EWC | European Weather Cloud | |
| FAIR | Findability, Accessibility, Interoperability and Reusability RD3 | |
| FEIDE | Identity Federation of the Norwegian National Research and Education Network (UNINETT) (FEIDE) | |
| FFI | Norwegian Defence Research Establishment (FFI) | |
| FORCE11 | Future of Research Communication and e-Scholarship (FORCE11) | |
| GCMD | Global Change Master Directory (GCMD) | |
| GCW | Global Cryosphere Watch (GCW) | |
| GeoAccessNO | An NFR-funded infrastructure project, 2015- (GeoAccessNO) | |
| GIS | Geographic Information System | |
| GRIB | GRIdded Binary or General Regularly-distributed Information in Binary form. WMO standard file format for gridded data (GRIB) | |
| HDF, HDF5 | Hierarchical Data Format (HDF) | |
| Hyrax | OPeNDAP 4 Data Server (Hyrax) | |
| IMR | Institute of Marine Research (IMR) | |
| INSPIRE | Infrastructure for Spatial Information in Europe (INSPIRE) | |
| ISO 19115 | ISO standard for geospatial metadata (ISO 19115-1:2014). | |
| IPY | International Polar Year (IPY) | |
| JRCC | Joint Rescue Coordination centre (Hovedredningssentralen) | |
| KDVH | KlimaDataVareHus | Specific to MET Norway |
| KPI | Key Performance Indicator (KPI) | |
| METCIM | MET Norway Crisis and Incident Management (METCIM) | Specific to MET Norway |
| METSIS | MET Norway Scientific Information System | Specific to MET Norway |
| MMD | Met.no Metadata Format MMD | |
| MOAI | Meta Open Archives Initiative server (MOAI) | |
| ncWMS | WMS implementation for NetCDF files (ncWMS) | |
| NERSC | Nansen Environmental and Remote Sensing Center (NERSC) | |
| NetCDF | Network Common Data Format (NetCDF) | |
| NetCDF/CF | A common combination of NetCDF file format with CF-compliant attributes. | |
| NFR | The Research Council of Norway (NFR) | |
| NILU | Norwegian Institute for Air Research (NILU) | |
| NIVA | Norwegian Institute for Water Research (NIVA) | |
| NMDC | Norwegian Marine Data Centre, NFR-supported infrastructure project 2013-2017 (NMDC) | |
| NorDataNet | Norwegian Scientific Data Network, an NFR-funded project 2015-2020 (NorDataNet) | |
| Norway Digital | Norwegian national spatial data infrastructure organisation (Norway Digital). Norwegian: Norge digitalt | |
| NORMAP | Norwegian Satellite Earth Observation Database for Marine and Polar Research, an NFR-funded project 2010-2016 (NORMAP) | |
| NRPA | Norwegian Radiation Protection Authority (NRPA) | |
| NSDI | National Spatial Data Infrastructure, USA (NSDI) | |
| NVE | Norwegian Water Resources and Energy Directorate (NVE) | |
| NWP | Numerical Weather Prediction | |
| OAI-PMH | Open Archives Initiative - Protocol for Metadata Harvesting (OAI-PMH) | |
| OAIS | Open Archival Information System (OAIS) | |
| OCEANOTRON | Web server dedicated to the dissemination of ocean in situ observation data collections (OCEANOTRON) | |
| OECD | The Organisation for Economic Co-operation and Development (OECD) | |
| OGC | Open Geospatial Consortium (OGC) | |
| OGC O&M | OGC Observations and Measurements standard (OGC O&M) | |
| OLA | Operational-level Agreement (OLA) | |
| OPeNDAP | Open-source Project for a Network Data Access Protocol (OPeNDAP) - reference server implementation | |
| PID | Persistent Identifier (PID) | |
| RM-ODP | Reference Model of Open Distributed Processing (RM-ODP) | |
| PROV | A W3C Working Group on provenance and a Family of Documents (PROV) | |
| SAON | Sustaining Arctic Observing Networks (SAON/IASC) | |
| SDI | Spatial Data Infrastructure | |
| SDN | SeaDataNet, Pan-European infrastructure for ocean & marine data management | |
| SIOS | Svalbard Integrated Arctic Earth Observing System | |
| SIOS-KC | SIOS Knowledge Centre, an NFR-supported project 2015-2018 (SIOS-KC) | |
| SKOS | Simple Knowledge Organization System (SKOS) | |
| SLA | Service-level Agreement (SLA) | |
| SolR | Apache Enterprise search server with a REST-like API (SolR) | |
| StInfoSys | MET Norway’s Station Information System | Specific to MET Norway |
| TDS | THREDDS Data Server (TDS) | |
| THREDDS | Thematic Real-time Environmental Distributed Data Services | |
| UNSDI | United Nations Spatial Data Infrastructure (UNSDI) | |
| UUID | Universally Unique Identifier (UUID) | |
| W3C | World Wide Web Consortium (W3C) | |
| WCS | OGC Web Coverage Service (WCS) | |
| WFS | OGC Web Feature Service (WFS) | |
| WIGOS | WMO Integrated Global Observing System (WIGOS) | |
| WIS | WMO Information System (WIS) | |
| WMO | World Meteorological Organisation (WMO) | |
| WMS | OGC Web Map Service (WMS) | |
| WPS | OGC Web Processing Service (WPS) | |
| YOPP | Year of Polar Prediction (YOPP Data Portal) |