Metadata is a message to the world

4.3.2020

Metadata is condensed information about data, quick to browse and utilise.

By choosing the information you enter as metadata – or whether to enter any at all – you are choosing what the lifecycle of your research material will be like. Wide-ranging or at least sufficient metadata enables your research material to lead a long and good life. Extensive metadata is a messenger for your research material; it communicates the research results to the research community. It also enables unforeseen new uses. Insufficient metadata makes re-use more difficult or impossible, by you and by others.

On the bookshop non-fiction shelves, books are arranged by author and by subject. One of the books catches your eye. The blurb on the back cover provides enough information about the content to increase that interest. The book is quite new, published this year. The quotes on the dust jacket look good. The contents page is even more interesting – you decide to buy the book. What happened there? Metadata helped you to make the decision.

Metadata is condensed information about something, quick to browse and utilise. In digitalising research environments, searching for information, examining it and selecting it all happen electronically. In an age of data networks, search engines and artificial intelligence, the progress of machine readability is making information more widely visible and thus hopefully more influential.

Indexes and data templates

For data to be found, its key descriptive information, i.e. its metadata, needs to be stored in a public indexing service. Indexing services present this descriptive information to potential users, so description is a form of communication. When we fetch data from data networks and databases, services and applications find it according to the criteria we have set.

Stored as a separate file, metadata is much smaller than the file it describes, so processing it takes little time. We do not have to go through vast raw files, which makes searching and browsing fast. Well-prepared metadata thus makes our lives easier.

Digitalisation also affects which metadata needs to be entered. A data model sets out the necessary metadata, describing the structure of the data and its context independently of any technical system. The data model makes data repositories, in this case research material, understandable to both people and computers.

Indexing services use a data model – not always exactly the same but to a growing extent interoperable between research communities. This interoperability means that indexing services are able to exchange metadata between them, in other words, the metadata is transferred in machine-readable form between services and in other data streams. This encourages system independence as well-described data is easier to transfer from one system to another.
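As a rough illustration of exchanging metadata between services that use slightly different data models, the sketch below renames the fields of a local record to Dublin Core element names. The local field names and the sample record are hypothetical, and Dublin Core is used here only as one familiar example of a shared vocabulary; the article does not name a specific standard.

```python
# Hypothetical local field names mapped to Dublin Core element names.
# The left-hand names are invented for this example.
LOCAL_TO_DC = {
    "name": "title",
    "author": "creator",
    "keywords": "subject",
    "abstract": "description",
    "licence": "rights",
    "pid": "identifier",
}


def to_dublin_core(local_record: dict) -> dict:
    """Rename known fields; fields without a mapping are left out."""
    return {
        LOCAL_TO_DC[key]: value
        for key, value in local_record.items()
        if key in LOCAL_TO_DC
    }


# Illustrative record; none of these values describe a real dataset.
local_record = {
    "name": "Lake water temperature measurements",
    "author": "Example Research Group",
    "keywords": ["limnology", "temperature"],
    "internal_ticket": "not exported",
}

shared_record = to_dublin_core(local_record)
```

A mapping table like this is the simplest form of crosswalk. Real services usually also have to convert values, for example dates or controlled vocabulary terms, and not only field names.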

Filling in metadata

Typically, metadata describes creation data, content structure, file format, rights, keywords and many other things that are useful in searching and evaluating data. These can be classified into different categories, e.g. as in the following groupings by Airi Salminen:

  1. Semantic metadata, which describes the meaning of the content, e.g. keywords, document title, subject, summary.
  2. Structural metadata, which describes the physical or logical structure of the content unit or the language of the content.
  3. Contextual metadata, which describes the environment of the content units in some particular situation, e.g. the time the content unit was created, the producer, the user and relationships with other content units.
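The three groups above can be sketched as a small machine-readable record. This is only an illustration: the field names and values are hypothetical and do not follow any particular metadata standard.

```python
import json

# Hypothetical record for an imaginary dataset, grouped by the three
# categories listed above. All names and values are illustrative.
record = {
    "semantic": {
        "title": "Lake water temperature measurements",
        "keywords": ["limnology", "temperature"],
        "summary": "Hourly surface temperatures from one lake.",
    },
    "structural": {
        "file_format": "text/csv",
        "language": "en",
        "columns": ["timestamp", "temperature_c"],
    },
    "contextual": {
        "created": "2020-03-04",
        "producer": "Example Research Group",
        "related_to": [],
    },
}


def to_json(rec: dict) -> str:
    """Serialise the record as machine-readable JSON text."""
    return json.dumps(rec, indent=2, sort_keys=True)


def missing_groups(rec: dict) -> list:
    """Return the names of the three groups that are absent or empty."""
    groups = ("semantic", "structural", "contextual")
    return [group for group in groups if not rec.get(group)]
```

Calling `missing_groups(record)` before publishing is one simple way to notice that a whole category of description has been left out.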

Some of this metadata can only be provided by the producer of the information for the files they have created. Some of it can be stored automatically when the document is created or edited, e.g. the metadata automatically saved with a Word file (author, editor, file size, last edited, etc.).
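As a minimal sketch of metadata that can be captured automatically, the function below reads a few technical details from the file system using the Python standard library. The function name and the keys of the returned dictionary are assumptions made for this example. Descriptive fields such as keywords or a summary still have to come from the producer.

```python
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def technical_metadata(path: Path) -> dict:
    """Collect metadata the operating system already knows about a file."""
    info = path.stat()
    # guess_type works from the file extension only; it does not open the file.
    media_type, _encoding = mimetypes.guess_type(path.name)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "file_name": path.name,
        "size_bytes": info.st_size,
        "last_modified": modified.isoformat(),
        "media_type": media_type,  # None when the extension is not recognised
    }
```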

When describing semantic content, it is recommended to use ready-made glossaries, either for specialist areas or general ones. This way, the meaning of the information is more easily related to commonly agreed meanings. For example, Finto is a common Finnish thesaurus and ontology service which enables the publication and browsing of vocabularies. The service also offers interfaces for integrating the thesauri and ontologies into other applications and systems. The Helsinki Term Bank for the Arts and Sciences (HTB) is a multidisciplinary project which aims to gather a shared, open and constantly updated terminological database for all fields of research in Finland to be used by the scientific community and citizens.

Metadata should always include authorship and copyright information. It is not appropriate to publish metadata from other people’s material without asking permission. It is a good idea for authors or copyright holders to share their metadata under a licence that allows it to be disseminated to different indexing services. A Creative Commons licence, for example, can be used to share some of the author’s rights and give the desired freedom to people using, viewing or experiencing the work.

Metadata also improves access and usability. It is recommended that metadata be produced for all research data in line with what are known as the FAIR principles (Findable, Accessible, Interoperable and Reusable). Fulfilling the FAIR principles breaks down barriers limiting the spread of research data and offers an opportunity to disseminate metadata to all indexing services that support FAIR data. The FAIR principles also encourage the creation of persistent identifiers and links to the data and adding these to the metadata, among other things. A persistent identifier is the digital-era equivalent of a bibliographical code such as an ISBN or ISSN. It is intended to keep the data findable in the future, even if the data moves to a new location.
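To make the idea of a persistent identifier concrete, the sketch below turns a DOI into a resolvable link through the public https://doi.org/ resolver. The DOI is one widely used type of persistent identifier; the article does not name a specific type, so its use here is an assumption. The function name and the sample identifier `10.1234/example-dataset` are hypothetical.

```python
def doi_to_url(doi: str) -> str:
    """Normalise a DOI string and return its https://doi.org/ link."""
    doi = doi.strip()
    # Accept the common written forms and strip them down to the bare DOI.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Every DOI starts with "10." and has a slash between prefix and suffix.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError("not a DOI: " + repr(doi))
    return "https://doi.org/" + doi
```

Storing the bare identifier in the metadata and building the link on demand keeps the record valid even if the preferred resolver address changes.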

Into the digisphere

The most important outcome of good metadata, however, is connected to research activities: it has a part to play in enabling responsible science. It makes it easier for research results to be verified and replicated in the different phases of the data lifecycle, and it enables the results to be put to use. In a world with huge amounts of digital data, it helps to keep information alive, in other words updated, issued in new versions and re-used.

New ways of using data are also new ways of conducting research. Eventually, many research tools are mainstreamed to a wider audience. The World Wide Web is a prime example.

We are now living at a watershed: in the future, digital devices, built environments and other elements will join research data as part of the information technology world, the digisphere. High-quality metadata will help research data to step into this expanding digital sphere.

Pirjo-Leena Forsström is a Development Director at CSC – IT Centre for Science, Finland.


Further information:

Finto: https://finto.fi/en/
The Helsinki Term Bank for the Arts and Sciences: https://tieteentermipankki.fi/wiki/Termipankki:Etusivu/en
Creative Commons licences: https://creativecommons.fi/lisenssit/
FAIR principles: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792175/
Airi Salminen: “Metatiedot organisaatioiden sisällönhallinnassa”. Published in Lehtinen, A., Salminen, A., Nurmeksela, R., Metadata support for the information management in the Finnish legislative process. RASKE2 project second preliminary report (pp. 4–13). Finnish Parliamentary Office publication 7/2005. Helsinki: Parliamentary Office.