BELTRANS data model specification

Throughout the document, the following terminology is used (sorted alphabetically).

Belgian Bibliography

The Belgian Bibliography has existed since 1875 and has been published online since 1998. It consists of all publications published in Belgium or publications published abroad by Belgian authors who are domiciled in Belgium. The bibliography currently consists of ten main categories, such as "200 Theology. Religions." and "800 Literature. History and literary criticism."

BELTRANS genres

These genres from the Belgian Bibliography are the main focus of the BELTRANS PhD students. Genres with the following prefixes are included: 81, 83, 84, 85, 86, 900, 92, 93, 95, 96, 97. This covers, for instance, "810 Poetry." and "850 Comics."

Correlation list

A spreadsheet in which a human validator specifies one entity per row and indicates local identifiers of this entity in different columns. I.e. the human validator correlates the different local records by aligning their identifiers in a single row.

Contributor

A person or organization that contributed to the creation of a translation or an original. We use MARC relator codes to indicate the role that a contributor had in the creation. See also the nationality filter, which is applied based on the assigned roles.

In-house translation

A specific type of Translation that was created by an often unknown Translator internal to the publishing process (e.g. an employee of an organizational Publisher).

Literal value

According to the definition in the RDF standard, a literal is used for values such as strings, numbers and dates. It is thus not a URI. Examples are "Brussels", "42" or "2024-01-01". It can be a language-tagged string like "Brussels"@en or "Bruxelles"@fr, or be described with a data type like "2024-01-01"^^xsd:date.

Named-Graph

The context URI of an RDF triple, i.e. a fourth component which makes an RDF triple an RDF quad. Among other things, this makes it possible to select several RDF triples with the same context URI, thus "within the same named graph".

Nationality filter

Within BELTRANS we are interested in translations in which Belgians were involved. The nationality filter applies if contributors with one of the following MARC roles are identified as Belgian (the MARC code is provided in brackets): authors (aut), scenarists (sce), illustrators (ill) or publishing directors (pbd).

Original

A book that was published by a Publisher in a specific Source Language and that was not published before in other languages.

PROV agent

According to PROV-DM, an agent is something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.

PROV activity

According to PROV-DM, an activity is something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, modifying, relocating, using, or generating entities.

PROV entity

According to PROV-DM, an entity is a physical, digital, conceptual, or other kind of thing with some fixed aspects; entities may be real or imaginary.

Publisher

An organization or an individual that makes information or other forms of creative works available to the public for sale or free of charge, i.e. an organization or an individual that publishes information or other forms of creative works.

RDF Triple

A statement with a subject, predicate and object, where the subject and predicate are Uniform Resource Identifiers (URIs) and the object can be a URI or a Literal value. See ...

RDF Quad

A statement with a context, subject, predicate and object, where the context, subject and predicate are Uniform Resource Identifiers (URIs) and the object can be a URI or a Literal value. In the following example, the definition of two books is in the named graph ex:kbr, whereas additional location information is in the named graph ex:geo.

ex:book1 a schema:CreativeWork ex:kbr .
ex:book2 a schema:CreativeWork ex:kbr .
ex:book1 schema:locationCreated ex:brussels ex:geo .
ex:brussels a schema:Place ex:geo .
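
The selection by context URI described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the BELTRANS tooling: the quads are written as plain string tuples, and the helper names triples_in_graph and graphs_describing are hypothetical.

```python
# The four quads from the example above, written as
# (subject, predicate, object, graph) tuples of plain strings.
QUADS = [
    ("ex:book1", "rdf:type", "schema:CreativeWork", "ex:kbr"),
    ("ex:book2", "rdf:type", "schema:CreativeWork", "ex:kbr"),
    ("ex:book1", "schema:locationCreated", "ex:brussels", "ex:geo"),
    ("ex:brussels", "rdf:type", "schema:Place", "ex:geo"),
]


def triples_in_graph(quads, graph):
    """Return the triples whose context (fourth component) equals `graph`."""
    return [(s, p, o) for (s, p, o, g) in quads if g == graph]


def graphs_describing(quads, subject):
    """Return the set of named graphs containing statements about `subject`."""
    return {g for (s, _p, _o, g) in quads if s == subject}


kbr_triples = triples_in_graph(QUADS, "ex:kbr")
book1_graphs = graphs_describing(QUADS, "ex:book1")
print(kbr_triples)
print(sorted(book1_graphs))
```

Statements about ex:book1 are spread over both graphs, which is the situation the multi-graph serialization relies on.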

RML

The RDF Mapping Language (RML) is a mapping language defined to express customized mapping rules from heterogeneous data structures and serializations to the RDF data model.

Source Language

The language a book was originally published in, i.e. the language of an Original.

Target Language

The language of a published book that was translated by a Translator from a Source Language, i.e. the language of a Translation.

Translation

A book that was published by a Publisher and that is the result of a translation activity, i.e. a known or unknown Translator was responsible for translating the Original book from a Source Language to a Target Language.

Translator

Someone who translates a (published) book from a Source Language to a Target Language, usually a person. This person might be known or unknown; the latter is often the case for In-house translations, where no translator is indicated in the published book or the book metadata.

UUID

A Universally Unique Identifier (UUID) is a 128-bit number, written as 32 hexadecimal digits in five blocks separated by hyphens, that is unique for all practical purposes. An example is ea8c2233-8694-4e13-998a-c3592eddad5f. The third block starts with 4, indicating that this is a UUID of version 4. In version 4, all bits are random except for the version digit and the variant bits at the start of the fourth block. Other versions use the blocks differently: in version 1, for instance, the first three blocks are based on a timestamp, the fourth block contains the clock sequence and the last block is based on the MAC address of the machine that generated the UUID.
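
As an illustration, the following sketch uses Python's standard uuid module to generate a version 4 UUID and to inspect the example identifier given above. The variable names are chosen for this example only; the document does not state which tool was used to create BELTRANS identifiers.

```python
import uuid

# Generate a random (version 4) UUID with the standard library.
new_id = uuid.uuid4()

# Parse the example identifier from the definition above.
example = uuid.UUID("ea8c2233-8694-4e13-998a-c3592eddad5f")

# The version is encoded in the first digit of the third block.
print(new_id.version, example.version)

# Without hyphens, a UUID has 32 hexadecimal digits.
print(len(example.hex))

# The variant is encoded in the leading bits of the fourth block.
print(example.variant == uuid.RFC_4122)
```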

alternate name

Authorities (such as persons or organizations) may have several spellings. Next to the preferred-name spelling, alternate names record other possible spellings. This makes them useful for Information Retrieval.

authority

This term from Information science, often used in the library domain, specifies a record within a controlled vocabulary. Specifically it covers "authorized forms of names, subjects and subject subdivisions" [MARC21Aut]. Historically, authorities were used as a uniform way to spell something and hence served as a form of identifier.

data minimization

This is a global principle about processing only the necessary amount of personal data. Within Europe it is a fundamental privacy principle as part of the General Data Protection Regulation ([GDPR]).

data-subject

According to the GDPR: ...

entity

Todo: definition of an entity

ISNI

The International Standard Name Identifier (ISO 27729) is used to uniquely identify persons and organisations involved in the creation, production, management and distribution of cultural content. ISNI is a unique and permanent 16-digit number. KBR, the main coordinating partner of MetaBelgica, is an ISNI registration agency and is empowered to assign new ISNI identifiers.
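
The last of the 16 characters of an ISNI is a check character computed with the ISO 7064 MOD 11-2 scheme, so it can also be the letter X. The following sketch shows how such a check could be implemented. The function names are hypothetical, and the sample value is used purely as test input for the checksum.

```python
def isni_check_character(first_15_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for ch in first_15_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_isni(value: str) -> bool:
    """Check length, character set and check character of an ISNI string."""
    compact = value.replace(" ", "").upper()
    if len(compact) != 16 or not compact.isascii():
        return False
    if not compact[:15].isdigit():
        return False
    return isni_check_character(compact[:15]) == compact[15]


sample = "0000 0001 2103 2683"  # sample input, spaces are ignored
print(is_valid_isni(sample))
```

Such a check only detects typing errors. It does not tell whether the ISNI was actually assigned by a registration agency.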

OWA

The Open World Assumption (OWA) specifies that ...

property

Todo: definition of a property

property instance

The usage of a property on an entity, i.e. the value of the property. For example, if John is a person entity, Ghent a city entity and place of birth a property, then the statement that John was born in Ghent is the property instance.

preferred-name

Authorities (such as persons or organizations) may have several spellings. Next to zero or more alternate names, preferred-names are the historically uniform way to denote the name in authority control. For modern web-based systems, a single preferred name (per language) can also serve as "the" label of a record.

This section provides an overview of a translation and focuses on the following aspects and links to other concepts: activities, contributors, originals and locations.

The following figure provides an overview of the different concepts and how they relate to each other. In the remainder of the section we briefly discuss the underlying structure and in the next section we zoom into the different aspects of the data model.

Todo: general picture, possibly with different colors, where each color is then further specified in another section

We use or extend terms from the PROV ontology (PROV-O), which is based on the PROV data model (PROV-DM). Moreover, the BELTRANS data model applies the general principles of PROV, namely the distinction between agents, activities and entities.

A PROV agent is the overall concept that we use for actors. Concretely we use the following two classes of schema.org to indicate such agents:

schema:Person
schema:Organization

We use the following PROV activities, which we defined as subclasses of prov:Activity.

btm:TranslationActivity: used to specify the creation of a translation based on an original and the related contributor roles.
btm:CorrelationActivity: Within BELTRANS, data from different sources are integrated via common identifiers (e.g. two book records with a common ISBN). However, whenever there was no common identifier or we could not alter one of the data sources, we used correlation lists. We use this class to indicate that certain entities in our data come from such a manually curated spreadsheet.
btm:CorrelationRemovalActivity: similar to btm:CorrelationActivity, but such activities indicate which manually curated entities should be removed.

We use the following PROV entities:

Translations
Originals
Works

The following sections specify which concrete classes we use to indicate such entities.

We sometimes have to make statements about something based on a certain motivation. For this we make use of the W3C Web Annotations standard.

Translations are one of the key concepts of the corpus. In the following we describe their properties and provide examples.

Types of translations

We reuse the generic class schema:CreativeWork to declare translations. Furthermore, we use the following self-defined classes to describe specific subsets of translations.

Note

Initially we used the property schema:isPartOf to denote specific subsets, for example ex:book schema:isPartOf btid:beltransCorpus. However, applications such as SAMPO-UI require classes to specify what is shown in a user interface perspective.

btm:BeltransTranslation (schema:isPartOf btid:beltransCorpus)
- Instances of this class are translations between Dutch and French, published between 1970 and 2020, for which at least one contributor is of Belgian nationality (see nationality filter for more details).
btm:BeltransGenreTranslation (schema:isPartOf btid:beltransGenre)
- Instances of this class are translations that are assigned one of the BELTRANS genres.
btm:MultilingualManifestation

One can use a combination of these classes to query different subsets, for example all translations of a BELTRANS genre that also pass the nationality filter (instances of both btm:BeltransTranslation and btm:BeltransGenreTranslation).

The main difference is that in the project we mainly focused on instances of btm:BeltransTranslation, with special attention to instances of btm:BeltransGenreTranslation. This means that instances of those classes underwent more manual refinements.
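
Combining classes to select a subset amounts to a set intersection. The sketch below illustrates this on a handful of invented type assertions; the resources ex:book1 to ex:book3 and the helper names are hypothetical, and in practice one would express the same selection in a SPARQL query.

```python
# Hypothetical type assertions as (resource, class) pairs of plain strings.
TYPE_ASSERTIONS = [
    ("ex:book1", "btm:BeltransTranslation"),
    ("ex:book1", "btm:BeltransGenreTranslation"),
    ("ex:book2", "btm:BeltransTranslation"),
    ("ex:book3", "btm:BeltransGenreTranslation"),
]


def instances_of(assertions, cls):
    """Return all resources that are declared to be instances of `cls`."""
    return {resource for (resource, c) in assertions if c == cls}


def instances_of_all(assertions, classes):
    """Return the resources that are instances of every class in `classes`."""
    sets = [instances_of(assertions, c) for c in classes]
    return set.intersection(*sets) if sets else set()


both = instances_of_all(
    TYPE_ASSERTIONS,
    ["btm:BeltransTranslation", "btm:BeltransGenreTranslation"],
)
print(sorted(both))
```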

Identifiers of translations

Each translation has a unique UUID identifier. As we integrated data from different data sources, a translation may have several local identifiers. Additionally, books are usually identified by the International Standard Book Number (ISBN), either in its old variant, the 10-digit ISBN-10, or the modern 13-digit ISBN-13. The unique BELTRANS ID is indicated with the property dcterms:identifier. Other identifiers are indicated by using the BIBFRAME ontology.

Example 1

#
# A book linking to different instances of bf:Identifier
# as well as a direct link to an ISBN-10
#
ex:book a schema:CreativeWork ;
        dcterms:identifier "ea8c2233-8694-4e13-998a-c3592eddad5f" ;
        bibo:isbn10 "..." ;
        bf:identifiedBy ex:bookKBR ;
        bf:identifiedBy ex:bookKB ;
        bf:identifiedBy ex:bookBnF ;
        bf:identifiedBy ex:bookUnesco ;
        bf:identifiedBy ex:bookISBN10 .

ex:bookKBR a bf:Identifier ;
           rdfs:label "KBR" ;
           rdf:value "..." .

ex:bookKB a bf:Identifier ;
          rdfs:label "KB" ;
          rdf:value "..." .

ex:bookBnF a bf:Identifier ;
           rdfs:label "BnF" ; 
           rdf:value "..." .

ex:bookUnesco a bf:Identifier ;
              rdfs:label "Unesco" ;
              rdf:value "..." .

ex:bookISBN10 a bf:Identifier ;
              rdfs:label "ISBN-10" ;
              rdf:value "..." .

Note

The BIBFRAME ontology also has specific subclasses of bf:Identifier, such as bf:Isni, which do not require a dedicated rdfs:label (as the name is implied by the specific class). However, not all identifiers have a specific subclass. Because we want a generic SPARQL query to obtain all identifiers and their names, we always indicate the generic class bf:Identifier as well as the related rdfs:label and rdf:value.
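
The benefit of always using the generic class together with rdfs:label and rdf:value is that one lookup covers every identifier. The following sketch mimics such a generic lookup over triples written as plain string tuples. The identifier values are placeholders invented for this example, and the helper names are hypothetical.

```python
# A subset of Example 1 as (subject, predicate, object) tuples.
# The rdf:value strings are placeholders, not real identifiers.
TRIPLES = [
    ("ex:book", "bf:identifiedBy", "ex:bookKBR"),
    ("ex:book", "bf:identifiedBy", "ex:bookBnF"),
    ("ex:bookKBR", "rdf:type", "bf:Identifier"),
    ("ex:bookKBR", "rdfs:label", "KBR"),
    ("ex:bookKBR", "rdf:value", "placeholder-kbr-id"),
    ("ex:bookBnF", "rdf:type", "bf:Identifier"),
    ("ex:bookBnF", "rdfs:label", "BnF"),
    ("ex:bookBnF", "rdf:value", "placeholder-bnf-id"),
]


def objects(triples, subject, predicate):
    """Return all objects of triples with the given subject and predicate."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]


def identifiers_of(triples, resource):
    """Map identifier names (rdfs:label) to values (rdf:value) of a resource."""
    result = {}
    for node in objects(triples, resource, "bf:identifiedBy"):
        if "bf:Identifier" not in objects(triples, node, "rdf:type"):
            continue  # only nodes typed with the generic class are considered
        labels = objects(triples, node, "rdfs:label")
        values = objects(triples, node, "rdf:value")
        if labels and values:
            result[labels[0]] = values[0]
    return result


book_identifiers = identifiers_of(TRIPLES, "ex:book")
print(book_identifiers)
```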

We indicate one or more genres of translations with the property schema:about and a URI that represents a genre of the Belgian Bibliography. The URIs are built based on the internal KBR LEXICON code, for example LEXICON_000000090 for "850 Comics."

In our PROV-O based data model, translations are the result of a translation activity.

Todo

A translation links to contributors in three redundant ways:

Direct links between translations and contributors are made with RDF properties based on MARC relator codes.
Direct links via schema.org properties exist for a small subset of roles, namely schema:author, schema:translator and schema:publisher.
PROV-O annotations in the form of prov:Association instances with the property prov:hadRole and the related translation activity.

Note

Currently all three ways of indicating contributors are explicitly part of the RDF data. These annotations were created with a single RML mapping file. Another option would be to only indicate the role using the third technique and derive the first two automatically via SPARQL INSERT queries or, at query time, with reasoning rules.

Example 2

#
# Definition of a translation with direct links to contributor with role-specific attributes
#
ex:book a schema:CreativeWork ;
        prov:wasGeneratedBy ex:bookTranslationActivity ;
        schema:author ex:person1 ;
        marcrel:aut ex:person1 ;
        marcrel:ill ex:person2 .

#
# The translation activity linking to one association per contributor
#
ex:bookTranslationActivity a prov:Activity ;
                           prov:generated ex:book ;
                           prov:qualifiedAssociation ex:bookPerson1Assoc ;
                           prov:qualifiedAssociation ex:bookPerson2Assoc .

#
# Associations indicating in which role a contributor contributed to an activity
#
ex:bookPerson1Assoc a prov:Association ;
                    prov:agent ex:person1 ;
                    prov:hadRole btid:role_aut ;
                    prov:activity ex:bookTranslationActivity .

ex:bookPerson2Assoc a prov:Association ;
                    prov:agent ex:person2 ;
                    prov:hadRole btid:role_ill ;
                    prov:activity ex:bookTranslationActivity .
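
The alternative mentioned in the note, deriving the direct links from the associations, can be sketched as follows. This is an illustration under several assumptions: each association names its contributor (here under the key "agent"), the MARC relator code can be read from the suffix of the role URI (btid:role_aut gives aut), and the function name is hypothetical. In the project itself, all three representations were created by an RML mapping.

```python
# Associations of Example 2 as dictionaries. The "agent" key is an
# assumption: it names the contributor an association refers to.
ASSOCIATIONS = [
    {"activity": "ex:bookTranslationActivity",
     "agent": "ex:person1", "role": "btid:role_aut"},
    {"activity": "ex:bookTranslationActivity",
     "agent": "ex:person2", "role": "btid:role_ill"},
]

# Which entity each activity generated (prov:generated).
GENERATED = {"ex:bookTranslationActivity": "ex:book"}

# Roles that additionally have a dedicated schema.org property.
SCHEMA_PROPERTIES = {
    "aut": "schema:author",
    "trl": "schema:translator",
    "pbl": "schema:publisher",
}


def derive_direct_links(associations, generated):
    """Derive marcrel and schema.org triples from role associations."""
    triples = set()
    for assoc in associations:
        book = generated.get(assoc["activity"])
        if book is None:
            continue  # activity without a generated entity
        code = assoc["role"].rsplit("_", 1)[-1]  # assumed naming convention
        triples.add((book, "marcrel:" + code, assoc["agent"]))
        if code in SCHEMA_PROPERTIES:
            triples.add((book, SCHEMA_PROPERTIES[code], assoc["agent"]))
    return triples


direct_links = derive_direct_links(ASSOCIATIONS, GENERATED)
for triple in sorted(direct_links):
    print(triple)
```

With the data of Example 2, this yields the same three direct links that the example states explicitly on ex:book.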

Location information is stored either as literal values or, after an enrichment step, as entities identified by URIs.

btm:sourceLanguage

A translation has the following literal properties that we did not yet discuss in detail:

schema:name
schema:datePublished
bibo:isbn10
bibo:isbn13
rdfs:comment
rdfs:label

Persons and Organizations

Mention the different roles and how roles are assigned based on specific properties or via general prov:role association

btm:hasNameVariant
btm:hasPseudonym
btm:isPseudonymOf

How geo information is encoded

btm:isoCode
btm:matchCandidate

During the project we used several named-graphs to store the data. However, this makes SPARQL queries more complex and hence we also provide a serialization in a single graph for easier accessibility.

During the project we used several named graphs to store the data, i.e. some statements about a resource are stored in one graph and additional statements about the same resource in other graphs. This allowed us to perform update operations on particular subsets of the statements.

We used the following named graphs to store (generated) RDF per data source.

http://master-data
http://isni-sru
http://kbr-syracuse
http://kbr-linked-authorities
http://kbr-originals
http://bnf-originals
http://bnf-publications
http://kb-publications
http://kb-linked-authorities
http://kb-originals
http://unesco

The following named graphs contain the integrated data, i.e. the BELTRANS database with dedicated records, referring to source records in the respective data source named graphs.

http://beltrans-manifestations
http://beltrans-contributors
http://beltrans-geo
http://beltrans-originals
http://beltrans-works

For example, one named graph stores properties with textual location information, such as ex:book schema:locationCreated "Brussel". A Python script created structured reference data from these literals, which we stored in another named graph, e.g. ex:book schema:locationCreated ex:brussel . ex:brussel rdf:type schema:Place ; rdfs:label "Brussel" . In a SPARQL query, one can select which schema:locationCreated values are of interest by specifying one of the named graphs.
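
The enrichment step described above could look roughly like the following sketch. It is not the script used in the project: the function names, the way the place URI is derived from the label and the choice of http://beltrans-geo as target graph are all assumptions made for this illustration.

```python
import re


def slugify(label):
    """Turn a place label into a lowercase local name, e.g. 'Brussel' -> 'brussel'."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def enrich_location(subject, label, graph="http://beltrans-geo", prefix="ex:"):
    """Create quads that replace a textual location with a schema:Place entity.

    The target graph and the URI pattern are assumptions of this sketch.
    """
    place = prefix + slugify(label)
    return [
        (subject, "schema:locationCreated", place, graph),
        (place, "rdf:type", "schema:Place", graph),
        (place, "rdfs:label", label, graph),
    ]


enriched = enrich_location("ex:book", "Brussel")
for quad in enriched:
    print(quad)
```

A real enrichment would typically match labels against a gazetteer instead of deriving URIs from the text, so that spelling variants of the same place end up as one entity.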

The consolidated corpus data are available in a single graph. This required a data migration from the richer multi-graph setup to a single graph in a semantically sound way. For example, if we simply copied everything over, querying schema:locationCreated values from our corpus in a single graph would return literal values such as "Brussels" as well as instances of schema:Place. Hence we have to ensure uniform query results, for example by using different properties or by migrating only one of the representations, in this example either the literal or the object.
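
One of the options named above, migrating only one representation of a property, can be sketched as follows. The graph names ex:source and ex:geo, the sample quads and the function name are hypothetical, and the sketch does not claim to reproduce the actual BELTRANS migration.

```python
# Hypothetical multi-graph data as (subject, predicate, object, graph) tuples.
# Literals and URIs are both plain strings in this simplified sketch.
QUADS = [
    ("ex:book", "rdf:type", "schema:CreativeWork", "ex:source"),
    ("ex:book", "schema:locationCreated", "Brussel", "ex:source"),
    ("ex:book", "schema:locationCreated", "ex:brussel", "ex:geo"),
    ("ex:brussel", "rdf:type", "schema:Place", "ex:geo"),
    ("ex:brussel", "rdfs:label", "Brussel", "ex:geo"),
]


def migrate_to_single_graph(quads, prop, keep_graph):
    """Flatten quads to triples, keeping `prop` only from `keep_graph`."""
    triples = []
    for (s, p, o, g) in quads:
        if p == prop and g != keep_graph:
            continue  # drop the other representation of this property
        if (s, p, o) not in triples:
            triples.append((s, p, o))
    return triples


single_graph = migrate_to_single_graph(QUADS, "schema:locationCreated", "ex:geo")
locations = [o for (s, p, o) in single_graph if p == "schema:locationCreated"]
print(locations)
```

Keeping only the values from one graph gives uniform results for the property, at the price of losing location information for resources that were never enriched.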

Abstract

Status of This Document

1. Introduction

1.1 Conformance

1.2 Open World Assumption

2. Terminology

3. Data model

3.1 Overview

3.1.1 PROV-O model

3.1.1.1 Agents

3.1.1.2 Activities

3.1.1.3 Entities

3.1.2 Web annotations

3.2 Translations

3.2.1 Genre of translations

3.2.2 Creation of translations

3.2.3 Links to contributors

3.2.4 Links to locations

3.2.5 ??

3.2.6 Other literal values of translations

3.3 Contributors

3.4 Location

3.5 Originals

3.6 Works

4. Corpus Serialization

4.1 Multi-graph serialization

4.2 Single-graph serialization

A. References

A.1 Normative references