BELTRANS data model specification

Unofficial Draft

More details about this document
Latest published version:
https://www.w3.org/beltrans-data-model/
Latest editor's draft:
https://w3id.org/beltrans/data-model/spec/
Editor:
Sven Lieber (Royal Library of Belgium (KBR))
Author:
Sven Lieber (Royal Library of Belgium (KBR))
This Version
https://w3id.org/metabelgica/data-model/spec/20260206/
Previous Version
https://w3id.org/metabelgica/data-model/spec/undefined/

Abstract

This document defines the data model of BELTRANS.

Status of This Document

This document is a draft of a potential specification. It has no official standing of any kind and does not represent the support or consensus of any standards organization.

This document was created as part of the BELSPO-funded BRAIN2.0 research project BELTRANS.

1. Introduction

The project Intra-Belgian literary translations since 1970 (BELTRANS) studies the untold history of literary translation flows in Belgium between French and Dutch in the period 1970-2020.

As part of the research activities, a corpus of bibliographic and authority metadata was created based on various data sources. For semantic interoperability we used the Resource Description Framework (RDF) to integrate the data.

In essence, we use concepts of the W3C Provenance Ontology (PROV-O) as a basis and, as much as possible, reuse RDF terms from the common schema.org vocabulary and from more specialized vocabularies such as the BIBFRAME ontology. Where necessary, we defined our own terms.

This document specifies the RDF terms used, provides examples for the key concepts, and describes how these terms were serialized in one or more RDF named graphs.

1.1 Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words MAY, MUST, MUST NOT, OPTIONAL, RECOMMENDED, REQUIRED, SHALL, SHALL NOT, SHOULD, and SHOULD NOT in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

Conformance requirements are expressed with a combination of descriptive assertions and [RFC2119] terminology.

1.2 Open World Assumption

This data model follows the Open World Assumption (OWA), which is common on the web and in Linked Open Data: a missing property does not mean that no value exists, only that the value is not known. For example, when we do not know the year of publication of a book, we do not state "unknown" or something similar; instead, we omit the property altogether.
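
As a minimal sketch of this convention, a record with an unknown publication year omits the property, and a lookup returns `None` rather than a placeholder string. The dictionary layout, the sample values and the helper name `publication_year` are illustrative assumptions and not part of the data model.

```python
from typing import Optional

# Hypothetical in-memory records; keys mirror RDF properties used in this document.
book_with_year = {"schema:name": "Example title", "schema:datePublished": "1985"}
book_without_year = {"schema:name": "Another title"}  # year unknown: property omitted


def publication_year(record: dict) -> Optional[str]:
    """Return the publication year if it is stated, otherwise None (not known)."""
    return record.get("schema:datePublished")
```

A consumer of the data therefore has to treat an absent value as "not known", not as "does not exist".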

2. Terminology

Throughout the document, the following terminology is used (sorted alphabetically).

Belgian Bibliography
The Belgian Bibliography has existed since 1875 and has been published online since 1998. It consists of all publications published in Belgium, as well as publications published abroad by Belgian authors who are domiciled in Belgium. The bibliography currently consists of ten main categories; examples are "200 Theology. Religions." and "800 Literature. History and literary criticism."
BELTRANS genres
These genres from the Belgian Bibliography are the main focus of the BELTRANS PhD students. Genres whose code starts with one of the following prefixes are included: 81, 83, 84, 85, 86, 900, 92, 93, 95, 96, 97. Examples are "810 Poetry." and "850 Comics."
Correlation list
A spreadsheet in which a human validator specifies one entity per row and indicates local identifiers of this entity in different columns. I.e. the human validator correlates the different local records by aligning their identifiers in a single row.
Contributor
A person or organization that contributed to the creation of a translation or an original. We use MARC relator codes to indicate the role that a contributor had in the creation. See also the nationality filter, which is applied on the basis of the assigned roles.
In-house translation
A specific type of Translation that was created by an often unknown Translator internal to the publishing process (e.g. an employee of an organizational Publisher).
Literal value
According to the definition in the RDF standard, a literal is used for values such as strings, numbers and dates. It is thus not a URI. Examples are "Brussels", "42" or "2024-01-01". It can be a language-tagged string like "Brussels"@en or "Bruxelles"@fr, or it can carry a datatype, as in "2024-01-01"^^xsd:date.
Named-Graph
The context URI of an RDF triple, i.e. a fourth component which makes an RDF triple an RDF quad. Among other things, this allows several RDF triples with the same context URI to be selected together, i.e. those "within the same named graph".
Nationality filter
Within BELTRANS we are interested in translations in which Belgians were involved. The nationality filter applies if contributors with one of the following MARC roles are identified as Belgian (the MARC code is given in brackets): authors (aut), scenarists (sce), illustrators (ill) or publishing directors (pbd).
Original
A book that was published by a Publisher in a specific Source Language and that was not published before in other languages.
PROV agent
According to PROV-DM, an agent is something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.
PROV activity
According to PROV-DM, an activity is something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, modifying, relocating, using, or generating entities.
PROV entity
According to PROV-DM, an entity is a physical, digital, conceptual, or other kind of thing with some fixed aspects; entities may be real or imaginary.
Publisher
An organization or an individual that makes information or other forms of creative works available to the public for sale or free of charge, i.e. an organization or an individual that publishes information or other forms of creative works.
RDF Triple
A statement with a subject, predicate and object, where the subject and predicate are Uniform Resource Identifiers (URIs) and the object can be a URI or a Literal value. See ...
RDF Quad
A statement with a context, subject, predicate and object, where the context, subject and predicate are Uniform Resource Identifiers (URIs) and the object can be a URI or a Literal value. In the following example, the definition of two books is in the named graph ex:kbr, whereas additional location information is in the named graph ex:geo.
ex:book1 a schema:CreativeWork ex:kbr .
ex:book2 a schema:CreativeWork ex:kbr .
ex:book1 schema:locationCreated ex:brussels ex:geo .
ex:brussels a schema:Place ex:geo .
RML
The RDF Mapping Language (RML) is a mapping language defined to express customized mapping rules from heterogeneous data structures and serializations to the RDF data model.
Source Language
The language a book was originally published in, i.e. the language of an Original.
Target Language
The language of a published book that was translated by a Translator from a Source Language, i.e. the language of a Translation.
Translation
A book that was published by a Publisher and that is the result of a translation activity, i.e. a known or unknown Translator was responsible for translating the Original book from a Source Language into a Target Language.
Translator
Someone who translates a (published) book from a Source Language into a Target Language, usually a person. This person might be known or unknown. The latter is often the case for In-house translations, where no translator is indicated in the published book or in the book metadata.
UUID
A Universally Unique Identifier (UUID) is a 128-bit number, written as 32 hexadecimal digits in five blocks, that is unique for practical purposes. An example is ea8c2233-8694-4e13-998a-c3592eddad5f. The third block starts with 4, indicating that this is a UUID of version 4. In version 4, all digits are generated randomly, apart from the version digit and the variant bits at the start of the fourth block. In version 1, by contrast, the first three blocks are based on a timestamp, the fourth block contains the clock sequence, and the last block is based on the MAC address of the machine that generated the UUID.
alternate name
Authorities (such as persons or organizations) may have several spellings. Next to the preferred-name spelling, alternate names record other possible spellings. This makes them useful for Information Retrieval.
authority
This term from information science, often used in the library domain, denotes a record within a controlled vocabulary. Specifically, it covers "authorized forms of names, subjects and subject subdivisions" [MARC21Aut]. Authorities were historically used as a uniform way to spell something and are hence a form of identifier.
data minimization
This is a global principle about processing only the necessary amount of personal data. Within Europe it is a fundamental privacy principle as part of the General Data Protection Regulation ([GDPR]).
data-subject
According to the GDPR: ...
entity
Todo: definition of an entity
ISNI
The International Standard Name Identifier (ISO 27729) is used to uniquely identify persons and organisations involved in the creation, production, management and distribution of cultural content. ISNI is a unique and permanent 16-digit number. KBR, the main coordinating partner of MetaBelgica, is an ISNI registration agency and is empowered to assign new ISNI identifiers.
OWA
The Open World Assumption (OWA) specifies that ...
property
Todo: definition of a property
property instance
The usage of a property on an entity, i.e. the value of the property. If John is a person entity, Ghent a city entity and place of birth a property, then the statement that John was born in Ghent is the property instance.
preferred-name
Authorities (such as persons or organizations) may have several spellings. Next to zero or more alternate names, preferred-names are the historically uniform way to denote the name in authority control. For modern web-based systems, a single preferred name (per language) can also serve as "the" label of a record.
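
The UUID definition in the terminology above can be illustrated with Python's standard `uuid` module. This is a sketch only: the variable names are illustrative, and how BELTRANS identifiers were actually generated is not specified in this document.

```python
import uuid

# The example identifier from the UUID definition above.
example = uuid.UUID("ea8c2233-8694-4e13-998a-c3592eddad5f")

# 32 hexadecimal digits, with the version number encoded in the third block.
hex_digit_count = len(example.hex)
version = example.version

# A freshly generated random (version 4) UUID has the same structure.
new_id = uuid.uuid4()
```

Parsing the string with `uuid.UUID` also rejects malformed identifiers, which can be useful when validating values of dcterms:identifier.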

3. Data model

This section provides an overview of a translation and then focuses on the following aspects and their links to other concepts: activities, contributors, originals and locations.

3.1 Overview

The following figure provides an overview of the different concepts and how they relate to each other. In the remainder of the section we briefly discuss the underlying structure and in the next section we zoom into the different aspects of the data model.

3.1.1 PROV-O model

Terms from the PROV ontology (PROV-O), which is based on the PROV data model (PROV-DM), are used or extended. Moreover, the general principles of PROV are applied throughout the BELTRANS data model: we distinguish agents, activities and entities.

3.1.1.1 Agents

A PROV agent is the overall concept that we use for actors. Concretely we use the following two classes of schema.org to indicate such agents:

  • schema:Person
  • schema:Organization

3.1.1.2 Activities

We use the following PROV activities, which we defined as subclasses of prov:Activity.

  • btm:TranslationActivity: This class allows us to specify the creation of a translation based on an original, together with the related contributor roles.
  • btm:CorrelationActivity: Within BELTRANS, data from different sources are integrated via common identifiers (e.g. two book records with a common ISBN). However, whenever there was no common identifier or we could not alter one of the data sources, we used correlation lists. We use this class to indicate that certain entities in our data come from such a manually curated spreadsheet.
  • btm:CorrelationRemovalActivity: Similar to the class above, but such activities indicate which manually curated entities should be removed.
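
The role of a correlation list can be sketched as follows. This is an illustration under stated assumptions, not the project's actual tooling: the column names, the entity names and all identifier values are made up, and the function name `local_id_index` is hypothetical.

```python
import csv
import io

# Hypothetical correlation list: one entity per row, one column per data source.
# Column names and identifier values are made up for illustration.
CORRELATION_CSV = """entity,KBR,BnF,Unesco
book-1,kbr-001,bnf-123,
book-2,kbr-002,,unesco-9
"""


def local_id_index(csv_text: str) -> dict:
    """Map each (source, local identifier) pair to the entity named in its row."""
    index = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for source, local_id in row.items():
            if source != "entity" and local_id:  # skip empty cells
                index[(source, local_id)] = row["entity"]
    return index


index = local_id_index(CORRELATION_CSV)
```

With such an index, records from different sources that share no common identifier can still be resolved to the same entity, because the human validator placed their local identifiers in the same row.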
3.1.1.3 Entities

We use the following PROV entities:

  • Translations
  • Originals
  • Works

The following sections specify which concrete classes we use to indicate such entities.

3.1.2 Web annotations

We sometimes have to make statements about something based on a certain motivation. For this we make use of the W3C Web Annotations standard.

3.2 Translations

Translations are one of the key concepts of the corpus. In the following, we describe their properties and provide examples.

Types of translations

We reuse the generic class schema:CreativeWork to declare translations. Furthermore, we use the following self-created classes to describe specific subsets of translations.

Note

Initially we used the property schema:isPartOf to denote specific subsets, for example ex:book schema:isPartOf btid:beltransCorpus. However, applications such as SAMPO-UI require classes to specify what is shown in a user interface perspective.

One can use a combination of these classes to query different subsets, for example all translations of a BELTRANS genre that also pass the nationality filter (instances of btm:BeltransTranslation AND btm:BeltransGenreTranslation).

The main difference is that the project mainly focused on instances of btm:BeltransTranslation, with special attention to instances of btm:BeltransGenreTranslation. This means that instances of those classes underwent more manual refinement.
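
Combining classes to select a subset amounts to a set intersection. The following minimal sketch uses made-up type statements; the resources ex:book1 to ex:book3 and the helper `instances_of` are illustrative assumptions.

```python
# Hypothetical rdf:type statements as (subject, class) pairs.
TYPE_STATEMENTS = {
    ("ex:book1", "btm:BeltransTranslation"),
    ("ex:book1", "btm:BeltransGenreTranslation"),
    ("ex:book2", "btm:BeltransTranslation"),
    ("ex:book3", "btm:BeltransGenreTranslation"),
}


def instances_of(class_name: str) -> set:
    """Return all subjects typed with the given class."""
    return {s for s, c in TYPE_STATEMENTS if c == class_name}


# Translations that are instances of both classes.
both = instances_of("btm:BeltransTranslation") & instances_of("btm:BeltransGenreTranslation")
```

In SPARQL, the same selection would be expressed by requiring two rdf:type patterns on the same subject.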

Identifiers of translations

Each translation has a unique UUID identifier. As we integrated data from different data sources, a translation may have additional local identifiers. In addition, books are usually identified by the International Standard Book Number (ISBN), either in its old variant, the 10-digit ISBN-10, or in the modern 13-digit ISBN-13. The unique BELTRANS ID is indicated with the property dcterms:identifier. Other identifiers are indicated using the BIBFRAME ontology.

#
# A book linking to different instances of bf:Identifier
# as well as a direct link to an ISBN-10
#
ex:book a schema:CreativeWork ;
        dcterms:identifier "ea8c2233-8694-4e13-998a-c3592eddad5f" ;
        bibo:isbn10 "..." ;
        bf:identifiedBy ex:bookKBR ;
        bf:identifiedBy ex:bookKB ;
        bf:identifiedBy ex:bookBnF ;
        bf:identifiedBy ex:bookUnesco ;
        bf:identifiedBy ex:bookISBN10 .

ex:bookKBR a bf:Identifier ;
           rdfs:label "KBR" ;
           rdf:value "..." .

ex:bookKB a bf:Identifier ;
          rdfs:label "KB" ;
          rdf:value "..." .

ex:bookBnF a bf:Identifier ;
           rdfs:label "BnF" ;
           rdf:value "..." .

ex:bookUnesco a bf:Identifier ;
              rdfs:label "Unesco" ;
              rdf:value "..." .

ex:bookISBN10 a bf:Identifier ;
              rdfs:label "ISBN-10" ;
              rdf:value "..." .

Note

The BIBFRAME ontology also has specific subclasses of bf:Identifier, such as bf:Isni, which do not require a dedicated rdfs:label because the name is implied by the specific class. However, not all identifiers have a specific subclass. Because we want a single generic SPARQL query to obtain all identifiers and their names, we always indicate the generic class bf:Identifier together with the related rdfs:label and rdf:value.
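
The benefit of the uniform bf:Identifier pattern can be sketched without a triple store. The triples below mirror the example above, but the identifier values are placeholders invented for illustration, and the function `identifiers_of` is hypothetical.

```python
# Hypothetical triples mirroring the example above; identifier values are placeholders.
TRIPLES = [
    ("ex:book", "bf:identifiedBy", "ex:bookKBR"),
    ("ex:book", "bf:identifiedBy", "ex:bookBnF"),
    ("ex:bookKBR", "rdf:type", "bf:Identifier"),
    ("ex:bookKBR", "rdfs:label", "KBR"),
    ("ex:bookKBR", "rdf:value", "placeholder-kbr"),
    ("ex:bookBnF", "rdf:type", "bf:Identifier"),
    ("ex:bookBnF", "rdfs:label", "BnF"),
    ("ex:bookBnF", "rdf:value", "placeholder-bnf"),
]


def identifiers_of(book: str) -> dict:
    """Return {identifier name: identifier value} for one book."""
    def objects(subject, predicate):
        return [o for s, p, o in TRIPLES if s == subject and p == predicate]

    result = {}
    for node in objects(book, "bf:identifiedBy"):
        # Only the generic class is checked, so every identifier type is covered.
        if "bf:Identifier" in objects(node, "rdf:type"):
            for label in objects(node, "rdfs:label"):
                for value in objects(node, "rdf:value"):
                    result[label] = value
    return result
```

Because every identifier carries the generic class, a label and a value, one lookup covers all identifier types without special cases per subclass.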

3.2.1 Genre of translations

We indicate one or more genres of a translation with the property schema:about and a URI that represents a genre of the Belgian Bibliography. The URIs are built from the internal KBR LEXICON code, for example LEXICON_000000090 for "850 Comics."
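
Whether a genre belongs to the BELTRANS genres (see Terminology) can be decided with a prefix test. This is a minimal sketch that assumes genre codes are handled as strings such as "850"; the function name is illustrative, and the mapping between genre codes and LEXICON codes is not covered here.

```python
# Prefixes listed in the "BELTRANS genres" definition in the terminology section.
BELTRANS_GENRE_PREFIXES = ("81", "83", "84", "85", "86", "900", "92", "93", "95", "96", "97")


def is_beltrans_genre(code: str) -> bool:
    """Return True if a Belgian Bibliography genre code starts with a listed prefix."""
    return code.startswith(BELTRANS_GENRE_PREFIXES)
```

Note that `str.startswith` accepts a tuple of prefixes, so no explicit loop is needed.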

3.2.2 Creation of translations

In our PROV-O based data model, translations are the result of a translation activity.

Todo

3.2.5 ??

  • btm:sourceLanguage

3.2.6 Other literal values of translations

A translation has the following literal properties that we did not yet discuss in detail:

  • schema:name
  • schema:datePublished
  • bibo:isbn10
  • bibo:isbn13
  • rdfs:comment
  • rdfs:label

3.3 Contributors

Persons and Organizations

Mention the different roles and how roles are assigned based on specific properties or via general prov:role association

3.4 Location

How geo information is encoded

3.5 Originals

3.6 Works

4. Corpus Serialization

During the project we used several named-graphs to store the data. However, this makes SPARQL queries more complex and hence we also provide a serialization in a single graph for easier accessibility.

4.1 Multi-graph serialization

During the project we used several named graphs to store the data, i.e. some statements about a resource were stored in one graph and additional statements in other graphs. This allowed us to perform update operations on particular subsets of the statements.

We used the following named graphs to store (generated) RDF per data source.

The following named graphs contain the integrated data, i.e. the BELTRANS database with dedicated records, referring to source records in the respective data source named graphs.

For example, one named graph stored properties with textual location information, such as ex:book schema:locationCreated "Brussel". Based on these, a Python script created structured reference data, which we stored in another named graph, e.g. ex:book schema:locationCreated ex:brussel . ex:brussel rdf:type schema:Place ; rdfs:label "Brussel" . In a SPARQL query, one can select which schema:locationCreated values are of interest by specifying one of the named graphs.
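
The effect of selecting a named graph can be sketched with quads held as plain tuples. The graph name ex:source and the helper `objects_in_graph` are illustrative assumptions; ex:geo follows the quad example in the terminology section.

```python
# Hypothetical quads as (subject, predicate, object, graph) tuples.
# Literals are written with surrounding double quotes; graph names are illustrative.
QUADS = [
    ("ex:book", "schema:locationCreated", '"Brussel"', "ex:source"),
    ("ex:book", "schema:locationCreated", "ex:brussel", "ex:geo"),
    ("ex:brussel", "rdf:type", "schema:Place", "ex:geo"),
    ("ex:brussel", "rdfs:label", '"Brussel"', "ex:geo"),
]


def objects_in_graph(subject: str, predicate: str, graph: str) -> list:
    """Return the objects of matching statements, restricted to one named graph."""
    return [o for s, p, o, g in QUADS if (s, p, g) == (subject, predicate, graph)]


literal_location = objects_in_graph("ex:book", "schema:locationCreated", "ex:source")
structured_location = objects_in_graph("ex:book", "schema:locationCreated", "ex:geo")
```

The same property thus yields a literal or a resource depending on the graph that is queried, which corresponds to a GRAPH clause in SPARQL.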

4.2 Single-graph serialization

The consolidated corpus data are available in a single graph. This required a data migration from the richer multi-graph setup to a single graph in a semantically sound way. For example, if we simply copied everything over, querying schema:locationCreated values from the single graph would return literal values such as "Brussel" as well as instances of schema:Place. Hence we have to ensure uniform query results, for example by using different properties or by migrating only one of the representations (in this example, either the literal or the object).
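
One of the two options mentioned above, migrating only one representation, can be sketched as follows. This sketch keeps the structured value and drops the literal; which option was actually chosen for the corpus is not stated here, and the function names and graph names are assumptions.

```python
def is_literal(term: str) -> bool:
    """Treat terms written with surrounding double quotes as literals (sketch convention)."""
    return term.startswith('"')


def migrate_locations(quads: list) -> list:
    """Flatten quads to triples. For subjects that have both a literal and a
    resource value for schema:locationCreated, keep only the resource value."""
    has_resource = {
        s for s, p, o, _ in quads
        if p == "schema:locationCreated" and not is_literal(o)
    }
    triples = []
    for s, p, o, _ in quads:
        if p == "schema:locationCreated" and is_literal(o) and s in has_resource:
            continue  # drop the literal; the structured value is migrated instead
        if (s, p, o) not in triples:
            triples.append((s, p, o))
    return triples


multi_graph = [
    ("ex:book", "schema:locationCreated", '"Brussel"', "ex:source"),
    ("ex:book", "schema:locationCreated", "ex:brussel", "ex:geo"),
    ("ex:brussel", "rdf:type", "schema:Place", "ex:geo"),
    ("ex:brussel", "rdfs:label", '"Brussel"', "ex:geo"),
]
single_graph = migrate_locations(multi_graph)
```

A literal is only dropped when a structured alternative exists, so books for which no place could be resolved keep their textual location.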

A. References

A.1 Normative references

[GDPR]
General Data Protection Regulation. URL: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A02016R0679-20160504
[MARC21Aut]
MARC21 Format for Authority Data. Library of Congress (United States). URL: https://www.loc.gov/marc/authority/
[RFC2119]
Key words for use in RFCs to Indicate Requirement Levels. S. Bradner. IETF. March 1997. Best Current Practice. URL: https://www.rfc-editor.org/rfc/rfc2119
[RFC8174]
Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words. B. Leiba. IETF. May 2017. Best Current Practice. URL: https://www.rfc-editor.org/rfc/rfc8174