Dataset Register

This is the NDE Dataset Register, a service that helps users find and discover datasets.

Institutions (such as cultural heritage organizations) register their dataset descriptions with the NDE Dataset Register using its HTTP API. The Dataset Register builds an index by fetching, validating and periodically crawling dataset descriptions.

The HTTP API is documented at https://datasetregister.netwerkdigitaalerfgoed.nl/api.

See the Dataset Register Demonstrator, a client application for this repository’s HTTP API, for more background information (in Dutch).

Design principles

  1. The application follows modern standards and best practices.
  2. The application uses Linked Data Platform (LDP) for HTTP operations.
  3. The application prefers JSON-LD as the data exchange format.
  4. The application uses established Linked Data vocabularies, including Schema.org and DCAT.

Getting started

Validate dataset descriptions

Dataset descriptions must adhere to the Requirements for Datasets. You can check validity using the validate API call.

Submit dataset descriptions

To submit your dataset descriptions to the Dataset Register, use the datasets API call. A URL must be on an allowed domain before it can be added to the Register (see Allow list).

Search dataset descriptions

You can retrieve dataset descriptions registered by yourself and others from our triple store’s web interface.

Alternatively, query the SPARQL endpoint at https://triplestore.netwerkdigitaalerfgoed.nl/repositories/registry directly. For example, using Comunica:

comunica-sparql sparql@https://triplestore.netwerkdigitaalerfgoed.nl/repositories/registry 'select * {?s a <http://www.w3.org/ns/dcat#Dataset> . ?s ?p ?o . } limit 100'

Or curl:

curl -H 'Accept: application/sparql-results+json' --data-urlencode 'query=select * {?s a <http://www.w3.org/ns/dcat#Dataset> . ?s ?p ?o . } limit 100' https://triplestore.netwerkdigitaalerfgoed.nl/repositories/registry
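
The same query can also be sent from a script. The following is a minimal Python sketch that uses only the standard library and mirrors the curl call; the function names `build_sparql_request` and `run_query` are illustrative and not part of the Dataset Register.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://triplestore.netwerkdigitaalerfgoed.nl/repositories/registry"

QUERY = """
SELECT * {
  ?s a <http://www.w3.org/ns/dcat#Dataset> .
  ?s ?p ?o .
} LIMIT 100
"""


def build_sparql_request(query: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build a form-encoded POST request, like curl's --data-urlencode."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def run_query(query: str, endpoint: str = ENDPOINT) -> list:
    """Send the query and return the result bindings (needs network access)."""
    request = build_sparql_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `run_query(QUERY)` should return a list of bindings in the SPARQL JSON results format, where each binding maps the variable names `s`, `p` and `o` to objects with a `value` key.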

Automate registrations

If you want to automate the registration of dataset descriptions by connecting your (collection management) application to the Dataset Register, please see the HTTP API documentation.

Run the application

To run the application yourself (for instance, if you’d like to contribute, which you’re very welcome to do), follow these steps. As mentioned above, the hosted version is available at https://datasetregister.netwerkdigitaalerfgoed.nl/api.

This application stores data in a GraphDB RDF store, so you need to have that running locally:

docker run -p 7200:7200 docker-registry.ontotext.com/graphdb-free:9.6.0-adoptopenjdk11

Once GraphDB is running, you can start the application in development mode. Clone this repository and run:

npm install
npm run dev

Run in production

To run the application in production, first compile and then run it. You may want to disable logging, which is enabled by default:

npm run compile
LOG=false npm start

Configuration

You can configure the application through environment variables:

  • GRAPHDB_URL: the URL at which your GraphDB instance runs (default: http://localhost:7200).
  • GRAPHDB_USERNAME: if using authentication, your GraphDB username (default: empty).
  • GRAPHDB_PASSWORD: if using authentication, your GraphDB password (default: empty).
  • LOG: enable/disable logging (default: true).
  • CRAWLER_SCHEDULE: a schedule in Cron format; for example 0 * * * * to crawl every hour (default: crawling disabled).
  • REGISTRATION_URL_TTL: if crawling is enabled, a registered URL’s maximum age (in seconds) before it is fetched again (default: 86400, so one day).
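
To make the defaults concrete, here is a small Python sketch of how these variables and their documented defaults fit together. It does not reproduce the application's own code; the `Config` class and `load_config` function are hypothetical, and the way `LOG` is parsed is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Config:
    graphdb_url: str
    graphdb_username: Optional[str]
    graphdb_password: Optional[str]
    log: bool
    crawler_schedule: Optional[str]  # None means crawling is disabled
    registration_url_ttl: int  # in seconds


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    """Read the documented variables, falling back to the documented defaults."""
    return Config(
        graphdb_url=env.get("GRAPHDB_URL", "http://localhost:7200"),
        graphdb_username=env.get("GRAPHDB_USERNAME") or None,
        graphdb_password=env.get("GRAPHDB_PASSWORD") or None,
        # Assumption: any value other than "false" leaves logging enabled.
        log=env.get("LOG", "true").lower() != "false",
        crawler_schedule=env.get("CRAWLER_SCHEDULE") or None,
        registration_url_ttl=int(env.get("REGISTRATION_URL_TTL", "86400")),
    )
```

For example, `load_config({"CRAWLER_SCHEDULE": "0 * * * *"})` describes an instance that crawls every hour and refetches registration URLs older than one day.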

Run the tests

The tests are run automatically on CI.

To run the tests locally, clone this repository, then:

npm install
npm test

Components

Crawler

The crawler will periodically fetch registration URLs (schema:EntryPoint) to update the dataset descriptions stored in the Dataset Register.

To enable the crawler, set the CRAWLER_SCHEDULE configuration variable. The crawler will then check all registration URLs according to that schedule to see whether any of them have become outdated. A registration URL is considered outdated if it was last read more than REGISTRATION_URL_TTL seconds ago, that is, if its schema:dateRead lies further in the past than that.

If any outdated registration URLs are found, they are fetched again and the results are updated in the RDF store.
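
The outdatedness rule can be sketched in Python as follows. This is an illustration only, not the application's own implementation: `is_outdated` and `select_outdated` are hypothetical names, and the dictionary from URL to last-read time stands in for the schema:dateRead values stored in the RDF store.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def is_outdated(date_read: datetime, ttl_seconds: int, now: datetime) -> bool:
    """A URL is outdated if it was last read more than ttl_seconds ago."""
    return now - date_read > timedelta(seconds=ttl_seconds)


def select_outdated(
    registrations: Dict[str, datetime],
    ttl_seconds: int = 86400,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return the registration URLs that the crawler would fetch again."""
    now = now or datetime.now(timezone.utc)
    return [
        url
        for url, date_read in registrations.items()
        if is_outdated(date_read, ttl_seconds, now)
    ]
```

With the default REGISTRATION_URL_TTL of 86400 seconds, a URL last read two days ago would be selected, while one read an hour ago would be left alone until a later crawl.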

Data model

schema:EntryPoint

Any URL registered by clients is added as a schema:EntryPoint to the Registrations graph.

Datasets are fetched from this URL on registration and when crawling it.

| Property | Description |
|---|---|
| schema:datePosted | When the URL was registered. |
| schema:dateRead | When the URL was last read by the application. The crawler updates this value when fetching descriptions. |
| schema:status | The HTTP status code last encountered when fetching the URL. |
| schema:validUntil | If the URL has become invalid, the date at which it did so. |
| schema:about | The set of schema:Datasets that the URL contains. The crawler updates this value when fetching descriptions. |

schema:Dataset

Each dataset that is found at the schema:EntryPoint registration URL gets added as a schema:Dataset to the Registrations graph.

| Property | Description |
|---|---|
| schema:dateRead | When the dataset was last read by the application. |
| schema:subjectOf | From which registration URL the dataset was read. |

dcat:Dataset

When a dataset’s RDF description is fetched and validated, it is added as a dcat:Dataset to its own graph. The URL of the graph corresponds to the dataset’s IRI.

If the dataset’s description is provided in Schema.org rather than DCAT, the description is first converted to DCAT. The ‘Based on’ column shows the corresponding Schema.org property. See the Requirements for Datasets for more details.

| Property | Description | Based on |
|---|---|---|
| dct:title | Dataset title. | schema:name |
| dct:alternative | Dataset alternate title. | schema:alternateName |
| dct:identifier | Dataset identifier. | schema:identifier |
| dct:description | Dataset description. | schema:description |
| dct:license | Dataset license. | schema:license |
| dct:language | Language(s) in which the dataset is available. | schema:inLanguage |
| dcat:keyword | Keywords or tags that describe the dataset. | schema:keywords |
| dcat:landingPage | URL of a webpage where the dataset is described. | schema:mainEntityOfPage |
| dct:source | URL(s) of datasets the dataset is based on. | schema:isBasedOn |
| dct:created | Dataset creation date. | schema:dateCreated |
| dct:issued | Dataset publication date. | schema:datePublished |
| dct:modified | Dataset last modification date. | schema:dateModified |
| owl:versionInfo | Dataset version. | schema:version |
| dct:creator | Dataset creator. | schema:creator |
| dct:publisher | Dataset publisher. | schema:publisher |
| dcat:distribution | Dataset distributions. | schema:distribution |
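
The property mapping above can be expressed as a simple lookup. The Python sketch below only renames top-level property names in a flat dictionary. The actual conversion operates on RDF and also covers nested organizations and distributions, so treat `SCHEMA_TO_DCAT` and `to_dcat` as hypothetical illustrations. Dropping properties that are not in the table is an assumption made for the sketch.

```python
# Schema.org property -> DCAT-style property, as listed in the table.
SCHEMA_TO_DCAT = {
    "schema:name": "dct:title",
    "schema:alternateName": "dct:alternative",
    "schema:identifier": "dct:identifier",
    "schema:description": "dct:description",
    "schema:license": "dct:license",
    "schema:inLanguage": "dct:language",
    "schema:keywords": "dcat:keyword",
    "schema:mainEntityOfPage": "dcat:landingPage",
    "schema:isBasedOn": "dct:source",
    "schema:dateCreated": "dct:created",
    "schema:datePublished": "dct:issued",
    "schema:dateModified": "dct:modified",
    "schema:version": "owl:versionInfo",
    "schema:creator": "dct:creator",
    "schema:publisher": "dct:publisher",
    "schema:distribution": "dcat:distribution",
}


def to_dcat(description: dict) -> dict:
    """Rename mapped Schema.org keys; keys without a mapping are dropped."""
    return {
        SCHEMA_TO_DCAT[key]: value
        for key, value in description.items()
        if key in SCHEMA_TO_DCAT
    }
```

For instance, `to_dcat({"schema:name": "My dataset"})` yields a dictionary with the single key `dct:title`.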

foaf:Organization

The objects of both the dct:creator and dct:publisher dataset properties have type foaf:Organization.

If the dataset’s organizations are provided in Schema.org rather than DCAT, the organizations are first converted to DCAT. The ‘Based on’ column shows the corresponding Schema.org property. See the Requirements for Datasets for more details.

| Property | Description | Based on |
|---|---|---|
| foaf:name | Organization name. | schema:name |

dcat:Distribution

The objects of dcat:distribution dataset properties have type dcat:Distribution.

If the dataset’s distributions are provided in Schema.org rather than DCAT, the distributions are first converted to DCAT. The ‘Based on’ column shows the corresponding Schema.org property. See the Requirements for Datasets for more details.

| Property | Description | Based on |
|---|---|---|
| dcat:accessURL | Distribution URL. | schema:contentUrl |
| dcat:mediaType | Distribution’s IANA media type. | schema:fileFormat |
| dct:format | Distribution content type (e.g. text/turtle). | schema:encodingFormat |
| dct:issued | Distribution publication date. | schema:datePublished |
| dct:modified | Distribution last modification date. | schema:dateModified |
| dct:description | Distribution description. | schema:description |
| dct:language | Distribution language. | schema:inLanguage |
| dct:license | Distribution license. | schema:license |
| dct:title | Distribution title. | schema:name |
| dcat:byteSize | Distribution’s download size in bytes. | schema:contentSize |

Allow list

A registration URL must be on a domain that is allowed before it can be added to the Register. Allowed domains are administered in the https://data.netwerkdigitaalerfgoed.nl/registry/allowed_domain_names RDF graph.

To add a domain name to the allow list, run a SPARQL update such as:

INSERT DATA { 
    GRAPH <https://data.netwerkdigitaalerfgoed.nl/registry/allowed_domain_names> { 
        [] <https://data.netwerkdigitaalerfgoed.nl/allowed_domain_names/def/domain_name> "your-domain.com" .
    }
}
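
If you add domains regularly, the update can be generated from a script. The Python sketch below builds the same SPARQL update for a given domain name; `allow_domain_update` is a hypothetical helper, and the whitespace check is an assumption about what a sensible domain name looks like. Executing the update against the RDF store is left out.

```python
GRAPH = "https://data.netwerkdigitaalerfgoed.nl/registry/allowed_domain_names"
PREDICATE = "https://data.netwerkdigitaalerfgoed.nl/allowed_domain_names/def/domain_name"


def allow_domain_update(domain_name: str) -> str:
    """Return a SPARQL INSERT DATA update that allows the given domain."""
    if not domain_name or any(char.isspace() for char in domain_name):
        raise ValueError("domain name must be non-empty and contain no whitespace")
    # Escape backslashes and double quotes so the literal stays well-formed.
    escaped = domain_name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "INSERT DATA {\n"
        f"    GRAPH <{GRAPH}> {{\n"
        f'        [] <{PREDICATE}> "{escaped}" .\n'
        "    }\n"
        "}"
    )
```

Calling `allow_domain_update("your-domain.com")` produces the update shown above, ready to be sent to the store's SPARQL update endpoint.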