Dataset Register

This is the NDE Dataset Register, a service that helps users discover datasets.

Institutions (such as cultural heritage organizations) register their dataset descriptions with the NDE Dataset Register using its HTTP API. The Dataset Register builds an index by fetching, validating and periodically crawling dataset descriptions.

The HTTP API is documented at https://datasetregister.netwerkdigitaalerfgoed.nl/api.

See the Dataset Register Demonstrator, a client application for this repository’s HTTP API, for more background information (in Dutch).

Design principles

The application follows modern standards and best practices.
The application uses Linked Data Platform (LDP) for HTTP operations.
The application prefers JSON-LD as the data exchange format.
The application uses established Linked Data vocabularies, including Schema.org and DCAT.

Getting started

Validate dataset descriptions

Dataset descriptions must adhere to the Requirements for Datasets. You can check validity using the validate API call.

Submit dataset descriptions

To submit your dataset descriptions to the Dataset Register, use the datasets API call. The domain of each URL must be on the allow list before the URL can be added to the Register.

Search dataset descriptions

You can retrieve dataset descriptions registered by yourself and others from our triple store’s web interface.

Alternatively, query the SPARQL endpoint at https://triplestore.netwerkdigitaalerfgoed.nl/repositories/registry directly. For example, using Comunica:

comunica-sparql sparql@https://triplestore.netwerkdigitaalerfgoed.nl/repositories/registry 'select * {?s a <http://www.w3.org/ns/dcat#Dataset> . ?s ?p ?o . } limit 100'

Or curl:

curl -H 'Accept: application/sparql-results+json' --data-urlencode 'query=select * {?s a <http://www.w3.org/ns/dcat#Dataset> . ?s ?p ?o . } limit 100' https://triplestore.netwerkdigitaalerfgoed.nl/repositories/registry
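The same query can also be sent from application code. The following TypeScript sketch mirrors the curl call; it assumes a runtime that offers a `fetch`-compatible function (for example Node.js 18 or later). The names `buildSparqlRequest`, `queryDatasets` and `FetchLike` are illustrative and not part of the Dataset Register.

```typescript
// Endpoint and query taken from the examples above.
const SPARQL_ENDPOINT = 'https://triplestore.netwerkdigitaalerfgoed.nl/repositories/registry';
const DATASET_QUERY = 'select * {?s a <http://www.w3.org/ns/dcat#Dataset> . ?s ?p ?o . } limit 100';

interface SparqlRequest {
  method: string;
  headers: Record<string, string>;
  body: string;
}

// Minimal shape of a fetch-like function (hypothetical helper type), so the
// sketch does not depend on a particular HTTP library.
type FetchLike = (
  url: string,
  init: SparqlRequest
) => Promise<{ok: boolean; status: number; json(): Promise<unknown>}>;

// Build a form-encoded SPARQL query request, similar to curl --data-urlencode.
function buildSparqlRequest(query: string): SparqlRequest {
  return {
    method: 'POST',
    headers: {
      Accept: 'application/sparql-results+json',
      'Content-Type': 'application/x-www-form-urlencoded',
    },
    body: `query=${encodeURIComponent(query)}`,
  };
}

// Send the query and return the parsed SPARQL JSON results.
async function queryDatasets(fetchFn: FetchLike): Promise<unknown> {
  const response = await fetchFn(SPARQL_ENDPOINT, buildSparqlRequest(DATASET_QUERY));
  if (!response.ok) {
    throw new Error(`SPARQL request failed with HTTP status ${response.status}`);
  }
  return response.json();
}
```

In Node.js 18 or later, passing the global `fetch` as `fetchFn` should be enough; in tests, a stub function can be passed instead so that no network access is needed.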

Automate registrations

If you want to automate dataset description registrations by connecting your (collection management) application to the Dataset Register, please see the HTTP API documentation.
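As a rough illustration of what such an integration could look like, the sketch below builds requests for the validate and datasets API calls. The paths (`/datasets/validate`, `/datasets`), the HTTP methods and the JSON-LD body shape are assumptions made for this example and are not confirmed by this document; the HTTP API documentation is the authoritative source. The helper names are hypothetical.

```typescript
// Base URL of the hosted API, taken from the text above.
const API_BASE = 'https://datasetregister.netwerkdigitaalerfgoed.nl/api';

interface ApiRequest {
  url: string;
  method: string;
  headers: Record<string, string>;
  body: string;
}

// ASSUMPTION: both calls accept a JSON-LD document whose @id is the URL of
// the dataset description. Verify paths, methods and body in the API docs.
function buildApiRequest(path: string, method: string, descriptionUrl: string): ApiRequest {
  return {
    url: `${API_BASE}${path}`,
    method,
    headers: {'Content-Type': 'application/ld+json'},
    body: JSON.stringify({'@id': descriptionUrl}),
  };
}

// Hypothetical helper for the validate API call (assumed path and method).
function validateRequest(descriptionUrl: string): ApiRequest {
  return buildApiRequest('/datasets/validate', 'PUT', descriptionUrl);
}

// Hypothetical helper for the datasets API call (assumed path and method).
function registerRequest(descriptionUrl: string): ApiRequest {
  return buildApiRequest('/datasets', 'POST', descriptionUrl);
}
```

A collection management application could validate first and submit only when validation succeeds, sending each request object with the HTTP client of its choice.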

Run the application

To run the application yourself (for instance to contribute, which you’re very welcome to do), follow these steps. As mentioned above, a hosted version is available at https://datasetregister.netwerkdigitaalerfgoed.nl/api.

This application stores data in a GraphDB RDF store, so you need to have that running locally:

docker run -p 7200:7200 docker-registry.ontotext.com/graphdb-free:9.6.0-adoptopenjdk11

Once GraphDB is running, you can start the application in development mode. Clone this repository and run:

npm install
npm run dev

Run in production

To run the application in production, first compile and then run it. You may want to disable logging, which is enabled by default:

npm run compile
LOG=false npm start

Configuration

You can configure the application through environment variables:

GRAPHDB_URL: the URL at which your GraphDB instance runs (default: http://localhost:7200).
GRAPHDB_USERNAME: if using authentication, your GraphDB username (default: empty).
GRAPHDB_PASSWORD: if using authentication, your GraphDB password (default: empty).
LOG: enable/disable logging (default: true).
CRAWLER_SCHEDULE: a schedule in Cron format; for example 0 * * * * to crawl every hour (default: crawling disabled).
REGISTRATION_URL_TTL: if crawling is enabled, a registered URL’s maximum age (in seconds) before it is fetched again (default: 86400, so one day).
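To make the defaults concrete, the following sketch reads these variables from an environment-like object. It is not the application's actual configuration code: the `Config` interface and `readConfig` function are hypothetical, and treating any `LOG` value other than `false` as enabled is an assumption.

```typescript
interface Config {
  graphDbUrl: string;
  graphDbUsername: string | undefined;
  graphDbPassword: string | undefined;
  log: boolean;
  crawlerSchedule: string | undefined; // undefined means crawling is disabled
  registrationUrlTtl: number; // in seconds
}

// Read the configuration from an environment-like object, applying the
// defaults listed above.
function readConfig(env: Record<string, string | undefined>): Config {
  return {
    graphDbUrl: env.GRAPHDB_URL ?? 'http://localhost:7200',
    graphDbUsername: env.GRAPHDB_USERNAME,
    graphDbPassword: env.GRAPHDB_PASSWORD,
    // ASSUMPTION: logging stays enabled unless LOG is exactly "false".
    log: env.LOG !== 'false',
    crawlerSchedule: env.CRAWLER_SCHEDULE,
    registrationUrlTtl: Number(env.REGISTRATION_URL_TTL ?? 86400),
  };
}
```

In Node.js, `readConfig(process.env)` would pick up the variables set on the command line, as in `LOG=false npm start`.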

Run the tests

The tests are run automatically on CI.

To run the tests locally, clone this repository, then:

npm install
npm test

Components

Crawler

The crawler will periodically fetch registration URLs (schema:EntryPoint) to update the dataset descriptions stored in the Dataset Register.

To enable the crawler, set the CRAWLER_SCHEDULE configuration variable. The crawler then checks all registration URLs according to that schedule to see whether any have become outdated. A registration URL is considered outdated if it was last read more than REGISTRATION_URL_TTL seconds ago, that is, if its schema:dateRead is older than that.

Any outdated registration URLs that are found are fetched again and updated in the RDF store.
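The outdated check described above can be sketched as a simple age comparison. This is an illustration of the rule rather than the crawler's actual implementation; the `Registration` type and both function names are hypothetical.

```typescript
interface Registration {
  url: string;
  dateRead: Date; // corresponds to schema:dateRead
}

// A registration URL is outdated when it was last read more than
// ttlSeconds (REGISTRATION_URL_TTL) ago.
function isOutdated(registration: Registration, ttlSeconds: number, now: Date): boolean {
  const ageSeconds = (now.getTime() - registration.dateRead.getTime()) / 1000;
  return ageSeconds > ttlSeconds;
}

// Select the registrations that a crawl run would need to fetch again.
function findOutdated(registrations: Registration[], ttlSeconds: number, now: Date): Registration[] {
  return registrations.filter(registration => isOutdated(registration, ttlSeconds, now));
}
```

With the default TTL of 86400 seconds and an hourly schedule, a URL read 12 hours ago would be skipped, while one read two days ago would be fetched again on the next run.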

Data model

`schema:EntryPoint`

Any URL registered by clients is added as a schema:EntryPoint to the Registrations graph.

Datasets are fetched from this URL on registration and when crawling it.

Property	Description
`schema:datePosted`	When the URL was registered.
`schema:dateRead`	When the URL was last read by the application. The crawler updates this value when fetching descriptions.
`schema:status`	The HTTP status code last encountered when fetching the URL.
`schema:validUntil`	If the URL has become invalid, the date at which it did so.
`schema:about`	The set of `schema:Dataset`s that the URL contains. The crawler updates this value when fetching descriptions.

`schema:Dataset`

Each dataset that is found at the schema:EntryPoint registration URL gets added as a schema:Dataset to the Registrations graph.

Property	Description
`schema:dateRead`	When the dataset was last read by the application.
`schema:subjectOf`	From which registration URL the dataset was read.

`dcat:Dataset`

When a dataset’s RDF description is fetched and validated, it is added as a dcat:Dataset to its own graph. The URL of the graph corresponds to the dataset’s IRI.

If the dataset’s description is provided in Schema.org rather than DCAT, the description is first converted to DCAT. The ‘Based on’ column shows the corresponding Schema.org property. See the Requirements for Datasets for more details.

Property	Description	Based on
`dct:title`	Dataset title.	`schema:name`
`dct:alternative`	Dataset alternate title.	`schema:alternateName`
`dct:identifier`	Dataset identifier.	`schema:identifier`
`dct:description`	Dataset description.	`schema:description`
`dct:license`	Dataset license.	`schema:license`
`dct:language`	Language(s) in which the dataset is available.	`schema:inLanguage`
`dcat:keyword`	Keywords or tags that describe the dataset.	`schema:keywords`
`dcat:landingPage`	URL of a webpage where the dataset is described.	`schema:mainEntityOfPage`
`dct:source`	URL(s) of datasets the dataset is based on.	`schema:isBasedOn`
`dct:created`	Dataset creation date.	`schema:dateCreated`
`dct:issued`	Dataset publication date.	`schema:datePublished`
`dct:modified`	Dataset last modification date.	`schema:dateModified`
`owl:versionInfo`	Dataset version.	`schema:version`
`dct:creator`	Dataset creator.	`schema:creator`
`dct:publisher`	Dataset publisher.	`schema:publisher`
`dcat:distribution`	Dataset distributions.	`schema:distribution`
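The table above can be read as a lookup from Schema.org property to DCAT property. The sketch below expresses it that way and renames the top-level keys of a description. It only illustrates the mapping: it is not how the application performs the conversion, it ignores nested values such as organizations and distributions, and the names `SCHEMA_TO_DCAT` and `convertProperties` are hypothetical.

```typescript
// Schema.org property -> DCAT property, copied from the table above.
const SCHEMA_TO_DCAT: Record<string, string> = {
  'schema:name': 'dct:title',
  'schema:alternateName': 'dct:alternative',
  'schema:identifier': 'dct:identifier',
  'schema:description': 'dct:description',
  'schema:license': 'dct:license',
  'schema:inLanguage': 'dct:language',
  'schema:keywords': 'dcat:keyword',
  'schema:mainEntityOfPage': 'dcat:landingPage',
  'schema:isBasedOn': 'dct:source',
  'schema:dateCreated': 'dct:created',
  'schema:datePublished': 'dct:issued',
  'schema:dateModified': 'dct:modified',
  'schema:version': 'owl:versionInfo',
  'schema:creator': 'dct:creator',
  'schema:publisher': 'dct:publisher',
  'schema:distribution': 'dcat:distribution',
};

// Rename the known Schema.org keys of a flat description to their DCAT
// counterparts; keys that are not in the table are dropped.
function convertProperties(description: Record<string, unknown>): Record<string, unknown> {
  const result: Record<string, unknown> = {};
  Object.keys(description).forEach(property => {
    const target = SCHEMA_TO_DCAT[property];
    if (target !== undefined) {
      result[target] = description[property];
    }
  });
  return result;
}
```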

`foaf:Organization`

The objects of both the dct:creator and dct:publisher dataset properties have type foaf:Organization.

If the dataset’s organizations are provided in Schema.org rather than DCAT, the organizations are first converted to DCAT. The ‘Based on’ column shows the corresponding Schema.org property. See the Requirements for Datasets for more details.

Property	Description	Based on
`foaf:name`	Organization name.	`schema:name`

`dcat:Distribution`

The objects of dcat:distribution dataset properties have type dcat:Distribution.

If the dataset’s distributions are provided in Schema.org rather than DCAT, the distributions are first converted to DCAT. The ‘Based on’ column shows the corresponding Schema.org property. See the Requirements for Datasets for more details.

Property	Description	Based on
`dcat:accessURL`	Distribution URL.	`schema:contentUrl`
`dcat:mediaType`	Distribution’s IANA media type.	`schema:fileFormat`
`dct:format`	Distribution content type (e.g. `text/turtle`).	`schema:encodingFormat`
`dct:issued`	Distribution publication date.	`schema:datePublished`
`dct:modified`	Distribution last modification date.	`schema:dateModified`
`dct:description`	Distribution description.	`schema:description`
`dct:language`	Distribution language.	`schema:inLanguage`
`dct:license`	Distribution license.	`schema:license`
`dct:title`	Distribution title.	`schema:name`
`dcat:byteSize`	Distribution’s download size in bytes.	`schema:contentSize`

Allow list

A registration URL must be on a domain that is allowed before it can be added to the Register. Allowed domains are administered in the https://data.netwerkdigitaalerfgoed.nl/registry/allowed_domain_names RDF graph.

To add a domain name to the allow list, run a SPARQL update such as:

INSERT DATA { 
    GRAPH <https://data.netwerkdigitaalerfgoed.nl/registry/allowed_domain_names> { 
        [] <https://data.netwerkdigitaalerfgoed.nl/allowed_domain_names/def/domain_name> "your-domain.com" .
    }
}
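If allowed domains are added from a script, the update can be generated rather than typed by hand. The sketch below builds the same INSERT DATA statement for a given domain name. The function name `buildAllowDomainUpdate` and the domain name pattern are assumptions for this example; how the update is sent to the RDF store is left out.

```typescript
// Graph and predicate IRIs taken from the SPARQL update above.
const ALLOW_LIST_GRAPH = 'https://data.netwerkdigitaalerfgoed.nl/registry/allowed_domain_names';
const DOMAIN_NAME_PREDICATE = 'https://data.netwerkdigitaalerfgoed.nl/allowed_domain_names/def/domain_name';

// ASSUMPTION: a simple pattern for dot-separated host names; it is stricter
// than what the Register itself may accept.
const DOMAIN_NAME_PATTERN = /^[a-z0-9]([a-z0-9-]*[a-z0-9])?(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)+$/i;

// Build the SPARQL update that adds a domain name to the allow list.
function buildAllowDomainUpdate(domainName: string): string {
  // Reject anything that is not a plain domain name, so the value cannot
  // break out of the SPARQL string literal.
  if (!DOMAIN_NAME_PATTERN.test(domainName)) {
    throw new Error(`Not a valid domain name: ${domainName}`);
  }
  return [
    'INSERT DATA {',
    `    GRAPH <${ALLOW_LIST_GRAPH}> {`,
    `        [] <${DOMAIN_NAME_PREDICATE}> "${domainName}" .`,
    '    }',
    '}',
  ].join('\n');
}
```

Validating the input before interpolating it keeps a malformed or malicious value from altering the update.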
