The main goal of these pipelines is to define and deploy (semi-)automatic processes that carry out the steps needed to transform and publish input datasets from various heterogeneous sources as Linked Data, i.e. automatic processes for data preparation and integration using Linked Data. To this end, a pipeline connects the data processing components that transform data into RDF format, or translate queries between SPARQL and the native data access interface, together with the linking steps and the mapping specifications used to process the input datasets. Each pipeline instance is configured to support a specific type of input dataset (e.g. same format, model and delivery form).
The service enables the execution of pipelines having the following design goals:
- Direct re-executability and re-applicability of a pipeline (e.g. to extended/updated datasets)
- Easy reusability of a pipeline
- Easy adaptation of a pipeline for new input datasets
- Execution of a pipeline that is as automatic as possible, with fully automated processes as the final target
- Pipelines should support both (mostly) static data and dynamic data (e.g. sensor data)
Following the best practices for Linked Data publication, these pipelines i) take as input selected datasets collected from heterogeneous sources (shapefiles, GeoJSON, CSV, relational databases, RESTful APIs), ii) curate and/or pre-process the datasets when needed, iii) select and/or create/extend the vocabularies (e.g. ontologies) for representing the data in semantic form, iv) process and transform the datasets into RDF triples according to the underlying ontologies, v) perform any necessary post-processing operations on the RDF data, vi) identify links with other datasets, and vii) publish the generated datasets as Linked Data, applying the required access control mechanisms.
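Step iv), the transformation of a source dataset into RDF triples, can be sketched for the simplest input type (CSV). The following is a minimal illustration, not any specific tool used in the pipelines; the vocabulary namespace, subject URI pattern, and column names are all hypothetical:

```python
import csv
import io

# Hypothetical example vocabulary and subject URI base (assumptions).
EX = "http://example.org/ns#"
CITY = "http://example.org/city/"

# A tiny stand-in for a collected CSV dataset (step i).
CSV_DATA = """id,name,population
1,Springfield,30000
2,Shelbyville,25000
"""

def csv_to_ntriples(text):
    """Lift CSV rows into RDF triples serialized as N-Triples (step iv)."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{CITY}{row['id']}>"
        # One triple per non-identifier column; object values become literals.
        triples.append(f'{subject} <{EX}name> "{row["name"]}" .')
        triples.append(f'{subject} <{EX}population> "{row["population"]}" .')
    return "\n".join(triples)

print(csv_to_ntriples(CSV_DATA))
```

In a real pipeline instance this per-column mapping would not be hard-coded but driven by a mapping specification, so the same process can be re-executed on extended or updated versions of the dataset.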
The transformation process depends on several aspects of the data: the format of the available input data, the purpose (target use case) of the transformation, and the volatility of the data (how dynamic it is). Various components were used for the specific pipeline tasks to reach the final goal of transformation into Linked Data. The relevant components identified and used in each pipeline instance are discussed in later subsections of this deliverable.
The choice of the most suitable transformation tools also depends on the targeted usage of the transformed Linked Data, and the goal of accessing the data integrated with other datasets likewise influences the preferred tools. Finally, the rate of change of the data (how often it changes) further determines the transformation methods and the related tools. Based on these characteristics, i.e. the mode/format and volatility of the input data sources, there are broadly two main approaches to transforming a dataset:
- Data upgrade, or semantic lifting, which consists of generating RDF data from the source dataset according to mapping descriptions and then storing it in a semantic triple store
- On-the-fly query transformation, which allows evaluating SPARQL queries over a virtual RDF dataset by rewriting those queries into the source query language according to the mapping descriptions. In this scenario, the data physically stays at its source, and a new layer provides access to it through the virtual RDF dataset. This applies mainly to highly dynamic relational datasets (e.g. sensor data) or RESTful APIs.
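The second approach can be illustrated with a deliberately minimal sketch: a single triple pattern is rewritten into SQL against a relational source according to a mapping description, so the data never leaves its store. The mapping entries, table, column, and predicate names below are all hypothetical, and real systems rewrite full SPARQL queries rather than single patterns:

```python
import sqlite3

# Hypothetical mapping description: predicate IRI -> (table, subject column,
# object column). Real mappings (e.g. R2RML) are far richer than this.
MAPPING = {
    "http://example.org/ns#name": ("sensor", "id", "name"),
}

def rewrite_and_run(conn, predicate):
    """Rewrite the triple pattern (?s <predicate> ?o) into SQL and evaluate it."""
    table, s_col, o_col = MAPPING[predicate]
    sql = f"SELECT {s_col}, {o_col} FROM {table}"  # the rewritten query
    # Bindings are produced on the fly; no RDF is materialized or stored.
    return [
        (f"http://example.org/sensor/{s}", o)
        for s, o in conn.execute(sql)
    ]

# A stand-in for the dynamic relational source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor (id INTEGER, name TEXT)")
conn.execute("INSERT INTO sensor VALUES (1, 'thermo-1')")

print(rewrite_and_run(conn, "http://example.org/ns#name"))
```

Because the SQL is generated at query time, updates to the sensor table are visible to the next SPARQL query immediately, which is why this approach suits highly dynamic sources.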
SPECIAL ACCESS CONDITIONS
Mapping tools embedded in the service
- Data integration
- Data upgrade
- Linked Data publication
- Knowledge discovery