Data2Semantics (From Data to Semantics for Scientific Data Publishers)

ICT-Challenge

The famous scientist Sir Francis Bacon is alleged to have said “Knowledge is power”. This insight from the 16th century is still relevant in today's world, where competitive advantage is linked to better knowledge.

The motivation of the project is to improve the availability and accessibility of scientific knowledge. This holds both for the dissemination of results through traditional publications and for the publication of scientific data. The inspiration for Data2Semantics comes from the increasing call by scientists and the public for open access to publicly funded datasets. Examples of this call are recent special issues of Nature on the future of publishing and recent speeches by policy makers, such as the one by EU Commissioner Mrs. Neelie Kroes in Stockholm on opening up scientific data. The Data2Semantics project focuses on data management in e-Science.

Our central research question is: how can we share, publish, access, analyse, interpret and reuse scientific big data on the Web? To answer this question we need to tackle a number of ICT challenges. We must transform structurally heterogeneous data into a shareable form. This must be done while maintaining source integrity, that is, making sure that the original data can be recovered from the transformed and shared data. During this transformation and sharing process we must construct or maintain provenance information: metadata about the origin of the data and the transformations applied to it, which makes it possible to trace the sources of the information. When many scientists publish different datasets, we must find ways to link semantically heterogeneous datasets. And finally, when scientists publish data over longer periods of time, we must be able to detect and manage semantic concept drift (differences in meaning in disparate semantic descriptions).
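To make the notions of source integrity and provenance more concrete, the following is a minimal sketch in Python. It is not taken from any of the project's tools: the function names, the `http://example.org/` base URI and the shape of the provenance record are all assumptions made for illustration. It converts tabular rows into subject-predicate-object triples, records a checksum of the source alongside a description of the transformation, and then checks that the original rows can be recovered from the triples.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical base URI, chosen only for this illustration.
BASE = "http://example.org/"
ROW_PREFIX = BASE + "row/"
COLUMN_PREFIX = BASE + "column/"


def checksum(rows):
    """Return a SHA-256 digest of the rows in a canonical JSON form."""
    canonical = json.dumps(rows, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def table_to_triples(rows):
    """Turn a list of dict rows into (subject, predicate, object) triples."""
    triples = []
    for index, row in enumerate(rows):
        subject = f"{ROW_PREFIX}{index}"
        for column, value in row.items():
            triples.append((subject, f"{COLUMN_PREFIX}{column}", value))
    return triples


def triples_to_table(triples):
    """Inverse of table_to_triples: rebuild the rows from the triples."""
    rows = {}
    for subject, predicate, value in triples:
        index = int(subject[len(ROW_PREFIX):])
        column = predicate[len(COLUMN_PREFIX):]
        rows.setdefault(index, {})[column] = value
    return [rows[index] for index in sorted(rows)]


def provenance_record(rows, activity):
    """Describe where the shared data came from (illustrative fields only)."""
    return {
        "source_checksum": checksum(rows),
        "activity": activity,
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }


source_rows = [
    {"patient": "p1", "drug": "aspirin"},
    {"patient": "p2", "drug": "ibuprofen"},
]
shared_triples = table_to_triples(source_rows)
record = provenance_record(source_rows, "table_to_triples")

# Source integrity: the original rows can be recovered from the shared form,
# and the recovered rows match the checksum stored in the provenance record.
recovered_rows = triples_to_table(shared_triples)
integrity_ok = checksum(recovered_rows) == record["source_checksum"]
```

The sketch only round-trips rows that have at least one column, and it ignores real-world concerns such as datatypes, URI escaping and standard provenance vocabularies.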

In 2012 the project achieved early results that address parts of all of these challenges and cover all steps in the research cycle: data conversion and interpretation tools (TabLinker and plsheet); provenance (re-)construction tools (Recoprov and Prov-O-Matic); data enrichment tools (Linkitup, raw2ld and the Hubble Annotator); data analysis tools (RDF graph kernels); and demonstrators in the clinical domain (Hubble, AERS-LD).