
Data2Semantics (From Data to Semantics for Scientific Data Publishers)

The goal of this project is to improve the availability and accessibility of scientific knowledge, from the dissemination of results in traditional scientific publications to the publication of the underlying research data.

The inspiration for Data2Semantics comes from the wish, shared by scientists and the general public alike, for open access to publicly funded research data.

In 2013, Nature published a special issue on the future of publishing. Likewise, EU Commissioner Neelie Kroes promotes the opening up of scientific data. The Data2Semantics project focuses on data management in e-Science, but its results can be generalized to other fields, such as Open Government scenarios.

Our central research question is: how to share, publish, interpret and reuse scientific data on the Web?

To answer this question we need to tackle a number of ICT challenges. We must transform structurally heterogeneous data into a shareable form, while maintaining source integrity so that the original data can be recovered from the transformed and shared data. During transformation and sharing we must construct or maintain provenance information: metadata about the origin of the data and the transformations applied to it, both equally important for tracing the sources of information.
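The idea of recording each transformation so that the original source remains traceable can be sketched as a small derivation log. This is an illustrative sketch only, not the project's actual tooling; the record structure and names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ProvRecord:
    """One provenance assertion: an entity was derived from a source by an activity."""
    entity: str
    source: str
    activity: str

@dataclass
class ProvLog:
    records: list = field(default_factory=list)

    def derive(self, entity, source, activity):
        """Record that `entity` was produced from `source` by `activity`."""
        self.records.append(ProvRecord(entity, source, activity))

    def trace(self, entity):
        """Walk derivation links back to the original source data."""
        chain = [entity]
        current = entity
        while True:
            rec = next((r for r in self.records if r.entity == current), None)
            if rec is None:
                break
            chain.append(rec.source)
            current = rec.source
        return chain

log = ProvLog()
log.derive("clean.csv", "raw.xls", "spreadsheet-conversion")
log.derive("stats.json", "clean.csv", "aggregation")
print(log.trace("stats.json"))  # ['stats.json', 'clean.csv', 'raw.xls']
```

Because every derived artifact points back to its source and the activity that produced it, the original data can always be recovered from the shared result by following the chain.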

When many scientists publish different datasets, we must find ways to link semantically heterogeneous datasets. And when scientists publish data over longer periods of time, users must be able to detect and manage semantic concept drift: gradual changes in the meaning of a concept across disparate semantic descriptions.
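One simple way to quantify concept drift between two dataset versions is to compare the sets of terms a concept co-occurs with in each version. The sketch below uses Jaccard distance for this; it is a minimal illustration under assumed inputs, not the drift-detection method used in the project.

```python
def jaccard_drift(context_a, context_b):
    """Return 1 - Jaccard similarity of two term-context sets:
    0.0 means identical usage, 1.0 means completely disjoint usage."""
    a, b = set(context_a), set(context_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Hypothetical contexts of the concept "household" in two census datasets:
drift = jaccard_drift({"census", "household", "occupation"},
                      {"census", "resident", "dwelling"})
print(drift)  # 0.8 -> the concept is used quite differently across versions
```

A drift score above some threshold would flag the concept for human review before the two datasets are linked.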

The results of the Data2Semantics project address parts of these questions: data conversion and interpretation tools (TabLinker and plsheet); provenance (re-)construction, tracing and visualization tools (Recoprov, Git2PROV, PROV-O-Matic and PROV-O-Viz); data enrichment and publishing tools (Linkitup, raw2ld and the Hubble Annotator); provenance-aware data analysis tools (Ducktape, RDF graph kernels and compression methods); sampling methods for RDF graphs (SampLD); user-friendly access to RDF data (YASGUI); and demonstrators in the clinical domain (Hubble, AERS-LD and Virgil).

These products build on technological innovations developed within the Data2Semantics project, often in collaboration with our international partners, such as standardizing provenance representations, integrating provenance tracking into third-party tools, developing more scalable kernel methods for RDF graphs, computable approximations of Kolmogorov complexity, graph sampling and compression technology, and spreadsheet interpretation and annotation.

Data2Semantics has produced several journal publications, among them a paper on transparency of the data supply chain (IEEE Internet Computing, on understandable and provenance-aware data science) and an SRC paper on enhancing scholarly publication.

Biggest results so far

Tracking the use of data all the way

Money well spent? Provenance is key to improving the efficiency, reproducibility, integrity and trustworthiness of research. Data analysis and transformation are increasingly important activities in both scientific research (e.g. climatology) and other fields (e.g. open government data). Unfortunately, it is hard to assess the trustworthiness and quality of results without knowing what data an outcome was based on, and through what procedure the outcome was reached. This information about the entities, activities and people involved in using data is called data provenance.

Our demo shows the integration of data provenance tracking and visualization into an existing, popular data science environment. The demo applies our work on the W3C PROV standard, provenance visualization and provenance tracking.
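The W3C PROV data model relates entities (data), activities (processes) and their generation links. As a rough illustration of what such a trace looks like, the sketch below emits PROV-N-style statements for a tiny example; the `ex:` names are hypothetical and the function is an illustrative sketch, not part of PROV-O-Matic or PROV-O-Viz.

```python
def prov_n(entities, activities, generations):
    """Emit PROV-N-style statements (W3C PROV notation) for a tiny trace.
    `generations` is a list of (entity, activity) pairs meaning
    'entity wasGeneratedBy activity'."""
    lines = [f"entity(ex:{e})" for e in entities]
    lines += [f"activity(ex:{a})" for a in activities]
    lines += [f"wasGeneratedBy(ex:{e}, ex:{a}, -)" for e, a in generations]
    return "\n".join(lines)

doc = prov_n(entities=["stats"],
             activities=["aggregation"],
             generations=[("stats", "aggregation")])
print(doc)
```

A visualization tool can then render these statements as a graph of data flowing through activities, which is the kind of view the demo integrates into the analysis environment.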

Our work allows fine-grained tracing of conclusions in scientific papers back to intermediate results, other publications and source data, across applications.

ICT science question: data are manipulated in a wide variety of tools. It is a grand scientific challenge to construct, reconstruct, communicate and connect data provenance traces. In solving this challenge we have to deal with a lack of standards and a lack of integration between tools. Another challenge is to integrate data provenance into environments that scientists already use, without forcing them to learn a new tool or change their way of working.

Involved COMMIT/partners: VU University, University of Amsterdam

Finding new drugs by visualizing the effect of their ingredients

Big data does not change the world; insight into big data changes the world. Large-scale visualization provides this insight. Finding new drugs to cure diseases is hard, because the chemicals in a drug interact in very complex ways with the cells and proteins in the human body. Visualizing this complex network of interactions is important for improving the development of new drugs.

Our demo shows how the interaction between the chemicals in a drug and the proteins in the body can be explored interactively and rapidly. This rapid interaction makes it possible to get answers while you think, as opposed to waiting for answers, which breaks the train of thought.

ICT science question: how to effectively visualize large graphs? This is a hard problem: graphs with more than a thousand nodes tend to become cluttered under most visualization algorithms. The implementation shown in this demo accelerates the visualization using a graphics processing unit (GPU), making it possible to interact with large graphs (on the order of a million nodes).
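The expensive inner loop that a GPU parallelizes in such layouts is the per-pair force computation of force-directed algorithms. The sketch below shows one Fruchterman-Reingold-style iteration on the CPU for a toy graph; it illustrates the computation pattern only and is not the demo's GPU implementation.

```python
import math

def fr_step(pos, edges, k=1.0, step=0.1):
    """One force-directed layout iteration: repulsion between all node pairs
    (the O(n^2) part that GPUs parallelize) plus attraction along edges."""
    disp = {v: [0.0, 0.0] for v in pos}
    nodes = list(pos)
    for i, u in enumerate(nodes):                 # repulsive forces
        for v in nodes[i + 1:]:
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9
            f = k * k / d
            disp[u][0] += dx / d * f; disp[u][1] += dy / d * f
            disp[v][0] -= dx / d * f; disp[v][1] -= dy / d * f
    for u, v in edges:                            # attractive forces
        dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
        d = math.hypot(dx, dy) or 1e-9
        f = d * d / k
        disp[u][0] -= dx / d * f; disp[u][1] -= dy / d * f
        disp[v][0] += dx / d * f; disp[v][1] += dy / d * f
    return {v: (pos[v][0] + step * disp[v][0], pos[v][1] + step * disp[v][1])
            for v in pos}

layout = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.0, 1.0)}
layout = fr_step(layout, [("a", "b"), ("b", "c")])
```

Each node pair's force can be computed independently, which is why mapping this loop onto thousands of GPU threads gives the interactive frame rates needed for million-node graphs.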

Involved COMMIT/partners: Synerscope, VU University, Open Phacts Consortium

One-click semantic enrichment of scientific data

The value of scientific data is determined by its reuse in academia and industry. Therefore every effort should be made to make all research data available and discoverable. Researchers, publishers and funding agencies increasingly recognize the importance of publishing the original research data along with traditional journal articles. However, the threshold for publishing data in a way that enables reproducibility and reuse is still too high for scientists without specialized technical skills.

To solve this problem, we have developed a web-based dashboard that incorporates a number of techniques for enriching research data with appropriate metadata, such as links to relevant external resources and identifiers. It also helps the user upload the results to a popular research data repository where the data can be discovered, verified and ultimately reused by other researchers.

ICT science question: how can scientists publish their research data in such a way that colleagues can easily reproduce or reuse it? We approach this problem by automating metadata discovery and enabling users to publish results as Linked Data, following standards and best practices. For the user's convenience we hide the details of the underlying semantic web technologies.
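Publishing enriched metadata as Linked Data boils down to emitting RDF triples that attach descriptive properties and external links to a dataset URI. The sketch below serializes a few Dublin Core terms and `owl:sameAs` links as N-Triples; the URIs and field choices are illustrative assumptions, not the dashboard's actual output format.

```python
def ntriples(dataset_uri, title, creator, links):
    """Serialize minimal Dublin Core metadata plus external-resource links
    for a dataset as N-Triples (one RDF triple per line)."""
    DCT = "http://purl.org/dc/terms/"
    OWL = "http://www.w3.org/2002/07/owl#"
    triples = [
        f'<{dataset_uri}> <{DCT}title> "{title}" .',
        f'<{dataset_uri}> <{DCT}creator> "{creator}" .',
    ]
    triples += [f"<{dataset_uri}> <{OWL}sameAs> <{link}> ." for link in links]
    return "\n".join(triples)

rdf = ntriples("http://example.org/dataset/42",
               "Trial results", "J. Doe",
               ["http://dbpedia.org/resource/Clinical_trial"])
print(rdf)
```

Hiding this serialization behind a one-click dashboard is exactly what lets non-experts publish reusable, standards-compliant data.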

Involved COMMIT/partners: Figshare, VU University, Elsevier, University of Amsterdam, DANS