4th Linked Data Quality Workshop: Discussion Session

This is a summary of the discussion session that took place during the 4th Workshop on Linked Data Quality (LDQ) 2017: http://ldq.semanticmultimedia.org

Linked Data Quality (LDQ) is still considered a big challenge, both in terms of everyday practice (dealing with buggy datasets) and in terms of societal impact (Linked Open Data (LOD) would have a higher impact if it were of higher quality). Specifically, it was mentioned that the quality of today's LOD Cloud can still be significantly improved.

Question 1: Can we further engage/stimulate community efforts towards improving the quality of the LOD Cloud?

Industry is known to encounter LDQ issues as well, but companies largely resolve these issues internally and often do not share the results of their cleaning operations. Alternatively, they rely only on premium data providers to ensure trustworthiness, or they track the provenance of the data and rank sources by reliability. However, this information is not always shared openly. The result is a situation in which closed data is of high quality, but open data is of lower quality.

Further ideas for engaging the community were discussed, such as:

  • Billion Triple Bug Challenge (Benchmark, gold standard): find, and possibly resolve, quality issues in a data collection that has 1,000,000,000 bugs
  • Dataset Metadata Challenge
  • Use-case driven challenges

Question 2: Can we provide an incentive model for companies and institutions to contribute to the improvement of the quality of Linked Open Data?

One of the reasons why it is so difficult to improve the quality of Linked Data may be its relative success: because the number and size of Linked Datasets are still growing rapidly, Linked Data will never be in a 'controlled state'. LDQ improvement must therefore be performed continuously, and LDQ approaches must be able to deal with data of growing size.

Question 3: Are current LDQ approaches able to run as a continuous improvement process? Are they able to scale in terms of the (ever growing) size of the data?

It is currently difficult to compare different LDQ approaches and to get them published as research papers because their evaluation is problematic. What is currently missing is an LDQ benchmark and, more importantly, best practices for preventing quality issues. There should also be a tool for attaching quality results to the data and generating a quality report from them. One idea is to connect LDQ research efforts to large-scale crowdsourcing initiatives (e.g., Wikidata), that is, to combine expert, non-expert (e.g., Amazon Mechanical Turk), and automated methods to enable a continuous quality assessment process.
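
To make the idea of a quality-report tool more concrete, the following is a minimal sketch in Python. It is an illustration only and not a tool that was presented at the workshop: the record type `QualityResult`, the two toy checks, and the report fields are all hypothetical, and triples are simplified to tuples of plain strings.

```python
from collections import Counter
from dataclasses import dataclass


# Hypothetical record: one quality finding attached to one triple.
@dataclass(frozen=True)
class QualityResult:
    triple: tuple  # (subject, predicate, object) as plain strings
    check: str     # name of the check that fired
    message: str   # human-readable description of the issue


def check_empty_object(triple):
    """Toy check: flag triples whose object is empty or whitespace."""
    _, _, obj = triple
    return "object is empty" if not obj.strip() else None


def check_subject_is_iri(triple):
    """Toy check: flag subjects that are not HTTP(S) IRIs."""
    subj, _, _ = triple
    if not subj.startswith(("http://", "https://")):
        return "subject is not an HTTP(S) IRI"
    return None


# Assumed registry of checks; real LDQ tools would use far richer metrics.
CHECKS = {
    "empty_object": check_empty_object,
    "subject_is_iri": check_subject_is_iri,
}


def assess(triples):
    """Run every check on every triple and collect the findings."""
    results = []
    for triple in triples:
        for name, check in CHECKS.items():
            message = check(triple)
            if message is not None:
                results.append(QualityResult(triple, name, message))
    return results


def build_report(triples, results):
    """Aggregate the attached findings into a small summary report."""
    flagged = {r.triple for r in results}
    return {
        "triples_checked": len(triples),
        "triples_flagged": len(flagged),
        "issues_by_check": dict(Counter(r.check for r in results)),
    }


LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
triples = [
    ("http://example.org/a", LABEL, "Alice"),
    ("http://example.org/b", LABEL, "  "),
    ("urn:x-local:c", LABEL, ""),
]
results = assess(triples)
report = build_report(triples, results)
print(report)
```

Because each finding stays attached to the triple that caused it, the same results could in principle be passed on to expert or crowd reviewers for confirmation, while the aggregated report can be regenerated whenever the data changes.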

Other open questions:

  • Can we come up with an LDQ benchmark which would allow the comparison of existing (and future) LDQ approaches?
  • How can ad-hoc data issues be formalized?