Reliable Data and Knowledge Graphs
What makes data reliable? A recent article1 outlines five properties. Reliable data is:
- “clean” – formatted uniformly, conforming to certain rules/schema, etc.
- grounded in shared meaning spaces – names are unambiguous
- supplied with context – where it comes from, how it was sourced
- accessible in a standardized format – easily imported
- maintained – kept up-to-date
In “What is a Knowledge Graph?”, a vocabulary is presented that maps to the first four properties of reliable data above:
- A graph is a set of assertions expressed between entities. Meaning is encoded via graph structure. Cleanliness comes in part from limiting the set of relations (edges) under consideration for an analysis.
- An unambiguous graph has these relations and entities unambiguously identified – grounded in ontologies as shared meaning spaces.
- A bare statement graph is one that does not also encode provenance, especially justification and attribution, in the graph.
- Use of Semantic Web standards like RDF ensures open serialization formats and query tooling.
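The vocabulary above can be sketched in plain Python. This is a minimal illustration, not a real RDF library: the namespaces, relations, and entity names are all hypothetical. It shows a graph as a set of (subject, relation, object) assertions, with entities grounded in full IRIs and cleanliness enforced by limiting the relations an analysis may consider.

```python
# Grounding in a shared meaning space: full IRIs, not ambiguous labels.
# (Namespaces and names below are hypothetical, for illustration only.)
EX = "https://example.org/id/"
REL = "https://example.org/rel/"

# Cleanliness, in part: an explicitly limited set of relations (edges).
ALLOWED_RELATIONS = {REL + "employs", REL + "locatedIn"}

# A graph is a set of assertions expressed between entities.
graph = {
    (EX + "AcmeCorp", REL + "employs",   EX + "AdaLovelace"),
    (EX + "AcmeCorp", REL + "locatedIn", EX + "London"),
}

def assertions_about(graph, relation):
    """Restrict an analysis to one unambiguously identified relation."""
    assert relation in ALLOWED_RELATIONS, "relation outside the limited set"
    return {t for t in graph if t[1] == relation}
```

In a Semantic Web setting the set literal would be an RDF graph and the IRIs would resolve to ontology terms; the structure is the same.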
A knowledge graph is then presented as an unambiguous graph, with a limited set of relations, that encodes provenance – and is thus not simply a “bare statement” graph.
The way that a knowledge graph encodes provenance is important for the fifth property of reliable data above – maintainability. One part of maintenance is about adding new facts, and another part is about managing changes to models to keep analyses reproducible.
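One way to make a graph more than a "bare statement" graph is to attach a provenance record to every assertion. The sketch below (field names are my own, hypothetical choices) pairs each triple with attribution and justification, so adding new facts stays a purely additive operation:

```python
from collections import namedtuple

# Hypothetical provenance record: attribution (source, date) + justification.
Provenance = namedtuple("Provenance", ["source", "asserted_on", "justification"])
Assertion = namedtuple("Assertion", ["subject", "relation", "obj", "prov"])

a = Assertion(
    subject="ex:AcmeCorp",
    relation="rel:employs",
    obj="ex:AdaLovelace",
    prov=Provenance(
        source="https://example.org/hr-export",  # attribution (hypothetical)
        asserted_on="2024-01-15",
        justification="quarterly HR extract",    # why we believe it
    ),
)

# Maintenance is additive: a newer fact is a new Assertion with its own
# provenance, rather than an in-place overwrite that loses history.
```

In RDF terms this role is played by mechanisms like named graphs or RDF-star annotations; the point here is only that provenance travels with the assertion.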
Knowledge graphs are self-describing, self-revealing (aka “intuitive”). They represent facts (data) and models (metadata) in the same, machine-readable way. An ontology is a blueprint for your actuals in the graph, one that you ship with the graph. Add a schema graph (ontology) to an instance graph (data), and you get a knowledge graph.2
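Because model and data share one representation, "schema graph + instance graph = knowledge graph" can be shown literally as a set union. A minimal sketch, with hypothetical term names loosely modeled on RDFS:

```python
# Model (metadata) expressed as triples, just like the data.
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"

schema_graph = {
    ("rel:employs", RDFS_DOMAIN, "ex:Organization"),
}

instance_graph = {
    ("ex:AcmeCorp",    RDF_TYPE,      "ex:Organization"),
    ("ex:AdaLovelace", RDF_TYPE,      "ex:Person"),
    ("ex:AcmeCorp",    "rel:employs", "ex:AdaLovelace"),
}

# Schema graph + instance graph = knowledge graph, one uniform structure.
knowledge_graph = schema_graph | instance_graph

def check(kg, triple):
    """Validate an instance assertion against the schema shipped in the graph."""
    s, p, o = triple
    domains = {d for (pp, r, d) in kg if pp == p and r == RDFS_DOMAIN}
    return all((s, RDF_TYPE, d) in kg for d in domains)
```

The same query machinery that answers "who does AcmeCorp employ?" can answer "what may `rel:employs` connect?", because the blueprint lives in the graph itself.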
If you ensure that even tiny graphs3 include not only their instance data but also (linked identifiers for) their associated provenance, then maintenance becomes a simple additive process, perhaps with periodic log-structured merges for compaction.
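The additive-maintenance idea can be sketched as an append-only log of assertions, each carrying a provenance identifier, with a periodic compaction step loosely analogous to a log-structured merge (all names here are hypothetical):

```python
log = []  # append-only: (triple, provenance_id, op)

def assert_fact(triple, prov):
    """Adding a fact never mutates old entries; it only appends."""
    log.append((triple, prov, "add"))

def retract_fact(triple, prov):
    """Retraction is also additive: an append that supersedes earlier entries."""
    log.append((triple, prov, "retract"))

def compact(log):
    """Replay the log; the last operation recorded for each triple wins."""
    state = {}
    for triple, prov, op in log:
        state[triple] = (prov, op)
    return {t for t, (prov, op) in state.items() if op == "add"}

assert_fact(("ex:A", "rel:r", "ex:B"), "prov:1")
retract_fact(("ex:A", "rel:r", "ex:B"), "prov:2")
assert_fact(("ex:A", "rel:r", "ex:C"), "prov:3")
# compact(log) now yields only the surviving assertion (A r C).
```

Because every log entry keeps its provenance identifier, compaction discards superseded assertions without discarding the audit trail they point to.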
Assertions become not-current quickly. By shipping provenance with the data and not merely alongside it, you can ensure data reliability in-process at the query level.
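Checking currency "in-process at the query level" might look like the following sketch, where the provenance shipped with each fact is consulted at query time (the record fields and freshness threshold are hypothetical):

```python
from datetime import date

# Each assertion carries its provenance with it, not alongside it.
assertions = [
    {"triple": ("ex:AcmeCorp", "rel:employs", "ex:AdaLovelace"),
     "asserted_on": date(2024, 1, 15), "source": "hr-export"},
    {"triple": ("ex:AcmeCorp", "rel:employs", "ex:GraceHopper"),
     "asserted_on": date(2019, 6, 1), "source": "old-crawl"},
]

def current(assertions, as_of, max_age_days=365):
    """Filter, at query time, to assertions whose provenance is fresh enough."""
    return [a["triple"] for a in assertions
            if (as_of - a["asserted_on"]).days <= max_age_days]
```

Stale assertions are not deleted; they are simply excluded by the query's own reliability policy, which can differ per analysis.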