Reliable Data and Knowledge Graphs
What makes data reliable? A recent article1 outlines five properties. Reliable data is:
- “clean” – formatted uniformly, conforming to certain rules/schema, etc.
- grounded in shared meaning spaces – names are unambiguous
- supplied with context – where it comes from, how it was sourced
- accessible in a standardized format – easily imported
- maintained – kept up-to-date
In “What is a Knowledge Graph?”, a vocabulary is presented that maps to the first four properties of reliable data above:
- A graph is a set of assertions expressed between entities. Meaning is encoded via graph structure. Cleanliness comes in part from limiting the set of relations (edges) under consideration for an analysis.
- An unambiguous graph has these relations and entities unambiguously identified – grounded in ontologies as shared meaning spaces.
- A bare statement graph is one that does not also encode provenance, especially justification and attribution, in the graph.
- Use of Semantic Web standards like RDF ensures open serialization formats and query tooling.
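The vocabulary above can be sketched in plain Python. This is a minimal illustration, not a real RDF library: the namespaces, relations, and entity names are all hypothetical. It shows a graph as a set of (subject, relation, object) assertions, with entities grounded in full IRIs and cleanliness enforced by limiting the relations an analysis may consider.

```python
# Grounding in a shared meaning space: full IRIs, not ambiguous labels.
# (Namespaces and names below are hypothetical, for illustration only.)
EX = "https://example.org/id/"
REL = "https://example.org/rel/"

# Cleanliness, in part: an explicitly limited set of relations (edges).
ALLOWED_RELATIONS = {REL + "employs", REL + "locatedIn"}

# A graph is a set of assertions expressed between entities.
graph = {
    (EX + "AcmeCorp", REL + "employs",   EX + "AdaLovelace"),
    (EX + "AcmeCorp", REL + "locatedIn", EX + "London"),
}

def assertions_about(graph, relation):
    """Restrict an analysis to one unambiguously identified relation."""
    assert relation in ALLOWED_RELATIONS, "relation outside the limited set"
    return {t for t in graph if t[1] == relation}
```

In a Semantic Web setting the set literal would be an RDF graph and the IRIs would resolve to ontology terms; the structure is the same.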
A knowledge graph is then presented as an unambiguous graph, with a limited set of relations, that encodes provenance – and is thus not simply a “bare statement” graph.
The way that a knowledge graph encodes provenance is important for the fifth property of reliable data above – maintainability. One part of maintenance is about adding new facts, and another part is about managing changes to models to keep analyses reproducible.
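One way to make a graph more than a "bare statement" graph is to attach a provenance record to every assertion. The sketch below (field names are my own, hypothetical choices) pairs each triple with attribution and justification, so adding new facts stays a purely additive operation:

```python
from collections import namedtuple

# Hypothetical provenance record: attribution (source, date) + justification.
Provenance = namedtuple("Provenance", ["source", "asserted_on", "justification"])
Assertion = namedtuple("Assertion", ["subject", "relation", "obj", "prov"])

a = Assertion(
    subject="ex:AcmeCorp",
    relation="rel:employs",
    obj="ex:AdaLovelace",
    prov=Provenance(
        source="https://example.org/hr-export",  # attribution (hypothetical)
        asserted_on="2024-01-15",
        justification="quarterly HR extract",    # why we believe it
    ),
)

# Maintenance is additive: a newer fact is a new Assertion with its own
# provenance, rather than an in-place overwrite that loses history.
```

In RDF terms this role is played by mechanisms like named graphs or RDF-star annotations; the point here is only that provenance travels with the assertion.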
Knowledge graphs are self-describing, self-revealing (aka “intuitive”). They represent facts (data) and models (metadata) in the same, machine-readable way. An ontology is a blueprint for your actuals in the graph, one that you ship with the graph. Add a schema graph (ontology) to an instance graph (data), and you get a knowledge graph.2
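Because model and data share one representation, "schema graph + instance graph = knowledge graph" can be shown literally as a set union. A minimal sketch, with hypothetical term names loosely modeled on RDFS:

```python
# Model (metadata) expressed as triples, just like the data.
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"

schema_graph = {
    ("rel:employs", RDFS_DOMAIN, "ex:Organization"),
}

instance_graph = {
    ("ex:AcmeCorp",    RDF_TYPE,      "ex:Organization"),
    ("ex:AdaLovelace", RDF_TYPE,      "ex:Person"),
    ("ex:AcmeCorp",    "rel:employs", "ex:AdaLovelace"),
}

# Schema graph + instance graph = knowledge graph, one uniform structure.
knowledge_graph = schema_graph | instance_graph

def check(kg, triple):
    """Validate an instance assertion against the schema shipped in the graph."""
    s, p, o = triple
    domains = {d for (pp, r, d) in kg if pp == p and r == RDFS_DOMAIN}
    return all((s, RDF_TYPE, d) in kg for d in domains)
```

The same query machinery that answers "who does AcmeCorp employ?" can answer "what may `rel:employs` connect?", because the blueprint lives in the graph itself.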
If you ensure that even tiny graphs3 include not only their instance data but also (linked identifiers for) their associated provenance, then maintenance becomes a simple additive process, perhaps with periodic log-structured merges for compaction.
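The additive-maintenance idea can be sketched as an append-only log of assertions, each carrying a provenance identifier, with a periodic compaction step loosely analogous to a log-structured merge (all names here are hypothetical):

```python
log = []  # append-only: (triple, provenance_id, op)

def assert_fact(triple, prov):
    """Adding a fact never mutates old entries; it only appends."""
    log.append((triple, prov, "add"))

def retract_fact(triple, prov):
    """Retraction is also additive: an append that supersedes earlier entries."""
    log.append((triple, prov, "retract"))

def compact(log):
    """Replay the log; the last operation recorded for each triple wins."""
    state = {}
    for triple, prov, op in log:
        state[triple] = (prov, op)
    return {t for t, (prov, op) in state.items() if op == "add"}

assert_fact(("ex:A", "rel:r", "ex:B"), "prov:1")
retract_fact(("ex:A", "rel:r", "ex:B"), "prov:2")
assert_fact(("ex:A", "rel:r", "ex:C"), "prov:3")
# compact(log) now yields only the surviving assertion (A r C).
```

Because every log entry keeps its provenance identifier, compaction discards superseded assertions without discarding the audit trail they point to.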
Assertions become not-current quickly. By shipping provenance with the data and not merely alongside it, you can ensure data reliability in-process at the query level.
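Checking currency "in-process at the query level" might look like the following sketch, where the provenance shipped with each fact is consulted at query time (the record fields and freshness threshold are hypothetical):

```python
from datetime import date

# Each assertion carries its provenance with it, not alongside it.
assertions = [
    {"triple": ("ex:AcmeCorp", "rel:employs", "ex:AdaLovelace"),
     "asserted_on": date(2024, 1, 15), "source": "hr-export"},
    {"triple": ("ex:AcmeCorp", "rel:employs", "ex:GraceHopper"),
     "asserted_on": date(2019, 6, 1), "source": "old-crawl"},
]

def current(assertions, as_of, max_age_days=365):
    """Filter, at query time, to assertions whose provenance is fresh enough."""
    return [a["triple"] for a in assertions
            if (as_of - a["asserted_on"]).days <= max_age_days]
```

Stale assertions are not deleted; they are simply excluded by the query's own reliability policy, which can differ per analysis.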