Data Structures as Snapshots of Process
Imagine a data system modeled as three parts: an interface, a processor, and a repository. The repository “contains” information. The processor receives symbol streams to alter or retrieve information from the repository, and the processor outputs symbol streams. The interface is the medium, the opaque surface, of symbol-stream exchange between you and the processor.[1]
What information is “in” the repository? If data depends on a continuous variable, like time, the things actually stored may be “breakpoints”, so that extracting data combines lookup with computation. You may get a value for temperature at the precise time of an observation, but the value is not stored – it is interpolated.
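To make lookup-plus-compute concrete, here is a minimal sketch. It assumes linear interpolation between stored breakpoints; the breakpoint values and the name temperature_at are hypothetical, chosen only for illustration, and a real system might interpolate differently.

```python
from bisect import bisect_right

# Hypothetical breakpoints: (time in seconds, temperature in degrees C), sorted by time.
BREAKPOINTS = [(0.0, 10.0), (60.0, 12.0), (120.0, 11.0)]


def temperature_at(t, breakpoints=BREAKPOINTS):
    """Return a temperature for time t: look up neighbors, then interpolate."""
    times = [bp[0] for bp in breakpoints]
    if not times[0] <= t <= times[-1]:
        raise ValueError("t is outside the stored range")

    # Lookup: index of the first breakpoint strictly later than t.
    i = bisect_right(times, t)
    if i == len(times):
        # t is exactly the last breakpoint, so the stored value is the answer.
        return breakpoints[-1][1]

    # Compute: linear interpolation between the two surrounding breakpoints.
    (t0, v0), (t1, v1) = breakpoints[i - 1], breakpoints[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
```

Asking for temperature_at(30.0) returns 11.0, a value that appears nowhere in the stored breakpoints.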
What is the difference between fetching a stored data structure and dynamically generating a record with the same characteristics? The distinction between repository and processor is not so clear. Data structures are snapshots of process; a repository is, in a way, a cache.
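One way to see the repository as a cache is the sketch below. The class CachingRepository and the function build_record are hypothetical names, not part of any particular system; the point is only that a caller of fetch cannot tell whether a record was stored or generated on demand.

```python
class CachingRepository:
    """A repository that stores snapshots of what a process would produce."""

    def __init__(self, generate):
        self._generate = generate  # the "processor": builds a record from a key
        self._store = {}           # the "repository": snapshots of earlier results
        self.generated = 0         # how many times the process actually ran

    def fetch(self, key):
        # Generate only on a miss; afterwards the stored snapshot is returned.
        if key not in self._store:
            self._store[key] = self._generate(key)
            self.generated += 1
        return self._store[key]


def build_record(n):
    # Hypothetical process: derive a small record from its key.
    return {"n": n, "square": n * n}


repo = CachingRepository(build_record)
first = repo.fetch(4)   # generated dynamically
second = repo.fetch(4)  # fetched from the stored snapshot
```

Both calls return the same record, but the process ran only once. The stored copy trades flexibility for lower latency on a request that was predictable.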
Just as reworking a product costs more at the end of an assembly line than at the beginning, adapting a data product for a second application costs more the more it has been structured for a first.
In scientific research, published data products are often structured for “manuscript applications,” so repurposing the data requires reverse engineering PDF-embedded figures/tables or spreadsheet/notebook-embedded formulae, annotations, and charts.
Data structures that embody snapshots of processing are valuable – they reduce latency for predictable data requests for known applications. To maximize FAIR[2] characteristics of your data, though, look upstream.