Donny Winston

For each paper with an author from my institution, which of that paper's authors are from my institution?

2024-04-18T10:13:15-04:00

Asked on the OpenAlex Community group:

Is there a way to find out which authors on a paper are from my institution? I downloaded a list of DOI’s from the website, and thought naively that I could look up the index of my institution (by ‘author_institution_ids’ or by ‘author_institution_names’) and then match that index to the list of authors. But soon found out that those indices don’t match because any author can have list multiple affiliations. Any ideas?

— https://groups.google.com/g/openalex-community/c/T0OjBFXSIUg

I recently learned about https://semopenalex.org, which maps OpenAlex data to RDF and thereby facilitates graph-oriented queries using SPARQL.

Wanting to take the questioner’s institution as my constraining example, I determined via a search of their name that “Vrije Universiteit Amsterdam” was their institution.

Via the SemOpenAlex ontology explorer, I saw that https://semopenalex.org/ontology/Institution is the URI designating the class of institutions.

Via their SPARQL interface, I asked for any institution, so that I could see how names were expressed.

PREFIX soa: <https://semopenalex.org/ontology/>

SELECT ?inst WHERE {
  ?inst a soa:Institution .
} LIMIT 1

Got one, . What triples (subject, predicate, object) is this the subject of?

SELECT ?p ?o WHERE {
   ?p ?o .
}

Scrolling through a results table of 30 rows, I see “University of Surrey” as an object for the predicate foaf:name, i.e. the name term from the friend-of-a-friend (FOAF) vocabulary. Okay, so that’s the predicate SemOpenAlex uses to connect an institution to a name. Now, let’s find “Vrije Universiteit Amsterdam”. Because there may not be an exact match, I’ll ask for institutions with names containing “Amsterdam”:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX soa: <https://semopenalex.org/ontology/>

SELECT ?inst ?instName WHERE {
  ?inst a soa:Institution .
  ?inst foaf:name ?instName .
  FILTER(contains(?instName, "Amsterdam"))
}

Okay, I see that inst has instName “Vrije Universiteit Amsterdam”, and none of the other results appear to be a duplicate of this. I’ve found the institution’s URI. I’ll confirm that I get a single result from the following:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX soa: <https://semopenalex.org/ontology/>

SELECT ?inst ?instName WHERE {
  ?inst a soa:Institution .
  ?inst foaf:name ?instName .
  FILTER(?instName = "Vrije Universiteit Amsterdam")
}

I do. Great. Going back to the ontology explorer, I can see how the model connects institutions, authors, and works:

screenshot of model diagram

I see that SemOpenAlex records a Work as having any number of Authors as creators (via dcterms:creator). I also note that an Author is recorded as being a member of (org:memberOf) any number of Institutions.

So here’s what I end up with:

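PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX org: <http://www.w3.org/ns/org#>
PREFIX soa: <https://semopenalex.org/ontology/>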

SELECT ?work (GROUP_CONCAT(?author) as ?authors) WHERE {
  ?inst a soa:Institution .
  ?inst foaf:name ?instName .
  FILTER(?instName = "Vrije Universiteit Amsterdam")
  ?author org:memberOf ?inst .
  ?work dcterms:creator ?author .
}
GROUP BY ?work

This query retrieves works authored by at least one author that is also a member of the institution, and lists all member-of-the-institution authors for each work. As of 2024-04-18, this is 182,498 works, and the query executes in under 5 s.
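To make the join in that query concrete, here is a minimal Python sketch of the same logic over a toy in-memory dataset. All work, author, and institution identifiers are made up for illustration; only the two relationships (org:memberOf and dcterms:creator) come from the query above.

```python
from collections import defaultdict

# Toy stand-ins for triples; every identifier below is hypothetical.
member_of = {  # author -> institutions (org:memberOf)
    "author/A1": {"inst/VU"},
    "author/A2": {"inst/VU", "inst/Other"},
    "author/A3": {"inst/Other"},
}
creators = {  # work -> authors (dcterms:creator)
    "work/W1": ["author/A1", "author/A3"],
    "work/W2": ["author/A3"],
    "work/W3": ["author/A1", "author/A2"],
}


def institution_authors_by_work(inst):
    """Map each work to those of its authors who are members of `inst`."""
    result = defaultdict(list)
    for work, authors in creators.items():
        for author in authors:
            if inst in member_of.get(author, set()):
                result[work].append(author)
    return dict(result)


vu_authors = institution_authors_by_work("inst/VU")
```

Works with no author from the institution (here, work/W2) drop out entirely, which mirrors how the SPARQL pattern only binds ?work when a matching ?author exists.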

Feeding the Scholarly Need

2024-04-08T10:49:37-04:00

This post marks the re-introduction of a feed for each tag on this blog.

I want this so that I can post without worrying about contributing to “pollution” of the scholarly record. I can accomplish this by tagging posts as #scholarly when I want them to be e.g. fetched by The Rogue Scholar for DOI minting and for subsequent linking to my ORCiD profile.

This post should hopefully be my last act of such pollution. 🙂

For this Hugo-based blog, I accomplished this by creating a layouts/tags/term.atom.xml file with this content:

{{ $taxo := "tags" }}
{{- $pages := .RegularPages  -}}
{{- with .Site.Config.Services.RSS.Limit -}}
  {{- if ge . 1 -}}
    {{- $pages = $pages | first . -}}
  {{- end -}}
{{- end -}}
{{ print "" | safeHTML }}
 xmlns="http://www.w3.org/2005/Atom" xmlns:webfeeds="http://webfeeds.org/rss/1.0">
  
    {{ .Site.Author.name }}
    {{ .Site.Author.orcid }}
  
   uri="https://gohugo.io">Hugo {{ .Site.Hugo.Version }}
  {{ if .Site.Params.feedUUID }}urn:uuid:{{.Site.Params.feedUUID }}{{ else }}{{ .Permalink }}{{ end }}
  {{ with .OutputFormats.Get "atom" }}
  {{ printf ` rel="self" type="%s" href="%s" hreflang="%s"/>` .MediaType.Type .Permalink $.Site.LanguageCode | safeHTML }}
  {{ end }}
  {{ range .AlternativeOutputFormats }}
  {{ printf ` rel="alternate" type="%s" href="%s" hreflang="%s"/>` .MediaType.Type .Permalink $.Site.LanguageCode | safeHTML }}
  {{ end }}
  {{ with .Site.Params.icon }}{{ . | absURL }}{{ end }}
  {{ with .Site.Params.logo }}{{ . | absURL }}{{ end }}
  {{ with .Site.Copyright }}{{ replace . "{year}" now.Year }}{{ end }}
  {{ with .Site.Params.Description }}{{ .  }}{{ end }}
  {{ .Site.Title }} - posts tagged "{{ .Data.Term }}"
  {{ now.Format .Site.Params.dateFormatAtomFeed | safeHTML }}
  {{ with .Site.Params.icon96 }}{{ . | absURL }}{{ end }}
  {{ range $pages }}
  
    
    {{ if .Params.author }}
      {{ .Params.author.name }}
      {{ .Params.author.orcid }}
    {{ else }}
      {{ .Site.Author.name }}
      {{ .Site.Author.orcid }}
    {{ end }}
    
    tag:{{ $u := urls.Parse .Permalink }}{{ $u.Hostname }},{{ .Date.Format .Site.Params.dateFormatTag }}:{{ replace $u.Path "#" "_" }}
     rel="alternate" href="{{ .Permalink }}"/>
    {{ .Title }}
    {{ .Date.Format .Site.Params.dateFormatAtomFeed | safeHTML }}
    {{ .Lastmod.Format .Site.Params.dateFormatAtomFeed | safeHTML }}
    {{ with .Description }} type="text">{{ . }}{{ end }}
     type="html" xml:base="{{ .Site.BaseURL }}" xml:lang="en">
      {{ printf "" .Content | safeHTML }}
    
  
  {{ end }}
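The per-entry identifier in the template above is a tag: URI assembled from the permalink's hostname, the publication year, and the permalink's path. As a rough illustration of that construction outside Hugo, here is a small Python sketch; the function name and the example permalink are hypothetical.

```python
from datetime import datetime
from urllib.parse import urlparse


def tag_uri(permalink: str, published: datetime) -> str:
    """Build a tag: URI from a permalink's hostname, the year, and its path."""
    parts = urlparse(permalink)
    # Mirror the template: replace any "#" in the path with "_".
    path = parts.path.replace("#", "_")
    return f"tag:{parts.hostname},{published:%Y}:{path}"


# Hypothetical permalink and date, for illustration only.
example = tag_uri("https://example.org/posts/my-post/", datetime(2024, 4, 8))
```

Using only the year matches the dateFormatTag value of "2006" in the site configuration.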

And here is my revised hugo.toml site configuration:

baseURL = ""
copyright = "Copyright © 2020-{year} Donny Winston. All posts licenced under ."
languageCode = "en-us"
rssLimit = 100
title = "Donny Winston"
timeZone = "America/New_York"
theme = "noteworthy"
enableRobotsTXT = true
paginate = 4
summaryLength = 10

[taxonomies]
  tag = "tags"

[author]
	name = "Donny Winston"
    orcid = "https://orcid.org/0000-0002-8424-0604"

# Set to false to disallow raw HTML in markdown files
[markup.goldmark.renderer]
    unsafe = true

[mediaTypes]
  [mediaTypes."application/atom+xml"] # Thank you  !
    suffixes = ["xml"]

# Menu links along the sidebar navigation.
[[menu.main]]
	identifier = "about"
	name = "About"
	url = "/about/"
	weight = 1 # Weight is an integer used to sort the menu items. The sorting goes from smallest to largest numbers. If weight is not defined for each menu entry, Hugo will sort the entries alphabetically.

[[menu.main]]
	identifier = "consulting"
	name = "Consulting"
	url = "/consulting/"
	weight = 2

#[[menu.main]]
#	identifier = "tags"
#	name = "Tags"
#	url = "/tags/"
#	weight = 3

[[menu.main]]
	name = "Archives"
	identifier = "archives"
	url = "/archives/"
	weight = 4

[[menu.main]]
	identifier = "feed"
	name = "Feed"
	url = "/feed.xml"
	weight = 5

[outputFormats]
  [outputFormats.ATOM]
    mediaType = "application/atom+xml"
    baseName  = "feed"

[outputs]
  home = ["ATOM", "HTML"]
  page = ["HTML"]
  section = ["HTML"]
  taxonomy = ["HTML"]
  term = ["ATOM", "HTML"]

[params]
    favicon = "https://files.polyneme.xyz/polyneme-logo-sq-AdH548JPkxW0qf3M5NVTVLt5qdVKFN28AKKvS35A2ndmDMQ0baH90H5APJvIITO2UkFht8rLzZGQTxob8DCqG3KqnsEOczShPKoT.png"
	math = true
	# Blog description at the top of the homepage. Supports markdown.
	description = "Made as simple as possible, but not simpler."

    # Set enableKofi to true to enable the Ko-fi support button. Add your Ko-fi ID to link to your account.
    enableKofi = false
    kofi = ""

	# Add links to your accounts. Remove the ones you don't want to include.
	# Main
	# email = "mailto:donny@donnywinston.com"
	linkedin = "https://www.linkedin.com/in/donnywinston/"

	# Programming
    github = "https://github.com/dwinston/"
	# stackoverflow = "#"

    # Academic
    # googlescholar = "#"
    orcid = "https://orcid.org/0000-0002-8424-0604"

    mastodon = "https://fairpoints.social/@donny"

    dateFormatAtomFeed = "2006-01-02T15:04:05-07:00"
    dateFormatTag = "2006"
    feedUUID = "390a272f-8fa2-425a-b44e-09b477223a39"
    icon = "https://files.polyneme.xyz/polyneme-logo-sq-AdH548JPkxW0qf3M5NVTVLt5qdVKFN28AKKvS35A2ndmDMQ0baH90H5APJvIITO2UkFht8rLzZGQTxob8DCqG3KqnsEOczShPKoT.png"
    icon96 = "https://files.polyneme.xyz/polyneme-logo-sq-AdH548JPkxW0qf3M5NVTVLt5qdVKFN28AKKvS35A2ndmDMQ0baH90H5APJvIITO2UkFht8rLzZGQTxob8DCqG3KqnsEOczShPKoT.png"
    logo = "https://files.polyneme.xyz/polyneme-logo-sq-AdH548JPkxW0qf3M5NVTVLt5qdVKFN28AKKvS35A2ndmDMQ0baH90H5APJvIITO2UkFht8rLzZGQTxob8DCqG3KqnsEOczShPKoT.png"
    mainSections = ["posts"]

# Privacy configurations: https://gohugo.io/about/hugo-and-gdpr/
[privacy]
  [privacy.disqus]
    disable = true
  [privacy.googleAnalytics]
    disable = true
  [privacy.instagram]
    disable = true
  [privacy.twitter]
    disable = true
  [privacy.vimeo]
    disable = true
  [privacy.youtube]
    disable = true

Furthermore, I am testing references introspection ¹ with this post.

Fenner, Martin. “Starting to Include References in DOI Metadata for Blog Posts.” Front Matter, June 16, 2023. https://doi.org/10.53731/6mkrk-dzh02. ↩︎

Community vis-à-vis Forum

2024-03-29T10:43:02-04:00

I think of a community as a state (-ity) of having a purpose in mind (mmun->mean) together (co-), not as an endurable space. I think of a forum as an endurable space, as a doored (from the Latin fores, i.e. door) space of focus (from the French foyer).

How many makes a community? I don’t know. I won’t pretend the Hebrew minyan actually shares etymology with community, but it does helpfully suggest a quorum of ten. How many is too many? Because a community is a purpose-coherent social state, Dunbar’s number suggests a “knee of the curve” of roughly 150.

It seems hard for a set of people to sustain a state of having a purpose in mind together. If the purpose can be fulfilled, then if it is, that set of people can attempt to self-herd themselves to an adjacent or follow-on purpose, and thereby “evolve” “the” community into a different community, a different state of having a purpose in mind together.

A forum can serve a community. If that community evolves, i.e. shifts coherence of purpose, the forum may be sustained if both the pre-shift and post-shift purposes benefit from similar-enough kinds of focus.

It seems that a forum sometimes is made to endure despite the dissolution of the community that motivated its origination as a shelter for focus. I can’t help but think of the chartered corporation as typically being such a forum. The first such charters were granted to incorporate a set of people to sustain, via a shelter for focus, a state of having a purpose in mind together of constructing a particular railroad.

Don't archive your assets – frontier them

2023-04-28T09:07:50-04:00

Don’t archive your assets¹ – frontier them. Research is a living process.² Even when a research project is “finished”, is it really?

ResearchEquals³ clearly articulates the idea of asset-story⁴ continuations⁵. Choose your own adventure – what downstream assets⁶ may use the current “leaf node”⁷?

Your digital research assets are outputs in some stories and inputs in others. Some of these stories are yet to be told — even stories where an asset is an output. Imagine the serendipity of a future research project producing a digital asset with the same SHA256 hash⁸ as a previously registered asset.

With FAIR⁹ stewardship, every digital research object is a frontier asset, with outbound PID-graph¹⁰ edges waiting to be claimed¹¹.

Modeling a Graphical Expression of Materials Data (GEMD)

2023-02-02T10:45:00-05:00

model-memo

The things of concern¹ are materials, processes, measurements, and ingredients. Materials are output by processes and are subject to measurements. A process may take materials as ingredients.

Things are recorded in three ways: as templates, as specs, and as runs. You can record a template for a thing – what might be the case. You can also record a spec for a thing – what is intended. Finally, you can record a run for a thing – what is, or was.

Each thing can have attributes from three categories: properties, parameters, and conditions. A property is something measured or calculated. A parameter is something set. A condition describes an aspect of the thing’s environment.

model-diagram

erDiagram
    Process ||--|| Material : outputs
    Material ||--o{ Measurement : subjectTo
    Ingredient }o--|| Material : from
    Ingredient }o--|| Process : inputTo
    Record }o--|| RecordType : has
    RecordType }o--o| Template : mayBe
    RecordType }o--o| Spec : mayBe
    RecordType }o--o| Run : mayBe
    Record }o--|| Thing : about
    Thing }o--o| Process : mayBe
    Thing }o--o| Material : mayBe
    Thing }o--o| Measurement : mayBe
    Thing }o--o| Ingredient : mayBe
    Thing }o--o{ Attribute : has
    Attribute }o--o| Property : mayBe
    Attribute }o--o| Parameter : mayBe
    Attribute }o--o| Condition : mayBe

model-formalism

[
  {
    "@base": "terminusdb:///data/",
    "@schema": "terminusdb:///schema#",
    "@type": "@context"
  },
  {
    "@id": "Thing",
    "@type": "Class",  
    "@abstract": [],
    "has": {
      "@type": "Set",
      "@class": "Attribute"
    }
  },
  {
    "@id": "Process",
    "@type": "Class",  
    "@inherits": ["Thing"],
    "outputs": "Material",
    "hasInput": {
        "@type": "Set",
        "@class": "Ingredient"
    }  
  },
  {
    "@id": "Material",
    "@type": "Class",  
    "@inherits": ["Thing"],
    "subjectTo": {
        "@type": "Set",
        "@class":  "Measurement"
    }
  },
  {
    "@id": "Measurement",
    "@type": "Class",  
    "@inherits": ["Thing"]
  },    
  {
    "@id": "Ingredient",
    "@type": "Class",  
    "@inherits": ["Thing"],
    "from": "Material"  
  },
  {
    "@id": "Attribute",
    "@type": "Class",  
    "@abstract": []
  },
  {
    "@id": "Property",
    "@type": "Class",  
    "@inherits": ["Attribute"]
  }, 
  {
    "@id": "Parameter",
    "@type": "Class",  
    "@inherits": ["Attribute"]
  }, 
  {
    "@id": "Condition",
    "@type": "Class",  
    "@inherits": ["Attribute"]
  },     
  {
    "@id": "Record",
    "@type": "Class",
    "@abstract": [],
    "about": "Thing"  
  },
  {
    "@id": "Template",
    "@type": "Class",
    "@inherits": ["Record"] 
  },
  {
    "@id": "Spec",
    "@type": "Class",
    "@inherits": ["Record"]
  },
  {
    "@id": "Run",
    "@type": "Class",
    "@inherits": ["Record"]
  }        
]
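For readers more at home in Python than in a schema language, the "thing" half of the formalism above can be mirrored with dataclasses. This is only a sketch: the snake_case field names and the `name` field on Attribute are my own illustrative choices, the record classes (Template, Spec, Run) are omitted, and nothing here enforces that Thing and Attribute are abstract.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attribute:
    name: str  # illustrative field; not part of the schema above


@dataclass
class Property(Attribute):
    pass


@dataclass
class Parameter(Attribute):
    pass


@dataclass
class Condition(Attribute):
    pass


@dataclass
class Thing:
    has: List[Attribute] = field(default_factory=list)


@dataclass
class Measurement(Thing):
    pass


@dataclass
class Material(Thing):
    subject_to: List[Measurement] = field(default_factory=list)


@dataclass
class Ingredient(Thing):
    from_material: Optional[Material] = None


@dataclass
class Process(Thing):
    outputs: Optional[Material] = None
    has_input: List[Ingredient] = field(default_factory=list)


# A made-up example: a mixing process takes flour and outputs dough.
flour = Material()
dough = Material(subject_to=[Measurement(has=[Property("hydration")])])
mixing = Process(
    has=[Parameter("mixing time")],
    outputs=dough,
    has_input=[Ingredient(from_material=flour)],
)
```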

https://citrineinformatics.github.io/gemd-docs/ ↩︎

A Model-Expression Workflow for Connected Content

2023-01-11T05:54:28-05:00

For each layer in the structured-content stack,¹ from least to most volatile (i.e. domain modeling ⟶ content design ⟶ interface design), draft successive model expressions,² from most to least ambiguous (as many expressions as needed to move confidently to the next stack layer).

Structured-content-stack breakdown:

Domain model (object types and relationships)
Content
- content model (content types and attributes)
- content spec (labels and data types)
- content population
Representation
- content-type resource templates (incl. resource transclusions)
- index templates
- collection resource templates
Navigation
- global navigation
- contextual navigation

Model expressions:

model-memo (functional and non-functional requirements ³)
model-diagram
model-formalism
model-implementation

M. Atherton and C. Hane, Designing connected content: plan and model digital products for today and tomorrow. San Francisco, CA: New Riders, 2018. ↩︎
J. M. Żytkow and A. Lewenstam, “Analytical chemistry; the science of many models,” Fresenius J Anal Chem, vol. 338, no. 3, pp. 225–233, Jan. 1990, doi: 10.1007/BF00323013. ↩︎
https://en.m.wikipedia.org/wiki/Non-functional_requirement ↩︎

Key Technical Foundations for FAIRifying Data

2022-12-22T09:25:52-05:00

The key technical foundations for FAIRifying data are (1) ubiquitous persistent identifiers; (2) rich controlled metadata; and (3) granular programmatic access. These foundations provide a basis for FAIR data infrastructure.

This note is inspired by Rory Macneil’s recent interview with Sharif Islam on the FAIR Data Podcast, published on 2022-12-21. In particular, I expand on the Q&A segment starting at PT14M10S.

ubiquitous persistent identifiers (PIDs)

Identifiers must be persistent. Persistence is a matter of service, which needs organizational support. Furthermore, you are playing on hard mode here if you don’t ensure global uniqueness via HTTPS URIs.¹ Crucially, PIDs must be ubiquitous across data holdings. A single PID that addresses all study-publication data elements as an aggregate, e.g. “one DOI for the primary article’s supplemental dataset”, is insufficient.

rich controlled metadata

Metadata makes PIDs findable. Catalogs and search portals use metadata to help you find PID-associated content. Metadata elements must be controlled; that is, so-called controlled vocabularies must be used to boost (a) leverage in tagging and (b) precision and recall in retrieval, which is critical with “big” data-item collections. Furthermore, the controlled metadata needs to be rich — tracking only “minimal” required metadata elements is insufficient. Finally, you are playing on hard mode if your control mechanism does not use PIDs for knowledge organization. A system of least power here is the W3C Simple Knowledge Organization System (SKOS).

granular programmatic access

Programmatic access must be supported. A well-documented, open-standards-based protocol facilitates machine-to-machine interactions to glue things together in a way that is distinct from affordances possible with human-centered interfaces (including bespoke APIs) and portals. This programmatic access must be granular — egress costs scale with data volume delivered, so let users sub-select slices of data. You are again playing on hard mode if programmatic access, and communicating granularity of such access, does not use PIDs. The HTTP protocol and URI scheme were designed for this, as were the W3C Resource Description Framework (RDF) Recommendations.

[update 2022-12-25]: Global uniqueness for HTTPS URIs in practice is ensured either by (1) securing an HTTP URI authority component via the Domain Name System (DNS) or (2) securing a DNS-authority-delegated URI path prefix such as through w3id.org, the ARK alliance, or a DONA handle system (e.g. DOI) agent. ↩︎

Implementing the FAIR Principles Through FAIR-Enabling Artifacts and Services

2022-10-21T13:24:46-04:00

How does a Research Software Engineer (RSE) — often responsible for developing infrastructure to manage and share digital research objects (data, models, code, notebooks, workflows, etc.) — get from “Yes, FAIR sounds great, but how?” to “I better understand what the FAIR principles really mean and how I can put them into practice.”? I hope the diagram below can help.

Relating FAIR-Enabling Resource artifacts, from the FAIR Implementation Profile (FIP) ontology, to services. These services are what you deploy to implement each of the 15 FAIR Principles (from Box 2 of the seminal publication) for any actual given digital research object.

Architecture Patterns for FAIR-Enabling Services

2022-10-17T10:26:37-04:00

I’ve been trying to grok architecture patterns as presented by Percival and Gregory¹ to support domain-driven design and event-driven microservices with Python. I hope you find the diagram below useful.

Relating domain-driven design, event-driven microservices, command-query responsibility segregation (CQRS) + views, and validation (of syntax, semantics, and pragmatics)

A microservices approach seems apt for FAIR-enabling services that need to be composed, flexibly, for any given research artifact’s digital lifecycle. Consider these services:

minter² (F)³
binder² (F)³
resolver² (F)³
index⁴ (F)³
object store⁵ (A)³
transactor⁶ (I)³
harmonizer⁷ (I)³
tracker⁸ (R)³

Consider how you may want to swap one technology choice for a given FAIR-enabling service with another choice, at any time, as part of evolving FAIR infrastructure to which you connect in order to collaborate on and publish / share research artifacts.

H. J. W. Percival and R. G. Gregory, Architecture patterns with Python: enabling test-driven development, domain-driven design, and event-driven microservices, First edition. O’Reilly, 2020. (available online). ↩︎
Example: “Arklet - A basic ARK resolver.” Internet Archive, Oct. 14, 2022. Accessed: Oct. 17, 2022. [Online]. Available: https://github.com/internetarchive/arklet ↩︎ ↩︎ ↩︎
Principle addressed: F: Findable, A: Accessible, I: Interoperable, R: Reusable. ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎
Example: “Elasticsearch”. https://www.elastic.co/elasticsearch/ ↩︎
Example: “Amazon Simple Storage Service (Amazon S3)”. https://aws.amazon.com/s3/ ↩︎
Example: “Transactor | Datomic.” https://docs.datomic.com/on-prem/overview/transactor.html (accessed Oct. 17, 2022). ↩︎
Example: “DataHarmonizer.” Centre for Infectious Disease and One Health, Aug. 08, 2022. Accessed: Oct. 17, 2022. [Online]. Available: https://github.com/cidgoh/DataHarmonizer ↩︎
Example: “git - the stupid content tracker.” https://git-scm.com/docs/git (accessed Oct. 17, 2022). ↩︎

From Platforms to Microservices for FAIR Data and Analysis

2022-10-17T09:34:35-04:00

The “one platform¹ to rule them all” is unlikely to be realized for scientific research in any domain. Rather, instead of small and numerous on-premises silos for data + code + compute, we are on track to achieve large and somewhat less numerous cloud-based silos.²

What’s the alternative? A focus on microservices – so-called to emphasize that they generally do not stand alone, but rather are components of larger workflows/services – such as data-slicing and data-summary-layer services that allow you to bring big data to code+compute by effectively subsetting/streaming it.²

But how? One approach is to pursue domain-driven design that is devoid of architecture/orchestration concerns but that yields domain events, wrapped by event-driven microservices that deal with specific technology choices, wrapped finally by entrypoint interfaces driven by user/user-agent personas and their use cases.³

Entrypoints wrap services (orchestration, infrastructure, glue code) that wrap domain conceptualizations.

aka gateway, aka portal, aka virtual research environment, aka… ↩︎
N. C. Sheffield et al., “From biomedical cloud platforms to microservices: next steps in FAIR data and analysis,” Sci Data, vol. 9, no. 1, Art. no. 1, Sep. 2022, doi:10.1038/s41597-022-01619-5. ↩︎ ↩︎
H. J. W. Percival and R. G. Gregory, Architecture patterns with Python: enabling test-driven development, domain-driven design, and event-driven microservices, First edition. O’Reilly, 2020. (available online). ↩︎

FAIR-Enabling Services Redux

2022-10-03T11:18:58-04:00

I have sought to identify and enumerate core FAIR-enabling services. I attempted a five-week experiment to expand on my tentative list, but I did not complete it. The list wasn’t compelling for me.

I have been brewing an updated list of core FAIR-enabling services, which I hope to be less bombastic about. Nevertheless, I want to share with you this list and my thoughts about concretizing them through pedagogically minded demo implementations tied together by a running example that I intend to refine and deploy for a real project.

FAIR-Enabling services:

an identifier minter (e.g. arknoid)
a metadata tracker (e.g. terminusdb)
a metadata transactor (terminusdb schema)
a metadata indexer, to feed a search engine (e.g. elasticsearch)
an identifier metadata resolver (e.g. fastapi with conneg)
a data object retriever (e.g. minio s3 presigned urls)
a metadata harmonizer (e.g. cambria-esque json patch graph)

I’d like to demo a service stack via docker-compose. Some resources I am thinking to consult or leverage directly here are arknoid, TerminusDB, Elasticsearch, FastAPI, MinIO, and Project Cambria.

My intended running example is the incorporation of heliophysics concepts into the Unified Astronomy Thesaurus (UAT) and harmonizing that with the OpenAlex concept scheme so that one may evaluate semantics-fueled improvements to query understanding and thus search-result relevance via the OpenAlex dataset. The OpenAlex dataset has value as a testbed for improvements to in-production search engines such as the SAO/NASA Astrophysics Data System.

Translating Identifiers

2022-09-12T15:59:25+02:00

Don’t. Identifiers should be opaque.

If you’re given an owl:sameAs assertion from a party you trust, use that.

If you need to mint surrogates because what you’re given aren’t Globally Unique, Persistent and Resolvable Identifiers (GUPRIs)¹, either house your inheritance as local parts/suffixes in your global namespace, assert datatype properties to record the historical correspondence, or both.
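As a small sketch of the "both" option, the following Python mints a surrogate by housing the inherited identifier as a percent-encoded suffix in a global namespace, and also keeps the original value as a plain data attribute. The namespace, the legacy identifier, and the record shape are all hypothetical.

```python
from urllib.parse import quote


def mint_surrogate(namespace: str, legacy_id: str) -> dict:
    """Return a surrogate identifier plus a record of the legacy identifier."""
    return {
        # Percent-encode everything, including "/", so the suffix stays one path segment.
        "id": namespace + quote(legacy_id, safe=""),
        # Keep the historical correspondence as a plain data value.
        "legacy_id": legacy_id,
    }


# Hypothetical namespace and legacy identifier, for illustration only.
record = mint_surrogate("https://example.org/id/", "Sample 42/B")
```

Consumers should still treat the resulting identifier as opaque and read the legacy value from the recorded attribute rather than parsing it back out of the URI.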

E. Schultes et al., “FAIR Digital Twins for Data-Intensive Research,” Front. Big Data, vol. 5, p. 883341, May 2022, doi: 10.3389/fdata.2022.883341. ↩︎

Indexing Translators and Traces

2022-09-12T14:02:22+02:00

Is a metadata record “almost” expressed in the same language you used for your filter criteria?

If only the machine knew that what you supplied as “depth” in meters, expressed as a double-precision float, was convertible to the target record’s “d_cm” field, expressed in cm as a string value (to preserve significant digits).

It may be impractical for a user agent to negotiate a bridging of query and content schema in real-time, considering the multitude of candidate paths for attribute and entity alignment.

Perhaps, though, certain families of translators can be indexed for opportunistic recognition given acceptable-compute budgets.
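To picture what an index of translators might look like at its very simplest, here is a Python sketch that registers the depth-to-d_cm conversion from the example above under a (source field, target field) key. The registry structure and function names are assumptions of mine, not an existing library.

```python
from decimal import Decimal

# (source field, target field) -> translator function.
# This registry is a hypothetical structure for illustration.
TRANSLATORS = {}


def register(source, target):
    """Decorator that indexes a translator by its source and target fields."""
    def wrap(fn):
        TRANSLATORS[(source, target)] = fn
        return fn
    return wrap


@register("depth", "d_cm")
def meters_to_cm_string(depth_m: float) -> str:
    """Convert a depth in meters (float) to centimeters, as a string."""
    # Going through Decimal(str(...)) avoids binary floating-point noise
    # in the resulting string.
    return str(Decimal(str(depth_m)) * 100)


translate = TRANSLATORS.get(("depth", "d_cm"))
d_cm = translate(12.5)
```

A lookup that misses, such as ("d_cm", "depth"), simply returns None here; deciding how far to search for a chain of translators is where the compute budget comes in.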

Related to this, certain meandering paths of provenance may be routinely important in selecting resources for reuse.

Rather than repeated union-finding of indexed intersections along these paths, it may be worth indexing whole paths or segments thereof.

Subscribe to get short notes like this on Machine-Centric Science delivered to your email.

Indexing Validators

2022-09-07T10:26:18+02:00

Why would one consider indexing validators? Reuse.

The value of reuse seems obvious for structural and semantic specification, i.e. schemas and controlled vocabularies – there is opportunity to perceive two datasets as aligned. But, this alignment is only indicated, not necessarily validated.

Two datasets, A and B, are stated to both conform to schema S. If you wish to verify this, what do you do? You apply a validator V to both. Therefore, it seems that if the same validator V is already stated to have been successfully applied to both datasets A and B in order to verify conformance to S, you will have higher confidence in proceeding to analysis without applying validation yourself, or at least without insisting on comprehensive, compute-intensive validation by default.

A given schema-specification validator may also be relatively sophisticated and transform an input dataset to conform more tightly to the specification, as per Postel’s Law, making it even more valuable to reuse unambiguously identified validators as part of data-integration workflows.

Validators may be composed, e.g. conjunctively as attribute/predicate specs are in Datomic, encouraging granular reuse. However, one could not naively employ conformers-as-validators in such a scheme unless they formed a commutative semigroup (mutually rectifying robustness – Postel would approve!).
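Conjunctive composition of pure validators can be sketched in a few lines of Python. The predicates below are invented examples; because they only inspect a record and never modify it, their order does not affect the result, which is exactly the property a conformer (one that rewrites its input) would not give you for free.

```python
from typing import Callable, Iterable

Validator = Callable[[dict], bool]


def all_of(validators: Iterable[Validator]) -> Validator:
    """Compose validators conjunctively: every one must pass."""
    checks = list(validators)

    def combined(record: dict) -> bool:
        return all(check(record) for check in checks)

    return combined


# Hypothetical granular validators.
def has_depth(record: dict) -> bool:
    return "depth" in record


def depth_is_nonnegative(record: dict) -> bool:
    return record.get("depth", 0) >= 0


valid_depth = all_of([has_depth, depth_is_nonnegative])
valid_depth_reordered = all_of([depth_is_nonnegative, has_depth])
```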

Indexing Identifiers

2022-09-06T14:09:02+02:00

Indexing identifiers is key to disambiguating entities.

Wikipedia has disambiguation pages. For example, there are various concepts in mathematics and computing, various computing products, and various companies that identify with the term “Precision”. I made disambiguation pages for same-chemical-formula inorganic crystal structures for the Materials Project.

Indexing identifiers is also key to unifying entities. It’s an open world after all,¹ with a concomitant non-unique naming assumption. OpenAlex indexes various ID types for a work. For example, http://api.openalex.org/works/https://doi.org/10.7717/peerj.4375 will funnel you to the payload for https://openalex.org/W2741809807, which has an ids field with openalex, doi, mag (Microsoft Academic Graph), pmid (Pubmed), and pmcid (Pubmed Central) IDs.
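A minimal sketch of this kind of unifying index, in Python: every known identifier for an entity maps to one canonical identifier. The openalex and doi values are the ones mentioned above; the pmid value is a deliberate placeholder rather than the real PubMed ID, and the function names are my own.

```python
# (scheme, value) -> canonical identifier
ID_INDEX = {}


def register_ids(canonical: str, ids: dict) -> None:
    """Index every identifier in `ids` under the canonical identifier."""
    for scheme, value in ids.items():
        ID_INDEX[(scheme, value)] = canonical


def resolve(scheme: str, value: str):
    """Return the canonical identifier, or None if the ID is not indexed."""
    return ID_INDEX.get((scheme, value))


register_ids(
    "https://openalex.org/W2741809807",
    {
        "openalex": "https://openalex.org/W2741809807",
        "doi": "https://doi.org/10.7717/peerj.4375",
        "pmid": "PMID-PLACEHOLDER",  # placeholder, not the real PubMed ID
    },
)

canonical = resolve("doi", "https://doi.org/10.7717/peerj.4375")
```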

Finally, indexing identifiers is key to registering and resolving metadata, i.e. relationships between identifiers. Registries include Linked Open Vocabularies (LOV), the Ontology Lookup Service (OLS), the Zazuko Prefix Server, and the OBO Foundry. Resolvers include Identifiers.org and Name-To-Thing (n2t). There is even at least one metaregistry, Bioregistry.io.

Any time you encounter a web service using a “remote data access” style, i.e. exposing a query language via a single access point – SQL, SPARQL, GraphQL, MongoDB, etc. – it’s highly likely that all entity identifiers are indexed to support efficient retrieval and combination/joining.

Unless you can bask in glorious isolation in a siloed domain/organization. ↩︎

Validating Traces — Syntactically, Semantically, and Situationally

2022-09-04T13:28:00+02:00

How do you validate a reified trace of digital-object provenance?

Is it even possible? This is syntactic validation. Values that should be strings are strings, dates are dates, lists are lists, you know the drill…

Is it plausible? This is semantic validation. This date should be earlier (i.e., “less than” in ISO8601 format) than that date, this number should be an integer multiple of that number, this field’s values are unique across the collection, this field is a reference to an object of that type, etc. Also known as structural validation.

Is it probable? This is situational validation, i.e. a matter of pragmatics. This mode of “validation” logs not errors per se, but rather warnings. This is the world of statistical process control, of setting thresholds for anomaly detection and tuning your go-ahead logic to align with risk tolerance and chosen strategies for mitigation of failures.
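The three questions can be sketched as three stages of one Python function. The trace fields (started, ended) and the 30-day threshold are assumptions chosen for illustration; the point is only that the first two stages produce errors while the third produces warnings.

```python
from datetime import date


def validate_trace(trace: dict, typical_max_days: int = 30):
    """Return (errors, warnings) for a trace with 'started' and 'ended' dates."""
    errors, warnings = [], []
    started, ended = trace.get("started"), trace.get("ended")

    # Syntactic: is it even possible? Values must have the expected types.
    if not isinstance(started, date) or not isinstance(ended, date):
        errors.append("syntactic: 'started' and 'ended' must be dates")
        return errors, warnings

    # Semantic: is it plausible? Values must relate sensibly to each other.
    if started > ended:
        errors.append("semantic: 'started' must not be after 'ended'")
        return errors, warnings

    # Situational: is it probable? Unusual values warn rather than fail.
    if (ended - started).days > typical_max_days:
        warnings.append("situational: duration exceeds the typical maximum")

    return errors, warnings


ok = validate_trace({"started": date(2022, 9, 1), "ended": date(2022, 9, 4)})
```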

A Disconnect Between FAIR Infrastructure Devs and Product Devs

2022-09-02T18:18:22+02:00

Rory Macneil nails it:¹

This seems to me to be a really important problem. In my experience a lot of the discussion about things like PIDs and controlled vocabularies seems to assume that these things just exist. But when you actually go to try to use them and make them usable in the context of tools that people use in research, that presents a whole additional series of challenges; I think oftentimes those challenges are overlooked or ignored or not even thought about by people who are doing work in PIDs and controlled vocabularies.

Rory Macneil and Nick Garabedian, FAIR Data Podcast, August 31, 2022. https://anchor.fm/fairdatapodcast/episodes/Nick-Garabedian-e1n6vid ↩︎

Validating Translation

2022-09-01T10:20:34+02:00

Given a representation of (meta)data that dcterms:conformsTo some data profile, you may wish to translate it to another data profile.

If a resource is accessible from an HTTP server, then as a client you may negotiate the content representation in a standard way. Traditional content negotiation (aka “conneg”) is limited to file formats, aka syntax rather than semantics, but content negotiation by profile (aka “connegp”) can facilitate translation.
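As a sketch of what a client might send, the Python below builds request headers for both kinds of negotiation. It assumes the Accept-Profile request header as described in the W3C draft on Content Negotiation by Profile; the profile URI is hypothetical, and whether a given server honors the header is a separate question.

```python
def negotiation_headers(media_type: str, profile_uri: str) -> dict:
    """Build headers asking for a media type (conneg) and a data profile (connegp)."""
    return {
        "Accept": media_type,                  # syntax: which file format
        "Accept-Profile": f"<{profile_uri}>",  # semantics: which data profile
    }


# Hypothetical profile URI, for illustration only.
headers = negotiation_headers(
    "text/turtle", "https://example.org/profiles/my-profile"
)
```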

There may be many ways to functionally specify data profile negotiation, i.e. translation. Ultimately, one functional profile is employed for a given instance of translation.

Thus, it seems that one way to “validate translation” would be to identify the functional profile employed and trace process outputs for conformance.

I’m really getting into the weeds here, aren’t I? My insistence on exploring the cross product of {identifying,validating,indexing,translating,tracing} \(\times\) {identifiers,validators,indexers,translators,tracers} to elucidate FAIR-enabling services is a bit dizzying. I shall cautiously continue.

Semantic Stars Upon Thars

2022-08-31T23:18:16+02:00

To validate is to compute, so indexing metadata for past validation events and caching any detailed payloads can save time and effort.

Why index? To search. Why search? To find relevant (“likely valid”), ranked (“more likely valid”) results.

You may think of validation events as akin to GitHub stars, but with semantics: semantic stars, in that you can filter by validation types and/or validation agents that you trust, by recency if applicable, etc.

Crucially, the qualified nature (i.e., fair:I3) of starring may help a community of practice mind the cautionary tale of The Sneetches.
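
A minimal sketch of what filtering “semantic stars” could look like. The `ValidationEvent` fields and the agent names are hypothetical; the point is only that each star carries qualifiers (type, agent, time) that a consumer can filter on.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ValidationEvent:
    subject: str   # identifier of the validated object
    kind: str      # validation type, e.g. "syntactic" or "semantic"
    agent: str     # who or what performed the validation
    passed: bool
    at: datetime


def stars(events, subject, kinds=None, trusted_agents=None, since=None):
    """Count passing validation events for a subject, filtered by qualifiers."""
    return sum(
        1
        for e in events
        if e.subject == subject
        and e.passed
        and (kinds is None or e.kind in kinds)
        and (trusted_agents is None or e.agent in trusted_agents)
        and (since is None or e.at >= since)
    )


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical index of past validation events.
events = [
    ValidationEvent("ds-1", "semantic", "alice", True, utc(2022, 8, 1)),
    ValidationEvent("ds-1", "syntactic", "bob", True, utc(2022, 8, 20)),
    ValidationEvent("ds-1", "semantic", "mallory", True, utc(2022, 8, 25)),
    ValidationEvent("ds-1", "semantic", "alice", False, utc(2022, 8, 30)),
]

print(stars(events, "ds-1"))
print(stars(events, "ds-1", kinds={"semantic"}, trusted_agents={"alice", "bob"}))
```

An unqualified count treats every star alike; the qualified count lets each consumer decide whose stars matter.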

Who Validates the Validators?

2022-08-30T15:49:05+02:00

Given a fip:Metadata-schema and a validator for it, such as a sh:Validator or a JSON Schema, how do you determine that the validator is…valid? That it speaks the desired fip:Knowledge-representation-language, that it knows all the terms in a desired fip:Structured-vocabulary and checks their usage against a desired fip:Semantic-model? In other words, that it adheres to a doap:Specification?

I do not know. However, I suspect that it is more important to check output rather than input.

Subscribe to get short notes like this on Machine-Centric Science delivered to your email.

This page attempts to be a FAIR Point: "view page source" in your browser to see its schema:LearningResource JSON-LD.

Identifying Validation

2022-08-29T21:59:10+02:00

What conveys that data has been validated or is yet to be validated?

How do you identify the nature and process of validation for a given digital object?

Who is involved? What auxiliary resources are involved? Is the process:

Do-it-yourself, with (implicit or explicit) references to validation assets?
Do-it-with-you, with references to validation services?
Do-it-for-you, with references to validation results and/or signoffs?

An example of explicit reference amenable to do-it-yourself validation is the schemaURL field in an OpenLineage RunEvent JSON document, which links to its JSON Schema definition:

{
  "eventType": "START",
  "eventTime": "2020-12-09T23:37:31.081Z",
  "run": {
    "runId": "3b452093-782c-4ef2-9c0c-aafe2aa6f34d"
  },
  "job": {
    "namespace": "my-scheduler-namespace",
    "name": "myjob.mytask"
  },
  "inputs": [
    {
      "namespace": "my-datasource-namespace",
      "name": "instance.schema.table"
    }
  ],
  "outputs": [
    {
      "namespace": "my-datasource-namespace",
      "name": "instance.schema.output_table"
    }
  ],
  "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
  "schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json#/definitions/RunEvent"
}
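
As a rough sketch of the first do-it-yourself step, a client can split the schemaURL into the schema document to fetch and a JSON Pointer to the relevant definition within it. The `stand_in_schema` below is a hypothetical placeholder, not the real OpenLineage schema; actual validation would additionally require fetching the document and running a JSON Schema validator.

```python
import json
from urllib.parse import urldefrag

event = json.loads("""
{
  "eventType": "START",
  "schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json#/definitions/RunEvent"
}
""")


def locate_schema(event):
    """Split schemaURL into (document URL, JSON Pointer tokens)."""
    url, fragment = urldefrag(event["schemaURL"])
    # JSON Pointer tokens are "/"-separated; "~1" and "~0" are escapes.
    tokens = [
        t.replace("~1", "/").replace("~0", "~")
        for t in fragment.split("/")[1:]
    ]
    return url, tokens


def resolve_pointer(document, tokens):
    """Walk a parsed JSON document along the pointer tokens."""
    node = document
    for token in tokens:
        node = node[token]
    return node


# Hypothetical stand-in for the fetched schema document.
stand_in_schema = {"definitions": {"RunEvent": {"type": "object"}}}

url, tokens = locate_schema(event)
print(url)
print(resolve_pointer(stand_in_schema, tokens))
```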

Flavors of do-it-with-you validation include checksums and content hashing. You give a service some input with a checksum so that the service can verify that your input is plausible. A service gives you a content hash so that you can verify that its output is plausible. But how do you identify what is being done, and to which field (perhaps it’s done to the object identifier itself)? One useful standard is the HTTP Digest header.
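
A minimal sketch of content hashing in this spirit, assuming the `SHA-256=<base64>` value form that I understand the Digest header to use (newer digest fields use a slightly different syntax, so treat the exact string format as an assumption). The function names are my own.

```python
import base64
import hashlib
import hmac


def digest_value(body: bytes) -> str:
    """Build a Digest-style header value for a message body."""
    raw = hashlib.sha256(body).digest()
    return "SHA-256=" + base64.b64encode(raw).decode("ascii")


def verify(body: bytes, header_value: str) -> bool:
    """Recompute the digest and compare in constant time."""
    return hmac.compare_digest(digest_value(body), header_value)


body = b'{"eventType": "START"}'
header = digest_value(body)

print(verify(body, header))
print(verify(body + b" ", header))
```

The header tells the receiver both which algorithm was applied and what it was applied to (the body), which is the identification question raised above.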

Do-it-for-you signoffs may involve digital signatures (and there is a standards-track HTTP Signature Header).

It’s clear that cryptography must play a big role here:

We should accept the premise that people will not run their own servers by designing systems that can distribute trust without having to distribute infrastructure. This means architecture that anticipates and accepts the inevitable outcome of relatively centralized client/server relationships, but uses cryptography (rather than infrastructure) to distribute trust.¹

M. Marlinspike, “My first impressions of web3,” Moxie Marlinspike, Jan. 07, 2022. https://moxie.org/2022/01/07/web3-first-impressions.html (accessed Aug. 29, 2022). ↩︎

Interview with Martynas Jusevičius

2022-08-29T15:42:22+02:00

This week on Machine-Centric Science, I interviewed Martynas Jusevičius, currently at AtomGraph and based in Copenhagen, Denmark.

Topics we spoke about included: cell-based UI as computational canvases (e.g. Jupyter), personal knowledge graph tools and block UI protocols, the Solid Pod spec and ecosystem for decentralized data ownership and visiting, and the roles both of researchers and those who develop software for them in realizing FAIR-principled implementations.

HAVE A LISTEN »

Quotable Quotes

“The RDF graph data model…seems like the only realistic implementation at this point for the FAIR principles.”

“To me, FAIR data is more or less equal to Linked Data.”

“The software has to be built around these principles. And that’s maybe quite a radical idea because for a long time, data was just like an add-on to software, right? But essentially now it’s the inverse. It’s the data that is at the center – that’s the data-centric paradigm.”

“…there has to be some kind of paradigm shift, both in how researchers see this, but also for those who develop software for researchers, that what scientific publishing produces is not just PDFs…Through fair data, we can look at scientific publishing as this huge network of research artifacts that can be navigated, explored – as a knowledge graph naturally – but also recombined, reused and repurposed in different things.”

If you enjoyed this episode, please consider sharing it with a few friends who might find it useful. Thanks!

Tracing Identifiers

2022-08-27T22:42:02+02:00

At a base level, an identifier is simple to trace – its trace is the sequence (modulo concurrency) of assertions of which it is a part.

In fact, this can be the basis for tracing the representation of a “thing” as the flock of relationships between identifiers, i.e. metadata, that waxes and wanes in association with “the” identifier of the thing.
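
A minimal sketch of this idea, assuming assertions are kept as an ordered log of (subject, predicate, object) tuples. The identifiers and predicates are hypothetical.

```python
def trace(assertion_log, identifier):
    """Return, in log order, every assertion the identifier takes part in."""
    return [a for a in assertion_log if identifier in a]


# Hypothetical assertion log, oldest first.
assertion_log = [
    ("ex:sample-1", "ex:collectedBy", "ex:person-7"),
    ("ex:sample-2", "ex:collectedBy", "ex:person-7"),
    ("ex:sample-1", "ex:storedIn", "ex:freezer-3"),
    ("ex:dataset-9", "ex:derivedFrom", "ex:sample-1"),
]

for assertion in trace(assertion_log, "ex:sample-1"):
    print(assertion)
```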

Translating Identifiers

2022-08-26T21:46:02+02:00

Good identifiers are opaque, so translation is by association – owl:sameAs, skos:exactMatch, or some other relationship. Translation doesn’t follow from reading a sign, but from retrieving a sense.

If metadata is relationships between identifiers,¹ then metadata is the medium of conceptual convergence.
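
A minimal sketch of translation by association: rather than parsing an identifier, follow equivalence assertions until an identifier in the target scheme is reached. The identifiers and scheme prefixes are hypothetical.

```python
EQUIVALENCE = {"owl:sameAs", "skos:exactMatch"}


def translate(identifier, assertions, target_prefix):
    """Follow equivalence links in either direction to the target scheme."""
    seen = {identifier}
    frontier = [identifier]
    while frontier:
        current = frontier.pop()
        if current.startswith(target_prefix):
            return current
        for s, p, o in assertions:
            if p not in EQUIVALENCE:
                continue
            # Both relations are symmetric, so walk the link both ways.
            for a, b in ((s, o), (o, s)):
                if a == current and b not in seen:
                    seen.add(b)
                    frontier.append(b)
    return None


# Hypothetical opaque identifiers in three schemes.
assertions = [
    ("a:7f3k", "owl:sameAs", "b:0042"),
    ("c:zq9", "skos:exactMatch", "b:0042"),
    ("a:7f3k", "ex:relatedTo", "c:other"),
]

print(translate("a:7f3k", assertions, "c:"))
```

Nothing about the string `a:7f3k` hints at `c:zq9`; the sense is retrieved from metadata, not read off the sign.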

M. Bide, “Standard Identifiers: an overview of the current landscape,” presented at the USPTO Open Meeting: Facilitating the Development of the Online Licensing Environment for Copyrighted Works, Apr. 01, 2015. [Online] ↩︎

Indexing Identifier Services

2022-08-24T22:55:26-04:00

Where do you look for identifiers?

If you’re looking for a URI, the IANA has a registry of schemes, like https, mailto, and tel.

These days, to resolve an identifier, you generally use the https scheme, which has an authority component in its URI format. You can go with content addressing like IPLD CIDs, but that doesn’t solve where to look – it solves knowing that you found the thing (or that you already have the thing).

Authority is hard to persist. So people and organizations pool efforts towards generic authority under the https URI scheme, like hdl.handle.net, doi.org, n2t.net, identifiers.org, purl.org, w3id.org, wikidata.org, etc. Or they pool towards authority with narrower scope, like orcid.org, ror.org, igsn.org, etc. Or they just pursue lasting authority with a new .org or through a trusted .gov, etc.

How do you index identifiers from various sources? There are efforts like LOV for vocabularies, and Crossref and DataCite for DOIs.

I think OpenAlex is setting a nice example of collecting identifiers from various systems and connecting them, along with descriptive metadata.

What about collecting identifier services? Is this of interest? Is it a fool’s errand due to the rise and fall of authority? Or is tracking and using the rising and falling of reputation and reliability, Google-PageRank-style, a way to shepherd researchers to robust persistence of identifiers?

Validating an Identifier Service

2022-08-23T16:23:41-04:00

How do you validate that an identifier service provides global uniqueness of minted keys, persistence of bindings, and resolution of keys to descriptive metadata?

The key problem with testing is that a test (of any kind) that uses one particular set of inputs tells you nothing at all about the behaviour of the system or component when it is given a different set of inputs.¹

If you know that a given ID provided by a service is unique, that tells you nothing at all about the uniqueness of another ID provided by that service. You need to understand – be able to reason about – whatever algorithm the service uses to guarantee uniqueness, and trust that the service implements that algorithm.

The key problem is that a test (of any kind) on a system or component that is in one particular state tells you nothing at all about the behaviour of that system or component when it happens to be in another state.¹

If you know that an ID provided by a service is bound to a particular digital object and/or to particular descriptive metadata, that tells you nothing at all about what the service will bind that ID to tomorrow. You need to understand and trust any policy provided by that service regarding persistence of bindings.²³

Running a test in the presence of concurrency with a known initial state and set of inputs tells you nothing at all about what will happen the next time you run that very same test with the very same inputs and the very same starting state…and things can’t really get any worse than that.¹

If you know that a service resolves an ID request to metadata describing a digital object and its location, that tells you nothing at all about how the server will respond to an identical follow-up request. You need to understand and trust the access protocols provided by the service.

B. Moseley and P. Marks, “Out of the Tar Pit.” Feb. 06, 2006. [Online] ↩︎ ↩︎ ↩︎
J. Kunze, S. Calvert, J. DeBarry, M. Hanlon, G. Janée, and S. Sweat, “Persistence statements: describing digital stickiness,” Nov. 2016, Accessed: Aug. 23, 2022. [Online]. Available: https://escholarship.org/uc/item/2zm9x47c ↩︎
“Permanence Levels and the Archives for NLM’s Permanent Web Documents. NLM Technical Bulletin. 2005 Mar-Apr.” https://www.nlm.nih.gov/pubs/techbull/ma05/ma05_archive.html (accessed Aug. 23, 2022). ↩︎

Identifying Identifying

2022-08-22T19:27:28-04:00

Day 1 of my five-week experiment to elaborate on FAIR-enabling services, and I already find myself fallen flat on my face.

I had wanted to go through the motions of brainstorming concepts related to the service of identifying and partitioning them into concepts, attributes, and relationships in the sense of Sequeda and Lassila’s¹ “Knowledge Report” intermediate representation. For each, I would draft a table to name it and to provide an alternative name or two, a definition, an identifier for the thing, an identifier template for instances of the thing culled from sources, and a query to get instances from sources – or at least a nod to how one might proceed with these, particularly for the last two items.

I instead found myself in Philadelphia for longer than anticipated, for reasons I may or may not divulge over a beer, and so here’s what I came up with in the limited timebox I gave myself to push something out today:

Identifying some concepts (attributes? relationships?) about identifying

An identifying service provides guarantees wrt protocol and policy, and algorithms to make good on the guarantees. These guarantees revolve around the nature of requests and responses. Requests wrt identifying are about minting new IDs, binding information to minted IDs, or resolving supplied IDs to bound information. Responses are either the thing identified, information about the thing, or where to get the thing.
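
The three request types can be sketched as a toy in-memory service. The class and method names are my own, random UUIDs stand in for a uniqueness algorithm, and the "bind once" rule stands in for a persistence policy; none of this is a real service's API.

```python
import uuid


class IdentifierService:
    """Toy service: mint opaque IDs, bind information, resolve IDs."""

    def __init__(self):
        self._bindings = {}

    def mint(self):
        # Algorithm: random UUIDs make collisions vanishingly unlikely.
        new_id = uuid.uuid4().hex
        self._bindings[new_id] = None  # minted, but not yet bound
        return new_id

    def bind(self, identifier, metadata=None, location=None):
        if identifier not in self._bindings:
            raise KeyError(f"unknown identifier: {identifier}")
        # Policy: a binding, once made, is never changed.
        if self._bindings[identifier] is not None:
            raise ValueError(f"already bound: {identifier}")
        self._bindings[identifier] = {"metadata": metadata, "location": location}

    def resolve(self, identifier):
        # Response: information about the thing and where to get it.
        binding = self._bindings.get(identifier)
        if binding is None:
            raise LookupError(f"nothing bound to: {identifier}")
        return binding


service = IdentifierService()
sample_id = service.mint()
service.bind(
    sample_id,
    metadata={"title": "Example dataset"},
    location="https://example.org/data/1",
)
print(service.resolve(sample_id))
```

Note that minting and binding are separate requests: an ID can exist before anything is bound to it.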

Okay, timebox is over. Yes, leaves in the diagram above have gone unexplained. Thankfully, there are more days for thinking, i.e. writing.

J. Sequeda and O. Lassila, Designing and building enterprise knowledge graphs. San Rafael: Morgan & Claypool Publishers, 2021. ↩︎

A Five-Week Experiment to Elaborate on FAIR-Enabling Services

2022-08-19T09:48:20-04:00

Yesterday, I proposed that a strategy for implementing the FAIR principles for research data management can focus on ensuring five FAIR-enabling services, which in turn will prompt tactical choices of FAIR-enabling resources that may satisfactorily address each question and thereby produce a comprehensive implementation profile. The purpose of such care in design is to de-risk one’s investment in “going FAIR”, as the cost of systems implementation and maintenance can easily exceed the cost of a design phase by an order of magnitude.

These FAIR-enabling services are, again:

Identifying
Validating
Indexing
Translating
Tracing

Other than cursory remarks, I have yet to elaborate in any detail on the behavior I expect from these services. I would like to remedy this over the next five weeks, one week per service, in the order given above.

And there is a constraint I would like to impose on myself: each week will be a five-day progression of notes that reflects the service sequence above. For example, during the first week (on Identifying), I will:

On Day 1: Identify the concepts, attributes, and relationships at play in Identifying.
On Day 2: Assert and validate a set of statements, using elements that I identified the day before, that should hold for Identifying.
On Day 3: Demonstrate a process of Indexing the above in order to efficiently retrieve assertions.
On Day 4: Assert relationships among schemes for Identifying, and attempt Translating from one to another.
On Day 5: Demonstrate a process of Tracing revisions made to the metadata that Identifying yields (i.e., what is returned when an identifier is resolved).

Is this five-week planned experiment ambitious? Yes. Is it too ambitious? Almost certainly yes. Will I attempt it anyway? Yes.

Would I appreciate your day-to-day feedback for course correction? Yes. And I would enthusiastically acknowledge your contribution when collecting and clarifying the sum of each week’s notes.

FAIR-Enabling Services

2022-08-18T22:14:09-04:00

(The following is a transcript of my recent podcast episode on this topic.)

There is a FAIR Implementation Profile ontology, and it talks about FAIR-enabling resources. So these are corresponding to questions. For each of the fifteen FAIR principles, this FAIR-enabling resource, the idea is that you’ve identified a challenge or you’ve made a choice about some resource that’s going to help you fulfill that – either that resource is available, or it’s planned, or it’s proposed, or you’re going to phase it out, that sort of thing.

Twelve FAIR-enabling resources have been identified as broad categories that help address each of the challenges with FAIR principles. One is an identifier service. This is a service that provides for any digital object, (1) algorithms guaranteeing global uniqueness; (2) a policy document that guarantees persistence; and (3) resolution of the identifier to machine actionable metadata describing the object and its location.

This all falls under Findable. Another FAIR-enabling resource is metadata schema. So this would be a specification, a schema, that specifies metadata fields describing attributes of data or other digital objects. Another FAIR-enabling resource would be a metadata-data linking schema. So this would be, specifically, a specification – schema – that provides a unique, persistent, ideally bi-directional machine actionable link between metadata and the data they describe. And the final FAIR-enabling resource for the Findable principles is a registry, which is a service that indexes metadata and data and provides a search over that index.

For Accessible, there are three identified FAIR-enabling resources. Communication protocol: so this is a specification for how messages are structured and exchanged. There’s authentication and authorization service. So this is a service that mediates access to digital objects according to specified conditions.

And another FAIR-enabling resource is a metadata preservation policy. So this would be a document that describes the conditions under which metadata are to be provisioned in the future, maybe part of a data management plan.

Okay. Five more FAIR-enabling resources identified. We’re going to Interoperability now. One is a knowledge representation language: a language specification whereby knowledge can be made processable by machines. Another FAIR-enabling resource is structured vocabulary: a controlled list of uniquely identified and unambiguous concepts with their definitions represented preferably using web standards.

Finally, in Interoperable, a FAIR-enabling resource would be a semantic model, a specification that defines qualified relations between entities describing data or other digital objects using structured vocabularies.

The two remaining FAIR-enabling resources, under Reusable, are (1) a data usage license – so that’s a document that describes the conditions under which a digital object can be legally used – and (2) a provenance model: a specification – schema – that specifies metadata fields describing the origin and lineage of data or other digital objects.

So these are a bunch of FAIR-enabling resources. I was thinking about this a bit, and I wanted to distinguish between things that actually have to be running in order for data to be alive and for you to actually find it, access it, interoperate with it, reuse it, versus things that are resources that those services will need that are more “one-time” things.

For example, a metadata schema isn’t really a service, so to speak. It’s something that you can do and be done with. You might need to make revisions of it, so maybe there’s some change management procedure. But in terms of the actual service, it isn’t quite like an identifier service, where you want to be given an identifier and be able to know where to go, and resolve that identifier, and determine if you have the right identifier and then get the data, get the metadata. So that’s an actual service that needs to be run. If not continuously, whenever you decide you need identification, you can spin up that service and do that, but it’s an actual service that needs to be run in order for you to have living findability, accessibility, interoperability, reusability.

So of these 12 FAIR-enabling resources, I’ve thought about how to condense them into FAIR-enabling services. What are the actual services that are really important across these that someone needs to worry about if they want a FAIR data ecosystem in their lab and their lifecycle for research, that sort of thing.

I’ve identified these as (1) An identification service, an identifier service. You need to be able to identify things. And this is identifying metadata, datasets, as well as vocabulary things. So this spans, say, F1, the A principles, as well as I2 in terms of making a vocabulary FAIR, being able to unambiguously identify vocabulary terms. So identification is a big service that’s needed.

The second service is validation. So, you could be given a metadata schema and given statements and assertions, but how do you know they actually conform to the schema? Are you going to look those up by hand? Are you going to kind of cross check with a sheet of paper that you have in front of you that says the schema? No, you really want a validation service that will validate statements according to a schema that you’re imposing.

The third service is indexing. So this is related to the registry. You need something that, given a bunch of statements that have identifiers that resolve, a bunch of statements that are valid according to the schema – so you’ve identified, you validated. You then need to collect them and be able to find what you need. And so that involves indexing. So that’s an actual service where you can search the index. An index is the basis of search. Otherwise you’re just doing a full scan of all your statements. You won’t get any leverage. You won’t be able to winnow down with any efficiency at all. So this index thing, this ongoing indexing, where you have an index and you maintain an index and when you identify new concepts or data, assign them identifiers, validate your statements about them, you want to throw them in the index and you want your registry to re-index things. So that indexing needs to be a service.
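
To illustrate the leverage an index gives over a full scan, here is a minimal sketch, assuming statements are (subject, predicate, object) tuples with hypothetical identifiers. The index is keyed by (predicate, object), so a lookup touches only the matching entry rather than every statement.

```python
from collections import defaultdict


def full_scan(statements, predicate, obj):
    """Answer a query by visiting every statement."""
    return {s for s, p, o in statements if p == predicate and o == obj}


def build_index(statements):
    """Map (predicate, object) to the set of subjects asserting it."""
    index = defaultdict(set)
    for s, p, o in statements:
        index[(p, o)].add(s)
    return index


def reindex(index, new_statements):
    """Fold newly identified and validated statements into the index."""
    for s, p, o in new_statements:
        index[(p, o)].add(s)


# Hypothetical statements.
statements = [
    ("ex:ds-1", "ex:license", "ex:cc-by"),
    ("ex:ds-2", "ex:license", "ex:cc0"),
    ("ex:ds-3", "ex:license", "ex:cc-by"),
]

index = build_index(statements)
print(sorted(index[("ex:license", "ex:cc-by")]))

reindex(index, [("ex:ds-4", "ex:license", "ex:cc-by")])
print(sorted(index[("ex:license", "ex:cc-by")]))
```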

The fourth FAIR-enabling service to me is translation. So, this is the essence of interoperation. This is the point of having a knowledge representation language and a semantic model where you’re defining qualified relations. The idea being, you have a bunch of metadata and you want to use it for something else. So you need some service to actually translate it. If you have data in some format, you want to be able to translate it. You want these qualified links to know that, if you have metadata of this format, say of a schema.org Dataset and you want a DCAT Dataset, you know the corresponding mapping and you can perform that translation. So that would be a FAIR-enabling service that would leverage resources like semantic model, structured vocabulary, language – ultimately, it would leverage your index as well. So, translation would also be dependent on an index, just like search is.
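
As a rough sketch of such a translation, the mapping below pairs a handful of schema.org Dataset properties with DCAT and Dublin Core terms. The pairs are my illustrative reading of how the two vocabularies correspond, not an authoritative crosswalk, and the example record is hypothetical.

```python
# Illustrative property mapping; treat each pair as an assumption.
SCHEMA_TO_DCAT = {
    "schema:name": "dct:title",
    "schema:description": "dct:description",
    "schema:keywords": "dcat:keyword",
    "schema:license": "dct:license",
    "schema:datePublished": "dct:issued",
}


def translate_dataset(doc, mapping=SCHEMA_TO_DCAT):
    """Translate a schema.org Dataset dict into a DCAT Dataset dict.

    Returns (translated, untranslated) so that nothing is silently dropped.
    """
    if doc.get("@type") != "schema:Dataset":
        raise ValueError("expected a schema:Dataset")
    translated = {"@type": "dcat:Dataset"}
    untranslated = {}
    for key, value in doc.items():
        if key == "@type":
            continue
        if key.startswith("@"):
            translated[key] = value  # keep @id and other JSON-LD keywords
        elif key in mapping:
            translated[mapping[key]] = value
        else:
            untranslated[key] = value
    return translated, untranslated


# Hypothetical record.
record = {
    "@id": "https://example.org/dataset/1",
    "@type": "schema:Dataset",
    "schema:name": "Example measurements",
    "schema:keywords": ["example", "measurements"],
    "schema:version": "1.2",
}

translated, untranslated = translate_dataset(record)
print(translated)
print(untranslated)
```

Returning the untranslated remainder keeps the translation honest about what the mapping does not cover.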

And the final service that I think is important here is tracing. So, given something, you want to trace “where did it come from?”, and how you can use it. So this connects directly to your, static or not, policies about usage rights, data usage, and your provenance model. And this is how you can actually trace where things are, to determine if you can reuse it. So this is something active that you want. You want something, and again, this would ultimately leverage an index as well. So you’d have a bunch of data objects and metadata objects and vocabulary terms, all of which would need to be identified, so that’d be an identifying service. All of your statements about things, about provenance, mapping for translation, indexing, all of that would have to be validated. So you have the validation service, and then finally you have the indexing. And that puts everything in. And the indexing is the basis of support for search, which I don’t think needs to be a separate FAIR-enabling service – there are various ways of searching over an index, given it. But it also enables this translation based on the semantics that you have in your model and your resources, and tracing to determine, can I use this? What’s the provenance of this? Was it based on things that fell under this certain license? And so, depending on the license of my transformations, this is what I can use it for.
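
A minimal sketch of that last question, assuming provenance is indexed as a map from each object to the objects it was derived from, alongside a map from objects to licenses. The identifiers and license labels are hypothetical.

```python
def trace_licenses(object_id, derived_from, license_of):
    """Walk provenance upstream and collect the licenses encountered."""
    licenses = set()
    seen = set()
    stack = [object_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against revisiting shared ancestors
        seen.add(current)
        if current in license_of:
            licenses.add(license_of[current])
        stack.extend(derived_from.get(current, []))
    return licenses


# Hypothetical provenance and license indexes.
derived_from = {
    "ex:figure-1": ["ex:table-1"],
    "ex:table-1": ["ex:raw-a", "ex:raw-b"],
}
license_of = {
    "ex:raw-a": "CC0-1.0",
    "ex:raw-b": "CC-BY-4.0",
    "ex:table-1": "CC-BY-4.0",
}

print(sorted(trace_licenses("ex:figure-1", derived_from, license_of)))
```

Deciding what the collected licenses jointly permit is a separate step that this sketch does not attempt.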

So again, these FAIR-enabling services are: Identifying, Validating, Indexing, Translating, and Tracing. And I hope to go into more detail about how these relate to the FAIR principles and the resources, and sort of elaborate on them individually over the coming weeks.

FAIR-Enabling Resources - Identifier Services

2022-08-15T22:27:00-04:00

Here are some identifier services listed as such by FIP Wizard, a free-to-signup online tool to guide a user in creating and publishing a machine-actionable FAIR Implementation Profile (FIP):

Old IGSN
International Generic Sample Number before integration with DataCite

SDN CDI PID | SeaDataNet CDI PID
SeaDataNet Common Data Persistent Identifier

U.S. Department of Energy Office of Scientific and Technical Information (OSTI) Data ID Service
Through the DOE Data ID Service, OSTI assigns persistent identifiers, known as Digital Object Identifiers (DOIs), to datasets submitted by DOE and its contractor and grantee researchers and registers the DOIs with DataCite to aid in citation, discovery, retrieval, and reuse. OSTI assigns and registers DOIs for datasets for DOE researchers as a free service to enhance the Department’s management of this important resource.

URI | Uniform Resource Identifier
URI is a string that provides a unique address (either on the Internet or on another private network, such as a computer filesystem or an Intranet) representing a resource, and implicitly describes where a resource can be found. A resource identification need not suggest the retrieval of resource representations over the Internet, nor need they imply network-based resources at all.

There are currently four resources. I authored the entry for OSTI DOIs. “URI” doesn’t seem like a service. We have a long ways to go.

Schema Translation Infrastructure

2022-08-11T11:13:35-04:00

Repurposing data is hard sometimes. Given a current application’s data-worldview – i.e., its schema – one cannot in general pull in historical data collected for different applications because those applications had different worldviews – i.e., they used different data schemas.

One may perform one-off or ongoing transformations – e.g. ETL jobs – as part of a hub-and-spoke strategy to bring data from past worlds into the “present” world so that all the data can be queried in a uniform way, in the language of the present-application schema.

Unfortunately, the “present” world is a moving target. And “past” worlds may be merely dormant – they may become “present” again if a given application is revisited.

Rather than hub-and-spoke schema convergence and single-timeline data migration, what if schema translation infrastructure sought to reconcile queries across multiple worlds? That is, what if application-X-centric questions could travel to and collect partial information from other-application-centric worlds using the languages (schema) of those worlds?

Building off Ink & Switch’s ideas on edit lenses for schema evolution¹ and off Radul and Sussman’s ideas on propagation networks for computation², as well as off the observed salubrious hourglassing of the Internet’s layered-architecture design³, I’m thinking about how to facilitate effective “schema networking” that acknowledges and embraces the never-ending schema evolution characteristic of data collection efforts by research-producing organizations.

Initial Scribbles

First, I offer a simplified recapitulation of the layered architecture of the Internet:

Layered Architecture for the Internet

Next, I offer a mapping of the above to an analogous six-layered architecture for schema translation:

Layered Architecture for Schema Translation

The physical protocol layer is concerned with data (de)serialization/marshalling and storage. The data-link protocol layer is where ETL happens – how bytes are de-isolated and made accessible to the network. The network protocol layer is where propagation among worldviews/schemas “runs”, with each “cell” (in the parlance of the propagation network literature) a join-semilattice (or is it a meet-semilattice?) world that accumulates partial information via edit-lens functional propagators. The transport protocol layer is RDF over HTTP (FAIR Digital Objects?), the application protocol layer is RDF query (SPARQL, a Datalog, etc.), and the application layer is where specific-worldview-conforming data (i.e., things you plot, perform exploratory data analysis (EDA) on, select/engineer features from to feed to ML-model training, etc.) materialize.
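
To give the network layer some shape, here is a loose sketch of two schema “worlds” as cells that accumulate partial information, connected by a pair of lens-like propagators. It is my own toy reading of the ideas, not the API of Cambria or of any propagator library; the field names are hypothetical, and a contradiction is treated as an error rather than as a lattice element.

```python
class Contradiction(Exception):
    pass


class Cell:
    """A world that only ever gains information (a join-like merge)."""

    def __init__(self):
        self.content = {}
        self.watchers = []

    def add(self, info):
        merged = dict(self.content)
        for key, value in info.items():
            if key in merged and merged[key] != value:
                raise Contradiction(f"{key}: {merged[key]!r} vs {value!r}")
            merged[key] = value
        # Notify only on change, so two-way propagation terminates.
        if merged != self.content:
            self.content = merged
            for watcher in self.watchers:
                watcher()


def propagator(source, target, lens):
    """Re-run the lens whenever the source cell learns something."""
    def run():
        target.add(lens(source.content))
    source.watchers.append(run)
    run()


def a_to_b(content):
    return {"name": content["sample_name"]} if "sample_name" in content else {}


def b_to_a(content):
    return {"sample_name": content["name"]} if "name" in content else {}


world_a, world_b = Cell(), Cell()
propagator(world_a, world_b, a_to_b)
propagator(world_b, world_a, b_to_a)

world_a.add({"sample_name": "S1"})  # flows into world_b as "name"
world_b.add({"mass_g": 2})          # has no counterpart in world_a
print(world_a.content)
print(world_b.content)
```

Each world keeps its own vocabulary; only the information a lens can express crosses over.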

Finally, I offer a rough diagram of how various layer activities and dataflow within/between them may be visualized:

Schema Translation Infrastructure in Action

I want to close by noting that the problem of schema reconnection comes up not only with research laboratory datasets that were collected independently by different teams, but also with datasets collected over a long period of time by a single team as project/application requirements evolve and place adaptation pressure on the “working” schema to undergo several revisions, thus necessitating reconnection among schema versions (i.e. migrations, but not necessarily unidirectional if, say, a sub-team is still using an “old” schema and wants to contribute “new” data).

“Project Cambria: Translate your data with lenses,” Oct. 06, 2020. https://www.inkandswitch.com/cambria/ (accessed Aug. 01, 2022). ↩︎
A. Radul, “Propagation networks: a flexible and expressive substrate for computation,” Thesis, Massachusetts Institute of Technology, 2009. Accessed: Aug. 11, 2022. [Online]. Available: https://dspace.mit.edu/handle/1721.1/54635 ↩︎
M. Beck, “On the hourglass model,” Communications of the ACM, vol. 62, no. 7, pp. 48–57, 2019. https://doi.org/10/gj3fnj ↩︎

A Perlisism for Identifiers: Delay Binding

2022-08-08T22:42:20-04:00

Inference based on semantic retrieval is more robust than inference based on syntactic parsing.

identifiers should be as dumb as possible – in other words, should include as little metadata as possible about the thing being identified, leaving all information to be retrieved from metadata repositories rather than inferred from the identifier itself. People always want to infer meaning, and will often try to teach machines to do the same. The problem is that apparent meaning in the structure of an identifier is all too often misleading…¹

In order to be authoritative, identifiers should be assigned as early as practicable in the creation process, but minting is not binding.

Functions delay binding; data structures induce binding. Moral: Structure data late in the programming process.²

Identifier resolution delays binding; identifier structures induce binding. Moral: Structure identifiers late (or never) in the minting process.

Also, structure identifier resolution (i.e. retrieved-metadata structure) late. Metadata is about claims; there may be many and different claims about the same thing. “Multiple resolution”, i.e. making different metadata sources/profiles/formats accessible depending on what a client is trying to retrieve, is akin to functional polymorphism and hence even later binding.

M. Bide, “Standard Identifiers: an overview of the current landscape,” presented at the USPTO Open Meeting: Facilitating the Development of the Online Licensing Environment for Copyrighted Works, Apr. 01, 2015. [Online]. Available: pdf ↩︎
A. J. Perlis, “Special Feature: Epigrams on programming,” SIGPLAN Not., vol. 17, no. 9, pp. 7–13, Sep. 1982, doi: 10.1145/947955.1083808. Online at http://www.cs.yale.edu/homes/perlis-alan/quotes.html. ↩︎

When Do Developers Not Have to Talk to Stakeholders?

2022-08-03T09:56:25-04:00

An ontologist can bridge¹ domain expertise and software development via production of

a semi-informal so-called intermediate representation² that can be understood by domain experts, and
a formal ontology / knowledge graph that represents the domain in a machine-actionable way.

When you do software development, you want to take the human out of the loop as much as possible – really automate it, least amount of manual effort, just minimize that…When I explain to [developers] that an ontology or knowledge graph [is] basically an ontologist talking to stakeholders and making sure that all the implicit knowledge they have is expressed in an explicit structural format that systems can also read…a light bulb [is] lit in their head, like “Oh…so it means we do not have to talk to stakeholders?”…Yes!…You can basically have the ontologist talk to the stakeholders and put it into a format that you just query.³

And if you switch-hit as both a domain expert and a developer, “a little semantics goes a long way” – developing competence in and discipline towards producing intermediate representations can increase your capacity to effectively collaborate/delegate and thus increase professional impact.

Subscribe to get short notes like this on Machine-Centric Science delivered to your email.

J. Sequeda and O. Lassila, Designing and building enterprise knowledge graphs. San Rafael: Morgan & Claypool Publishers, 2021. ↩︎
M. Fernández-López, A. Gómez-Pérez, and N. Juristo, “METHONTOLOGY: From Ontological Art Towards Ontological Engineering,” Stanford University, EEUU, Mar. 1997. Accessed: Aug. 03, 2022. [Online]. Available: https://oa.upm.es/5484/ ↩︎
A. Faith and K. Kari, “Data Therapy & Using Ontologies To Translate Business Rules For Devs,” (Jul. 28, 2022). Accessed: Aug. 03, 2022. [Online Video]. Available: https://www.youtube.com/watch?v=lRUYY1pVVqI&t=238s ↩︎

Principles for Robustly Interoperable Digital Objects

2022-08-02T11:03:23-04:00

I have been ruminating on core values in service of stewardship of evolving scientific knowledge.

Specifically, what principles can I lean upon to guide me in the design of robustly interoperable digital objects?

Here is what has jelled so far for me:

Machine Interpretation
- rectifies and amplifies formal modeling
- facilitates machine action
- requires that both bits and semantics are accessible
Least Power
- constrained interpretation promotes interoperability
- lower barrier to support diverse and unanticipated use cases
- behavior is more likely to be understood and predicted with high confidence
Stationary Action
- small deviations from the intended sequence of processes (i.e. task path) do not require large compensating efforts to reach the intended outcome.
- emphasizes monotonicity, smooth steering, and suitable granularity of progress.
- emphasizes resilience of desired budget over desired schedule rather than minimization of initial budget over initial schedule.
Logic + Control
- facilitates declarative programming
- facilitates flexibility in choices of performance tradeoffs
- promotes reuse
Delta Encoding
- facilitates provenance
- facilitates revision control
- facilitates pub/sub for selective search and retrieval

Shotgun Semantics

2022-08-01T16:28:46-04:00

Developers often resort to shotgun parsing: scattering data checks and fallback values in various places throughout the system’s main logic.¹

The habit of scattering parser-like behaviour throughout an application’s code and the resulting inconsistencies in data handling can often lead not just to annoying complications and bugs, but also security vulnerabilities.²
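As a minimal sketch of the alternative to shotgun parsing, the example below parses and checks input once, at the boundary, so that the main logic receives a typed value and needs no scattered checks or fallback values. The `Sample` record and its `name` and `mass_g` fields are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """A fully checked record; the field names are hypothetical."""
    name: str
    mass_g: float


def parse_sample(raw: dict) -> Sample:
    """Parse untrusted input once, at the system boundary."""
    name = raw.get("name")
    if not isinstance(name, str) or not name.strip():
        raise ValueError("name must be a non-empty string")
    try:
        mass_g = float(raw["mass_g"])
    except (KeyError, TypeError, ValueError):
        raise ValueError("mass_g must be a number") from None
    if mass_g <= 0:
        raise ValueError("mass_g must be positive")
    return Sample(name=name.strip(), mass_g=mass_g)


def mass_in_kg(sample: Sample) -> float:
    """Main logic: no defensive checks here, the type carries the guarantee."""
    return sample.mass_g / 1000.0
```

Everything downstream of `parse_sample` can rely on `Sample` being well-formed, which keeps data handling consistent in one place.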

This is about reading data. What about when writing data, when setting the foundations for how it will ultimately “behave” and be interpreted? Are you firing shotshells, or are you slinging webs?

“Project Cambria: Translate your data with lenses,” Oct. 06, 2020. https://www.inkandswitch.com/cambria/ (accessed Aug. 01, 2022). ↩︎
S. Bratus and M. L. Patterson, “Shotgun parsers in the cross-hairs,” presented at BruCON 2012. Slides: http://langsec.org/brucon/ShotgunParsersBruCON.pdf (accessed Aug. 01, 2022). ↩︎

Interview with Shreyas Cholia

2022-07-29T10:23:10-04:00

This week on Machine-Centric Science, I interviewed Shreyas Cholia, currently at the Lawrence Berkeley National Laboratory in Berkeley, California.

Topics we spoke about included: data lifecycles, edge computing for data firehoses, provenance, standards, broad versus detailed domain vocabularies, scope for common APIs, and identifier leveling.

HAVE A LISTEN »

Quotable Quotes

“Maybe what that really means is that this publication step so to speak just needs to be pushed further upstream”

“Maybe it’s just conceptualizing the data lifecycle as being not so much a linear thing as much as it is just a bunch of different steps that could be applied to the data at different stages, and really any of those steps could happen at any time.”

“There’s a little bit of a disconnect right now…each domain tends to have a lot of detail that gets obscured by these high-level specifications…we’re seeing some interesting friction…things that evolved from different spaces, it’s interesting to see how they’re trying to come together now.”

“The holy grail is…everyone can look at everything and everyone can talk to each other…in this dataset, that’s what this column means and that’s what this field means and that’s how I can compare these two things.”

“There’s a lot more to harmonization than just making sure things are in the same unit.”

“The driving force here is more about machine readability and machine interpretability of the data.”

“That one’s tricky…it’s a little bit of a moving target in terms of where you see scientific value occurring.”

“So much of what matters is at the metadata level…If that’s different for different domains, which it will be, having the ‘one API to rule them all’ doesn’t really make a lot of sense.”

“At the highest level, DOIs are great…there are, though, a lot of identifiers that are kind of not ‘DOI-level’ identifiers…more low-level for tracking and provenance…down to the level of the individual datum…a row in a spreadsheet, or a single JSON object.”

“It’s never too late to start thinking about coming together and trying to standardize your data…Please also spend a lot of time seeing what’s out there and trying to work with existing standards and trying to be a part of the broader ecosystem rather than doing your own thing.”

If you enjoyed this episode, please consider sharing it with a few friends who might find it useful. Thanks!

Method and Structure

2022-07-27T10:18:46-04:00

Arrangements of bits have structure just like arrangements of atoms have structure. Interoperability is about aligning structure. Processing, properties, performance – if their characterization can be repeated, they have information structure.

The materials science tetrahedron (source).

All structure is created by a process. Any process that can be repeated is a method. Every method—indeed, every process—itself has a structure.¹

Method-and-Structure Möbius strip (source)

https://methodandstructure.com/ ↩︎

Interview with Patrick Huck, on implementing FAIR for computed materials data

2022-07-21T11:01:58-04:00

This week on Machine-Centric Science, I interviewed Patrick Huck, currently staff on the Materials Project at the Lawrence Berkeley National Laboratory in Berkeley, California. We talk about choices and considerations in implementing FAIR.

LISTEN NOW »

There are show notes at the link above. Also, I tried to summarize our discussion as a draft FAIR Implementation Profile (FIP).

Talking Points

Career paths for people that are scientists AND software engineers.

The U.S. Department of Energy Office of Scientific and Technical Information (OSTI) DOI Service.

What gets a DOI? Granularity of resources.

Partnering with the Novel Materials Discovery (NOMAD) Laboratory for accessing raw data.

Modeling: with Python classes and with OpenAPI.

API Gateway design for authentication and authorization.

Provenance: for calculation workflows and for structure sourcing (credit to submitters!).

Quotable Quotes

“I think that’s a big topic in science generally. What are the career paths for people that are software engineers that are also scientists or maybe scientists first and software engineers second, and have gone that route? It’s not like there’s H indexes for people like me in terms of publications.”

“[OSTI] provides the infrastructure for minting those DOIs and making sure that those links are always live. We’ve become over the years with now, I think 147,000 DOIs, their biggest data client.”

“We use what’s called robocrystallographer, which gets descriptions based on machine learning that we get based on the information that we calculate about that structure. And then we can take that description auto generated from our database entries and send it as metadata for the DOIs.”

“It’s kind of transparent without even knowing that there’s an API behind it. To the extent that sometimes people talk about the API and they actually mean the client. I think that’s a good thing. People in our space expect those things to be pretty transparent.”

“I don’t think that guarantees longevity on the scale of glacial times.”

“There’s a lot going on in terms of making data FAIR. It’s a little easier for making documents FAIR, like having PDFs findable. On the data level, it becomes a little bit more complicated. And I think that we should strive to get as close as possible to get to FAIR, but it might not be feasible for every domain.”

If you enjoyed this episode, please consider sharing it with a few friends who might find it useful. Thanks!

"My Data Model Is JSON"

2022-07-20T20:07:34-04:00

“My data model is JSON”. JSON is not a data model. JSON has no semantics in the context of information systems; JSON defines neither how data “behaves” nor how machines can compute with it.

“My data is just JSON”. Your data is never just JSON; you always impose external semantics.

“JSON is easy to understand”. What does the field "harrastukset" mean? In an example JSON document, its value is ["valokuvaus", "pienoismallit"]. Oh, you don’t know Finnish?¹

Ora Lassila, “Will knowledge graphs save us from the mess of modern data practice?,” Knowledge Graphs Conference, New York, NY, USA (2022). [Online]. Available: https://www.lassila.org/publications/2022/KGC2022-Lassila-keynote.pdf ↩︎

A FAIR Digital Object - Inching up the Hourglass

2022-07-19T11:15:09-04:00

Whether deliberate¹ or inevitable², the hourglass architecture of the Internet supports a great diversity of applications implemented using a great diversity of supporting services:

An (incomplete) illustration of the hourglass Internet architecture showing the six layers, from top to bottom: specific applications, application protocols, transport protocols, network protocols, data-link protocols, and physical-layer protocols. A FAIR Digital Object (FDO) protocol could extend the HTTP application protocol.

Could there be a minimal “spanning layer” protocol for FAIR-principled³ applications and services? The FAIR Digital Object (FDO) has emerged as a conceptual nexus for consideration of such a protocol.

There is a working draft online for an FDO framework.⁴ In it, an identifier resolves to a digital object (byte sequence) by default, but one may also request a so-called identifier record. This record would certainly support – via a simple qualified reference – the operation of accessing the identified object’s value-obvious situational information, i.e. the raw byte sequence. Crucially, the identifier record would also support – again, via simple qualified references – operations to access methodological (still value-obvious to certain consumers) and more philosophical (epistemic, ontological, axiological – value typically not obvious) information:

A FAIR Digital Object (FDO) framework - the identifier record, identifier resolution behavior, typing, and metadata schemas and records (source).

M. Beck, “On the hourglass model,” Commun. ACM, vol. 62, no. 7, pp. 48–57, Jun. 2019, doi: 10/gj3fnj. ↩︎
S. Akhshabi and C. Dovrolis, “The evolution of layered protocol stacks leads to an hourglass-shaped architecture,” SIGCOMM Comput. Commun. Rev., vol. 41, no. 4, pp. 206–217, Oct. 2011, doi: 10.1145/2043164.2018460. ↩︎
M. D. Wilkinson et al., “The FAIR Guiding Principles for scientific data management and stewardship,” Sci Data, vol. 3, no. 1, p. 160018, Mar. 2016, doi: 10/bdd4. ↩︎
L. O. Bonino da Silva Santos, “FAIR Digital Object Framework Documentation,” Nov. 03, 2021. https://fairdigitalobjectframework.org/ (accessed Jul. 19, 2022). ↩︎

Validation: Syntax, Semantics, and Pragmatics

2022-07-18T10:24:18-04:00

Validation is about preconditions for operation. It may be useful to separate preconditions into three subtypes: syntax, semantics, and pragmatics.¹

Syntax: Rules about what’s grammatically well-formed. Example: A CalculateAqueousStability command may have a set of atomic-composition pairs and a set of ion-concentration pairs. An atomic-composition pair is a string paired with a number between 0 and 1. An ion-concentration pair is a string paired with a number.

Semantics: Rules about what may be syntactically valid but is nonetheless nonsense. Example: A CalculateAqueousStability command may be syntactically valid, but its compositions don’t add up to 1, the ion concentrations are physically implausible, etc.

Pragmatics: Rules about contextual appropriateness for processing a syntactically and semantically valid message. Example: an online system can’t efficiently calculate stability for a system of more than 4 atomic elements on-the-fly, so this kind of command is rejected.

Calculate Aqueous Stability on materialsproject.org.
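The three kinds of precondition can be sketched as three separate checks run in order. This is an illustrative sketch, not the Materials Project implementation: the dictionary keys, the function names, and the use of a negative concentration as a stand-in for “physically implausible” are all assumptions; only the 0-to-1 range, the sum-to-1 rule, and the four-element limit come from the examples above.

```python
from dataclasses import dataclass


@dataclass
class CalculateAqueousStability:
    composition: dict         # element symbol -> fraction between 0 and 1
    ion_concentrations: dict  # ion label -> concentration


MAX_ELEMENTS = 4  # pragmatic limit from the example in the text


def _is_number(value) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_syntax(raw: dict) -> CalculateAqueousStability:
    """Well-formedness: pairs of string and number, fractions in [0, 1]."""
    comp = raw.get("composition")
    ions = raw.get("ion_concentrations")
    if not isinstance(comp, dict) or not isinstance(ions, dict):
        raise ValueError("syntax: both sets of pairs are required")
    for key, value in list(comp.items()) + list(ions.items()):
        if not isinstance(key, str) or not _is_number(value):
            raise ValueError("syntax: each pair is a string and a number")
    if any(not 0 <= v <= 1 for v in comp.values()):
        raise ValueError("syntax: fractions must lie between 0 and 1")
    return CalculateAqueousStability(dict(comp), dict(ions))


def check_semantics(cmd: CalculateAqueousStability) -> None:
    """Meaning: well-formed input can still be nonsense."""
    if abs(sum(cmd.composition.values()) - 1.0) > 1e-6:
        raise ValueError("semantics: composition must sum to 1")
    if any(c < 0 for c in cmd.ion_concentrations.values()):
        raise ValueError("semantics: concentrations cannot be negative")


def check_pragmatics(cmd: CalculateAqueousStability) -> None:
    """Context: is this command appropriate for this service right now?"""
    if len(cmd.composition) > MAX_ELEMENTS:
        raise ValueError("pragmatics: too many elements for on-the-fly use")


def validate(raw: dict) -> CalculateAqueousStability:
    cmd = check_syntax(raw)
    check_semantics(cmd)
    check_pragmatics(cmd)
    return cmd
```

Keeping the checks separate lets each failure be reported at the right level: a client can fix a syntax error on its own, whereas a pragmatics rejection depends on the service’s current limits.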

H. J. W. Percival and R. G. Gregory, Architecture patterns with Python: enabling test-driven development, domain-driven design, and event-driven microservices, First edition, pp. 255-264. O’Reilly, 2020. ↩︎

High-Precision Content Classification Using Hierarchy

2022-07-15T08:58:09-04:00

Content classification is the most fundamental form of holistic content understanding. It helps make your resources findable (F2) and connects them to other resources (I3).

Content understanding represents each piece of content in the index. Relevance of content is a function of query and content understanding. Query understanding represents each search query as a search intent.

Classification maps a document to one or more predefined categories. We can do so using hand-tuned rules or machine learning.¹ The categories can be a flat list, or they can be arranged in a hierarchical (single-hierarchy or faceted) taxonomy².

If the categories are hierarchical and broadly applicable (I1), then a classifier might take advantage of the hierarchy and more confidently map content to a non-leaf category (e.g., mapping a material to “Semiconductor” rather than “High-Gap Semiconductor” or “III-V Semiconductor”). In general, it’s best to map value objects and entities to leaf categories.

Reducing the number of labels substantially improves the precision of a classifier. But filtering out infrequent labels decreases coverage, and it’s not clear that out-of-scope examples will be recognized in production (F4). A more robust approach is to leverage the hierarchical nature of a taxonomy and roll up infrequently used labels to their parent or other ancestor categories.
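A minimal sketch of such a roll-up, assuming the taxonomy is available as a child-to-parent mapping. The `PARENT` table and the `min_count` threshold are hypothetical; the category names reuse the semiconductor example above.

```python
from collections import Counter

# Hypothetical single-hierarchy taxonomy: child -> parent.
PARENT = {
    "High-Gap Semiconductor": "Semiconductor",
    "III-V Semiconductor": "Semiconductor",
    "Semiconductor": "Material",
    "Metal": "Material",
}


def depth(label, parent):
    """Number of ancestors between a label and the root."""
    d = 0
    while label in parent:
        label = parent[label]
        d += 1
    return d


def roll_up(labels, parent, min_count):
    """Map each infrequent label to its nearest frequent-enough ancestor."""
    counts = Counter(labels)
    # Visit every label and all of its ancestors, deepest first, so that
    # children are merged into a parent before the parent itself is judged.
    nodes = set(counts)
    for label in list(nodes):
        while label in parent:
            label = parent[label]
            nodes.add(label)
    merged_into = {}
    for label in sorted(nodes, key=lambda n: depth(n, parent), reverse=True):
        if 0 < counts[label] < min_count and label in parent:
            counts[parent[label]] += counts[label]
            merged_into[label] = parent[label]

    def resolve(label):
        while label in merged_into:
            label = merged_into[label]
        return label

    return [resolve(label) for label in labels]
```

With one example each of the two leaf categories and `min_count=2`, both are relabeled as “Semiconductor”, which then has enough examples to keep.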

G. Ingersoll and D. Tunkelang, “Course Notes for ‘Search with Machine Learning.’” Corise Education, Jun. 20, 2022. [Online]. Available: https://corise.com/course/search-with-machine-learning/ ↩︎
D. Tunkelang, “Taxonomies and Ontologies,” Medium, Aug. 30, 2017. https://queryunderstanding.com/taxonomies-and-ontologies-8e4812a79cb2 (accessed Jul. 15, 2022). ↩︎

Taxonomy Pruning for Query Classification

2022-07-14T13:39:05-04:00

When providing a search interface (F4), you can improve precision significantly by classifying a user’s query, assuming you are able to classify your content.

If you have a category taxonomy and labeled queries, you can train a classifier in order to dynamically assign a category to a query. A benefit of taxonomic hierarchy is that, while a labeled query may be labeled with a leaf node of the taxonomy, you can prune, i.e. “roll up”, the taxonomy to ensure sufficient signal for training. This helps to maintain recall when filtering query results by the query’s classification.
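One simple way to prune, sketched here under the assumption that each query label is stored as a full category path, is to truncate every path to a fixed depth before training. The queries, paths, separator, and depth below are hypothetical.

```python
def prune_label(path: str, max_depth: int, sep: str = " > ") -> str:
    """Keep only the first max_depth levels of a category path."""
    return sep.join(path.split(sep)[:max_depth])


# Hypothetical labeled queries: (query text, leaf category path).
labeled_queries = [
    ("gan band gap", "Material > Semiconductor > III-V Semiconductor"),
    ("diamond band gap", "Material > Semiconductor > High-Gap Semiconductor"),
    ("copper conductivity", "Material > Metal"),
]

# Training pairs after rolling every label up to depth 2.
training_pairs = [
    (query, prune_label(path, max_depth=2)) for query, path in labeled_queries
]
```

Both band-gap queries now share the label “Material > Semiconductor”, so that category has twice the training signal of either leaf.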

An Objective Function for Code Refactoring

2022-07-08T08:29:31-04:00

Have you ever set an objective function for code refactoring, where, for every proposed total change (e.g. reviewable pull request), you seek to maximize the change in this function? An example:¹

\[ \frac{\log_2(\mathrm{pct}_{\mathrm{LOC\_tested}}) \cdot \mathrm{pct}_{\mathrm{importables\_documented}} \cdot \mathrm{pct}_{\mathrm{LOC\_nostate}}}{n_{\mathrm{LOC}}} \]

Good (numerator stuff):

Percent Lines of Code (LOC) covered by a test. Sublinear growth here, i.e. diminishing returns on “getting to 100%”. An off-the-shelf tool like Coverage.py will be fine here.
Percent “public” units, i.e. non-underscored module importables - functions, classes, constants, variables/objects, covered by tutorial/how-to-guide/explanation (so excluding reference) documentation. Maximizes code consumers’ ability to understand functionality without having to dive into the codebase. Prior art on measuring this here.
Percent LOC unaffected by state (i.e. avoiding getting values from or calling methods on long-living references). Pure-functional code is easier to reason about (e.g. via a simple substitution model of execution) and thus more maintainable. My strategy for measuring this would be to designate certain (sub)packages/modules as purely functional.

Bad (denominator stuff):

total LOC
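As a rough sketch, the example objective could be computed like this. The function and parameter names are made up for illustration, and the sketch assumes percentages on a 0 to 100 scale, so that the logarithm is positive for any coverage above 1%.

```python
import math


def refactoring_score(pct_loc_tested: float,
                      pct_importables_documented: float,
                      pct_loc_nostate: float,
                      n_loc: int) -> float:
    """Objective to maximize; percentages are assumed to be on a 0-100 scale."""
    if n_loc <= 0:
        raise ValueError("n_loc must be positive")
    if pct_loc_tested <= 0:
        raise ValueError("log2 is undefined for zero test coverage")
    numerator = (math.log2(pct_loc_tested)
                 * pct_importables_documented
                 * pct_loc_nostate)
    return numerator / n_loc


# A proposed change is an improvement if it raises the score.
before = refactoring_score(32, 50, 50, 1000)
after = refactoring_score(64, 50, 50, 1000)
improved = after > before
```

Because coverage enters through a logarithm, doubling it from 32% to 64% raises the numerator by only one fifth, which matches the diminishing-returns intent.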

What metrics correlate with code-refactoring success in your experience? These? Others?

If the equation doesn’t render for you: ↩︎

Complexity Is Carbon

2022-07-06T11:37:32-04:00

Some energy infrastructure emits carbon. Some data infrastructure emits complexity.

There is essential carbon emission, like humans exhaling CO₂. And there is incidental, non-essential carbon emission, like humans burning fossil fuels.

There is essential complexity in data (and software code), like that pertaining to modeling your subject matter and your application domain. And there is incidental complexity – “incidental is Latin for your fault.”¹

How can we eliminate incidental carbon emissions from energy infrastructure? Electrify everything.²

How can we eliminate incidental complexity emissions from data infrastructure? Triplify everything.³

Rich Hickey, “Simple Made Easy”, Strange Loop conference (2011). (transcript). ↩︎
S. Griffith, Electrify: an optimist’s playbook for our clean energy future. Cambridge, Massachusetts: The MIT Press, 2021. ↩︎
G. Schreiber and Y. Raimond, “RDF 1.1 Primer.” World Wide Web Consortium (W3C), Jun. 24, 2014. [Online]. Available: http://www.w3.org/TR/rdf11-primer/ ↩︎

These Are All Just Persistent URLs, No?

2022-07-05T09:06:03-04:00

I am beginning to walk through each question of the FAIR Implementation Profile (FIP) Ontology. My goal is to construct and share a populated model of people’s articulations – aka declarations – of choices they’ve made or challenges they face in addressing each question, as well as the considerations they associate with any such choice or challenge.

The first question for which I’m seeking declarations is F1-D:

What globally unique, persistent, resolvable identifiers do you use for datasets?

I’ve gotten some great responses so far, mostly about people choosing to use the Handle (incl. DOI) or ARK systems.

I got a great question from my former group-mate Shyam Dwaraknath:

In the end these are all just persistent URLs no?

For all intents and purposes, yes. Practically, if you don’t give someone a resolving¹ HTTP(S) URL, such that they can Locate and retrieve the Resource given a Uniform Identifier (i.e., URI\(\implies\)URL), they should be able to straightforwardly construct one.

Handles and ARKs use their compact forms to communicate

an intention of persistence, and (related to this)
a URL-construction protocol in case they are

(a) not communicated as URLs, or

(b) they are, but the URLs don’t resolve.

If you see e.g. 10.1038/sdata.2016.18 somewhere, the hope is you will grok that the \d+(\.\d+)+/.+ pattern (period-delimited numbers, then a /, then stuff) is likely a Handle, so you will try putting https://doi.org/ or https://hdl.handle.net/ before it. Either you already know of a well-known public Handle HTTP proxy server, or you search around for “Handle proxy server”. You’ll also see doi:10.1038/sdata.2016.18 sometimes. Same principle. The hope is you know how to URLify it trivially.
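A minimal sketch of that URL-construction step, using a pattern for period-delimited numbers, then a slash, then anything. The function name is hypothetical, and the sketch does not percent-encode unusual characters in the local name, which a robust version would need to do.

```python
import re

# Optional "doi:" label, then period-delimited numbers, a slash, and the rest.
HANDLE_PATTERN = re.compile(r"^(?:doi:)?(\d+(?:\.\d+)+/.+)$", re.IGNORECASE)


def handle_to_url(text: str, proxy: str = "https://hdl.handle.net/"):
    """Return a proxy URL if text looks like a Handle, else None."""
    match = HANDLE_PATTERN.match(text.strip())
    if match is None:
        return None
    return proxy + match.group(1)


url = handle_to_url("doi:10.1038/sdata.2016.18", proxy="https://doi.org/")
```

Passing a different `proxy` is the fallback when one public proxy server is unavailable.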

The form of an ARK is similar in intent. The hope is that if you see e.g. ark:57802/dw0/agu/6045 somewhere (for ARKs, the ark: prefix is part of the ID form, even in URL paths), you’ll think “this ID is intended to be persistent – an archival resource key” and “I hope some name mapping authority (NMA) is publicly resolving ark:57802 IDs”. The well-known public ARK HTTP Proxy is https://n2t.net, and e.g. https://n2t.net/ark:57802/dw0/agu/6045 passes through to https://ns.polyneme.xyz/ark:57802/dw0/agu/6045 because https://ns.polyneme.xyz is registered there as the NMA for the name assigning authority (NAA) ark:57802.
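The same construction for ARKs can be sketched as follows. The function name is hypothetical, and the pattern assumes an all-digit NAAN; it also accepts the older `ark:/` spelling and normalizes it to `ark:`.

```python
import re

# "ark:" label (optionally "ark:/"), a numeric NAAN, a slash, then the name.
ARK_PATTERN = re.compile(r"^ark:/?(\d+)/(.+)$")


def ark_to_url(ark: str, resolver: str = "https://n2t.net/"):
    """Return a resolver URL if ark looks like an ARK, else None."""
    match = ARK_PATTERN.match(ark.strip())
    if match is None:
        return None
    naan, name = match.groups()
    # The ark: label stays part of the identifier, even in the URL path.
    return f"{resolver}ark:{naan}/{name}"


via_proxy = ark_to_url("ark:57802/dw0/agu/6045")
via_nma = ark_to_url("ark:57802/dw0/agu/6045", resolver="https://ns.polyneme.xyz/")
```

The default goes through the public proxy, while the second call addresses the name mapping authority directly.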

Other persistent ID systems that imply/offer HTTP URLs have tighter coupling to the DNS domain responsible for resolving the IDs. Some of these systems are intended for general use, such as https://purl.org/ and https://w3id.org/.

In these systems, prefixes are not allocated like with Handles or ARKs, and there is no emphasis on prefixes being semantically opaque so as to increase the likelihood of continued commitment to persistence if/when stewarding organizations change names. Rather, prefixes are claimed, like http://purl.org/dc (serving e.g. http://purl.org/dc/terms) and https://purl.org/dw (serving e.g. https://purl.org/dw/squirrel), or https://w3id.org/nmdc (where currently, all path extensions, e.g. https://w3id.org/nmdc/nmdc-schema, resolve to the same page).

Other DNS-coupled systems are socially positioned as providing specific types of persistent identifiers. Such systems include the World Wide Web Consortium (W3C) https://w3.org/ namespace for standards (e.g. https://w3.org/ns/dcat), the Open Researcher and Contributor ID (ORCID) https://orcid.org/ (e.g. https://orcid.org/0000-0002-8424-0604), the International Generic Sample Number (IGSN) https://igsn.org (e.g. https://igsn.org/IEWFS0001), and the Research Organization Registry (ROR) https://ror.org/ (e.g. https://ror.org/02jbv0t02).

If/when any such special-purpose, domain-name-tied system cannot fulfill persistence, it is hoped that there will be (a) an adopter organization and (b) sufficient signage (e.g. minimal maintenance of the old domain as a static notice) to enable programmatic workarounds, like the case of the Global Researcher Identifier Database (GRID) https://grid.ac/ being passed to ROR for stewardship.

Any HTTP URL is technically resolvable. Whether it actually resolves in response to an HTTP request is a matter of service. ↩︎

The ARK System of Persistent Identifiers (PIDs)

2022-07-01T16:00:01-04:00

The Archival Resource Key (ARK) system is an alternative to the Handle system to satisfy FAIR’s F1 Principle.

Similar to the Handle system, naming authority for ARKs is distributed by allotting prefixes. However, there is no “pre-prefix” administration via a small number of credentialed multi-primary administrators, and there is currently no fee per allotted prefix, called a Name Assigning Authority Number (NAAN).

Another difference between the Handle and ARK systems is in distinguishing between a name assigning authority (NAA), i.e. identifier minting, and a name mapping authority (NMA), i.e. identifier resolution. With the Handle system, NAA and NMA functions are administered by the same organization. With the ARK system, an NAA may be its own NMA, may migrate from one NMA to another, or may have multiple NMA service providers.

For more on ARKs, see my post on Object Persistence: A Matter of Service, the most recent specification, and the ARK Alliance website.

The Handle System of Persistent Identifiers

2022-06-30T08:52:40-04:00

The Handle system is a popular choice for the assignment and resolution of globally unique, persistent identifiers. Governance is centralized with the DONA Foundation, and administration is distributed among so-called Credentialed Multi-Primary Administrators (MPAs), of which there are currently nine. You’ve likely heard of at least one MPA: the International DOI Foundation.

Each MPA is assigned a number. The DOI Foundation has 10. This is why all DOIs begin with 10.. Each MPA can in turn give a “complete” prefix (everything before the /) to a so-called “naming authority”. The DOI Foundation¹ gave the Nature Publishing Group (now Springer Nature) 10.1038, for example, who in turn can create as many local names as they’d like, such as 10.1038/sdata.2016.18.
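To make the anatomy concrete, here is a small sketch that splits a handle into its parts. The function and key names are hypothetical. Note that the first period-delimited number is not always the number of the MPA currently administering the prefix.

```python
def split_handle(handle: str) -> dict:
    """Split '<prefix>/<local name>' into its named parts."""
    prefix, slash, local_name = handle.partition("/")
    if not slash or not prefix or not local_name:
        raise ValueError("expected '<prefix>/<local name>'")
    return {
        "first_number": prefix.split(".")[0],  # e.g. 10 for the DOI Foundation
        "prefix": prefix,                      # everything before the first /
        "local_name": local_name,              # chosen by the naming authority
        "is_doi": prefix.startswith("10."),    # DOIs are Handles starting with 10.
    }


parts = split_handle("10.1038/sdata.2016.18")
```

Only the first slash separates prefix from local name, so local names may themselves contain slashes.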

How do handles get resolved? Each handle prefix may have its own administrator, and administration of handles is distributed, similar to the Domain Name System (DNS). The Handle system is compatible with DNS, but does not require it. In practice, there are known public HTTP proxy servers such as https://hdl.handle.net/ and https://doi.org/ that allow resolution of handles as URLs. Hence, https://doi.org/10.1038/sdata.2016.18 is resolvable.

Another big MPA is the Corporation for National Research Initiatives (CNRI). CNRI governed the Handle system before passing it off to the formed-for-this-purpose DONA Foundation in 2015. Before this, CNRI assigned MPA-esque numbers to a bunch of organizations, and these continue to be administered by the CNRI-as-MPA, even though its assigned number is 20 now. For example, CNRI assigned 1721.1 to MIT, which is used for its DSpace repository. My PhD thesis was assigned 1721.1/71495. So, https://hdl.handle.net/1721.1/71495 and https://doi.org/1721.1/71495² (and https://dspace.mit.edu/handle/1721.1/71495) all get you to it.

You can inspect Handle prefix records, which are analogous to DNS records, via https://hdl.handle.net/. For example, https://hdl.handle.net/1721.1 lets you know that this prefix is administered by MIT DSpace via the CNRI MPA (see the /20.ADMIN-containing HS_ADMIN entry).

So how do you start minting and resolving Handles?

Become a credentialed MPA? I don’t know, that seems hard for an individual researcher. There are only nine credentialed by DONA.

Request a complete prefix from an existing MPA, e.g. something that matches 10.\d+ from the DOI Foundation? Yes, you can do that. MPAs typically charge registration and annual service fees per allotted prefix (i.e., the whole .-delimited number before the / in a handle). In the case of the DOI Foundation, they delegate to e.g. Crossref to assign 10. prefixes. In this case, for additional fees, Crossref will resolve identifiers for you (beyond assigning you a prefix to mint as many as you’d like).

A final method is to find a service provider that has a complete prefix and will let you mint handles under their prefix, or will mint them for you. This is the most typical route for researchers. For example, Zenodo got 10.5281 from DataCite (another 10.\d+ service provider the DOI Foundation delegates to), and they’ll give you a full handle when you upload stuff to https://zenodo.org. ResearchEquals got 10.53962 from Crossref, and they’ll give you one for anything you put on https://www.researchequals.com/. And of course, journal publishers typically give you one when you publish an article with them.

Actually, one of its registration agencies (RAs), Crossref. The DOI Foundation doesn’t give out prefixes directly. Individuals request prefixes from RAs, not from the DOI Foundation. Thank you Ed Pentz for clarifying this. [footnote added 2022-07-01] ↩︎
Wait, what? It’s a DOI? Nope. DOIs are Handles that start with 10.. https://doi.org/ is (currently) a public HTTP proxy server that resolves all Handles, regardless of prefix. ↩︎

What globally unique, persistent, resolvable identifiers do you use for datasets?

2022-06-29T10:25:41-04:00

What globally unique, persistent, resolvable identifiers do you use for datasets? I want to know about either (a) a challenge you’re facing, and what you’ve tried; or (b) a choice you recently made, and how it’s going.

Context: For each question of the FAIR Implementation Profile (FIP) Ontology, I want to collect and discuss folks’ choices and challenges on my podcast, Machine-Centric Science.

Please email me at podcast@polyneme.xyz with:

an email subject of either (a) “FIP F1-D challenge” or (b) “FIP F1-D choice”,
(preferred) an email attachment of a one-minute-max audio recording so that you can asynchronously mini-guest on the podcast 🙂,
an email body that at minimum says either (a) “CC0” or (b) “CC-BY …”, where “…” is how you wish to be attributed (e.g. your name, your name and location, your name and affiliation, etc.).

You may also write out your challenge or choice in the email body, in which case I will read it aloud. If you choose a CC0 license, I will by default keep you anonymous unless you give me attribution info.

Extra credit:

Use an Open Researcher and Contributor ID (ORCID) for your attribution info. Obtaining one is free.
If applicable, give me an ANZSRC-2020-FoR code (the browsable tree takes a second or two to load) for your typical field of research, whether two-digit (most broad), four-digit (narrower), or six-digit (narrowest in this scheme).

If you really want to send me multiple choices and/or challenges, please send them as separate emails. Also, feel free to spread this message far and wide. Thanks!

Findability → Known-Item Search, Discoverability → Exploratory Search?

2022-06-28T10:15:40-04:00

I keep confusing findability and discoverability. It seems that findability is often equated to known-item search, and discoverability to exploratory search.

Known-item search is compatible with “instant search”, aka search-as-you-type interfaces. Exploratory search is compatible with “autocomplete” (incl. re-spelling, infix matching, synonym substitution, etc.) interfaces.

Recommendation can be a part of exploratory search, i.e. bundled in response to a user’s pulling for relevant information. It can also be pushed independently of deliberately registered search intent – via notifications, email digests, etc.

Does the latter activity – the pushy one – “count” for discoverability? I can imagine such activity being framed as periodic re-running of an exploratory-search query on behalf of a user, with query-independent factors for retrieval and ranking being varied over time.

Is an Ontology 'better' than a Relational Data Model?

2022-06-27T10:40:08-04:00

Is an ontology “better” than a relational data model? “More expressive power” doesn’t always mean “better”. However, ontologies allow you to ratchet up power while keeping logic in data structures.

By “relational data model”, folks typically mean “SQL model”. In the RDF world, this is roughly on par with a SHACL model, i.e. a model that expresses constraints on the shapes of entities and on the so-called “primitive” types of their properties/attributes/columns/fields (string, boolean, integer, etc.). Both SHACL and SQL can set the ranges of properties to be “reference” types, which is indirect in SQL through primitive-typed (usually an integer or string) foreign keys.

An ontology language allows for more expressive data modeling than shape and attribute validation, while staying at the level of declarative data description. In the RDF world, OWL lets you express notions of commonality and variability familiar from object-oriented programming such as classes, subclasses, and properties – you don’t need a software-defined object-relational mapping (ORM) layer. You can also express certain constraints for and between classes, entities (individuals), and properties.

There’s nothing you can express using ontologies that you cannot also express using a SQL data model plus a general programming language, or just a programming language. So why declaratively model data at all? Why SQL then and not just CSV files if you’re going to load the data into Python et al. anyway? The rule of least power (https://en.wikipedia.org/wiki/Rule_of_least_power). Ontology languages give you more expressive power than shape-constraint languages while reducing the risk of non-reusability of your modeling logic for unforeseen applications.

Subscribe to get short notes like this on Machine-Centric Science delivered to your email.

Leave Beacons in Code

2022-06-24T11:51:48-04:00

Leave beacons in your code. I would have avoided a silly error if a variable named xgb_train_data had instead been named, for example, xgb_train_data_filepath.

When you can’t leave globally unique, persistent, resolvable identifiers (GUPRIs), mind your beacons.
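A small illustration of the kind of beacon I mean, as a sketch only: the function name and the CSV content are hypothetical, and the original error involved different code. The suffix on the variable tells the reader that it holds a path, not the loaded data.

```python
import tempfile
from pathlib import Path


def count_data_rows(xgb_train_data_filepath):
    """Count the rows in a CSV file, not counting the header line."""
    with open(xgb_train_data_filepath) as f:
        return max(sum(1 for _ in f) - 1, 0)


with tempfile.TemporaryDirectory() as tmpdir:
    # The "_filepath" suffix is the beacon: this is a location, not a table.
    xgb_train_data_filepath = Path(tmpdir) / "train.csv"
    xgb_train_data_filepath.write_text("feature,label\n1.0,0\n2.0,1\n")
    n_rows = count_data_rows(xgb_train_data_filepath)

print(n_rows)
```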

References:

F. Hermans, The Programmer’s brain: what every programmer needs to know about cognition, pp. 28-30. Shelter Island, NY: Manning, 2021.
M. Crosby, J. Scholtz, and S. Wiedenbeck, “The roles beacons play in comprehension for novice and expert programmers,” Jul. 2002, [Online]. Available: https://www.researchgate.net/publication/228592285_The_roles_beacons_play_in_comprehension_for_novice_and_expert_programmers

CFF for Machine-Actionable Software Citations

2022-06-23T10:43:21-04:00

Add a CITATION.cff file to your git repository. The Citation File Format is automatically rendered on GitHub and usable by Zenodo and Zotero.
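To give a feel for how small such a file can be, here is a sketch that renders a minimal CITATION.cff body from Python. The helper name, the example title, the author, and the DOI are placeholders I made up; the keys used (cff-version, message, title, authors, doi) are from CFF 1.2.0. JSON-style quoting is used for strings on the assumption that it is acceptable as YAML double-quoted text.

```python
import json


def minimal_cff(title, authors, doi=None):
    """Render a minimal CITATION.cff body as YAML text.

    `authors` is a list of (family_names, given_names) pairs.
    """
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f"title: {json.dumps(title)}",
        "authors:",
    ]
    for family, given in authors:
        lines.append(f"  - family-names: {json.dumps(family)}")
        lines.append(f"    given-names: {json.dumps(given)}")
    if doi:
        lines.append(f"doi: {doi}")
    return "\n".join(lines) + "\n"


# Placeholder values, not a real record.
cff_text = minimal_cff(
    "My Research Tool", [("Doe", "Jane")], doi="10.5281/zenodo.0000000"
)
print(cff_text)
```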

Already have a DOI? Let’s see about a DOI-to-CFF tool. Looks like there’s doi2cff, but it’s currently restricted to DOIs on Zenodo that are tagged as software releases.

So, this could work:

pip install git+https://github.com/citation-file-format/doi2cff
doi2cff init 10.5281/zenodo.6591863

CITATION.cff file has been written

Cool. But what if you want to cite something else? Let’s go from DOI to BibTeX, and BibTeX to CFF.

Let’s get BibTeX via content negotiation:

curl -LH "Accept: application/x-bibtex" \
    https://doi.org/10.5281/zenodo.5570279 \
    >> refs.bib

# cat refs.bib
@article{https://doi.org/10.5281/zenodo.5570279,
  doi = {10.5281/ZENODO.5570279},
  url = {https://zenodo.org/record/5570279},
  author = {Canon, Shane and Christianson, Danielle and Duncan, William and Eloe-Fadrosh, Emiley and Fagnan, Kjiersten and Hays, David and Huntemann, Marcel and Lebedeva, Sofya and Miller, Kayd and Miller, Mark and Mouncey, Nigel and Mungall, Chris and Reddy, Tbk and Rudolph, Marisa and Sarrafan, Setareh and Sundaramurthi, Jagadish Chandrabose and Unni, Deepak and Vangay, Pajau and Wood-Charlson, Elisha and Ahmed, Faiza and Baumes, Jeffrey and Davis, Brandon and Anubhav, Fnu and Borkhum, Mark and Bramer, Lisa and Corilo, Yuri and Lipton, Mary and Mans, Douglas and McCue, Lee Ann and Millard, David and Piehowski, Paul and Prymolenna, Anastasiya and Purvine, Samuel and Richardson, Rachel and Smith, Montana and Stratton, Kelly and Babinski, Michal and Chain, Patrick and Davenport, Karen and Flynn, Mark and Hu, Bin and Kelliher, Julia and Li, Po-E and Lo, Chien-Chi and Jackson, Elais Player and Shakya, Migun and Xu, Yan and Drake, Meghan and Martin, Stanton and Wilson, Bruce and Winston, Donny},
  keywords = {microbiome, data science, data infrastructure, science gateway},
  title = {The National Microbiome Data Collaborative: a data science ecosystem for microbiome research},
  publisher = {Zenodo},
  year = {2021},
  copyright = {Creative Commons Attribution 4.0 International}
}
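If you would rather stay in Python than shell out to curl, the same content negotiation can be sketched with the standard library. The function names are mine; the only thing carried over from the curl call is the Accept header and the doi.org URL. Only the request construction is exercised here, so nothing touches the network until fetch_bibtex is called.

```python
import urllib.request


def bibtex_request(doi):
    """Build (but do not send) a request asking doi.org for BibTeX."""
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/x-bibtex"},
    )


def fetch_bibtex(doi, timeout=30):
    """Send the request; urllib follows redirects by default."""
    with urllib.request.urlopen(bibtex_request(doi), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


req = bibtex_request("10.5281/zenodo.5570279")
print(req.full_url, req.get_header("Accept"))
```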

Sweet. There is a command-line tool for converting CFF to other common formats, but that’s not what we want here. Ah, here we go: bibtex-to-cff.

To save you a bit of hassle, I’ve packaged this PHP tool as a Docker image, polyneme/bibtex-to-cff. Also, it can’t (currently) handle URLs as BibTeX entry ids, so for this case I changed @article{https://doi.org/10.5281/zenodo.5570279, in the above example to @article{10.5281.zenodo.5570279, (it seems fine with periods).
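The id rewrite can be scripted if you have more than one entry. This is a sketch with a hypothetical helper name, and it assumes entry ids of the form https://doi.org/... followed by a comma, as in the example above.

```python
import re


def strip_url_entry_ids(bibtex_text):
    """Turn entry ids that are doi.org URLs into dotted DOI strings."""
    pattern = re.compile(r"(@\w+\{)https?://(?:dx\.)?doi\.org/([^,\s]+)(,)")

    def replace(match):
        prefix, doi, comma = match.groups()
        # Slashes become periods, which the converter seems fine with.
        return prefix + doi.replace("/", ".") + comma

    return pattern.sub(replace, bibtex_text)


example = "@article{https://doi.org/10.5281/zenodo.5570279,\n  year = {2021}\n}"
fixed = strip_url_entry_ids(example)
print(fixed.splitlines()[0])
```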

Here goes:

docker run --rm \
    -v "$(pwd)":/usr/src/app/scratch \
    polyneme/bibtex-to-cff \
    scratch/refs.bib --id 10.5281.zenodo.5570279 \
    > CITATION.cff

Done. Add this to the root of your git repo, and congratulate yourself for including machine-actionable citation metadata with your software.

You can build the image yourself by cloning the monperrus/bibtexbrowser GitHub repo, adding the following to a new Dockerfile in the repo directory:

FROM php:7.4-cli
COPY . /usr/src/app
WORKDIR /usr/src/app
ENTRYPOINT [ "php", "bibtex-to-cff.php" ]

and then running docker build -t bibtex-to-cff . in that directory (the trailing dot is the build context).

PageRank of Linked Open Vocabularies (LOV)

2022-06-15T13:28:17-04:00

Datasets are easier to reuse if they use standards that are well-established, particularly in a given domain.

A first approach is to ask around – ask people with whom you coauthor, people you trust in your field, etc.

A follow-on approach is to examine the “graph reputation” of relevant standards, particularly if they may be represented as resources with outbound links. We can use the PageRank algorithm, which Google introduced to rank pages in the web of documents.

As an example, here I outline an initial approach to find the “most reputable” of Linked Open Vocabularies’ 778 vocabularies.

My starting point is having the API responses for each vocabulary so that lov is a list of dicts, each with keys url: str and api_response: dict.

Collect all outbound links:

# Gather every link listed under any "rel*" field of any version.
for entry in lov:
    entry["outbound_links"] = entry.get("outbound_links", set())
    for version in entry["api_response"].get("versions", []):
        for field, value in version.items():
            if field.startswith("rel") and isinstance(value, list):
                entry["outbound_links"] |= set(value)

Prepare a stream of self_link, outbound_link pairs:

with open("lov-outlinks.csv",'w') as f:
    for entry in lov:
        url = entry["url"]
        for link_url in entry["outbound_links"]:
            f.write(f"{url},{link_url}\n")

In a file, e.g. lov_pagerank.py:¹

if __name__ == "__main__": # for `spark-submit`
    sc = SparkContext(appName="LovRankings")
    match_data = sc.textFile("lov-outlinks.csv")

    xs = match_data.map(get_linking).groupByKey().mapValues(initialize_for_voting)

    for i in range(20):
        if i > 0:
            xs = sc.parallelize(zs.items())
        acc = dict(xs.mapValues(empty_ratings).collect())
        zs = xs.aggregate(acc, allocate_points, combine_ratings)

    ratings = [(k, v["rating"]) for k, v in zs.items()]
    for i, (vocab, rating) in enumerate(
        sorted(ratings, key=lambda x: x[1], reverse=True)[:100]
    ):
        print("{:3}\t{:6}\t{}".format(i + 1, round(log2(rating + 1), 1), vocab))

where, above it:

from math import log2
from pyspark import SparkContext
from toolz import assoc


def get_linking(line):
    return line.split(",")


def initialize_for_voting(outlinks):
    return {"outlinks": outlinks, "n_outlinks": len(outlinks), "rating": 100}


def empty_ratings(d):
    return assoc(d, "rating", 0)


def allocate_points(acc, new):
    _, v = new
    boost = v["rating"] / (v["n_outlinks"] + 0.01)
    for link in v["outlinks"]:
        if link not in acc.keys():
            acc[link] = {"outlinks": [], "n_outlinks": 0}
        link_rating = acc.get(link, {}).get("rating", 0)
        acc[link]["rating"] = link_rating + boost
    return acc


def combine_ratings(a, b):
    for k, v in b.items():
        try:
            a[k]["rating"] = a[k]["rating"] + b[k]["rating"]
        except KeyError:
            a[k] = v
    return a

And here is the output of spark-submit lov_pagerank.py:

  1       10.6  http://purl.org/dc/elements/1.1/
  2       10.3  http://www.w3.org/2000/01/rdf-schema#
  3       10.3  http://www.w3.org/1999/02/22-rdf-syntax-ns#
  4        9.0  http://www.w3.org/2004/02/skos/core#
  5        8.9  http://purl.org/dc/terms/
  6        6.3  http://xmlns.com/foaf/0.1/
  7        6.3  http://www.w3.org/2002/07/owl#
  8        6.3  http://purl.org/dc/dcmitype/
...

We can see at a glance the “most reputable” vocabularies, and they don’t surprise me. What may be more helpful is to collect candidate vocabularies for your domain and focus on their relative scores in order to gauge whether any are “well-established” in a sense. Even more helpful may be to include multiple “types” of resources – with standards linking to and being linked from various databases and policies. FAIRsharing seems like it could eventually support open investigation of the latter kind.
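If you want to try the vote-passing idea without Spark, here is a simplified, single-process sketch. It is not the script above: it divides each rating evenly among outlinks with no smoothing term and, like the script, uses no damping factor. The function name and the toy vocabulary names are made up.

```python
from collections import defaultdict


def vote_ranks(pairs, rounds=20, initial_rating=100.0):
    """Repeatedly pass each node's rating, split evenly, to its outlinks."""
    outlinks = defaultdict(set)
    nodes = set()
    for source, target in pairs:
        outlinks[source].add(target)
        nodes.update((source, target))

    ratings = {node: initial_rating for node in nodes}
    for _ in range(rounds):
        new_ratings = {node: 0.0 for node in nodes}
        for source, targets in outlinks.items():
            share = ratings[source] / len(targets)
            for target in targets:
                new_ratings[target] += share
        ratings = new_ratings
    return ratings


# Hypothetical link pairs in the same (self_link, outbound_link) shape.
toy_pairs = [
    ("vocab:a", "vocab:b"),
    ("vocab:b", "vocab:a"),
    ("vocab:b", "vocab:c"),
    ("vocab:c", "vocab:a"),
    ("vocab:d", "vocab:a"),
]
ratings = vote_ranks(toy_pairs)
for vocab, rating in sorted(ratings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{rating:8.2f}  {vocab}")
```

In this toy graph nothing links to vocab:d, so its rating drops to zero after the first round, while the other three share the total between them.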

Adapted from J. T. Wolohan, Mastering large datasets with Python: parallelize and distribute your Python code. Shelter Island, NY: Manning Publications Co, 2019. ↩︎

Lean Web - Principles of Lean Thinking applied to Web Development

2022-06-09T12:28:09-04:00

Lean manufacturing aims to reduce waste in production processes and to reduce response times to consumers from producers.

Womack and Jones¹ authored five key principles for lean thinking in the context of manufacturing:

Value: Identify the value of a product to a consumer.
Value Stream: Identify the minimal process (steps, time, information, material) to produce the value.
Flow: Make production flow through the steps.
Pull: Pull between the steps (rather than pushing intermediate “inventory” that may not be used).
Perfection: Reduce the number of steps and the amount of time, information, and material needed for production.

Lean software development aims to adapt lean thinking to software development.

The Poppendiecks² authored seven principles that don’t directly provide qualified references to Womack and Jones’s principles. Here, I attempt to align their principles of software development to the framework and terminology of Womack and Jones’s lean thinking principles:

Evaluate Late: Decide on the end-value of a product to a consumer as late as possible. There is one value stream option per end-value option.
Mind Value Stream Multiplicity and Looping: With one value stream per end-value hypothesis, can value streams share structure to eliminate waste? Value streams may have loops (iterations) that must be particularly lean to support a high learning rate.
Flow: Make production flow for fast delivery and thus for rapid learning given the presence of loops in a value stream.
Pull: Pulling between steps empowers the team.
Perfection: Continuous refactoring facilitates ensuring integrity and optimizing the whole.

Now, I can couch a conceptualization of lean principles for web development, i.e. Lean Web principles, with clear lineage to the lean thinking principles for manufacturing and through lean principles for software development:

Evaluate Resources Late: Deal in data for as long as possible. Apply transformation logic later – there are many applications. Apply presentation logic even later – there are many modes of consumption for an application. See also: Perlis’ epigram³: “Functions delay binding; data structures induce binding. Moral: Structure data late in the programming process.”
Mind Value Stream Multiplicity and Looping: Eliminate waste in process steps, time, information (configuration / manual signaling), and material (code, data, storage/compute infrastructure). Can web dev processes share logic? Pay particular attention to waste in value stream loops (iterations).
Flow: Choose continuous integration (CI) and continuous deployment (CD).
Pull: Choose distributed version control for code, data, and storage/compute infrastructure (as code).
Perfection: Can it all fit in your head, to facilitate conceptual integrity and strategic refactoring?
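The first of these principles can be sketched in a few lines. In this made-up example (the records and function names are hypothetical), the data stays plain until the last moment, a transformation is applied for one application, and two presentations are derived only at the end.

```python
import html
import json

# Plain data, kept free of application and presentation concerns.
records = [
    {"title": "Note A", "year": 2021},
    {"title": "Note B & C", "year": 2022},
]


def newest_first(items):
    """Transformation logic: one of many possible applications of the data."""
    return sorted(items, key=lambda item: item["year"], reverse=True)


def as_html(items):
    """Presentation logic for people reading in a browser."""
    rows = "".join(
        f"<li>{html.escape(item['title'])} ({item['year']})</li>" for item in items
    )
    return f"<ul>{rows}</ul>"


def as_json(items):
    """Presentation logic for programs consuming the same result."""
    return json.dumps(items)


ordered = newest_first(records)
html_view = as_html(ordered)
json_view = as_json(ordered)
print(html_view)
print(json_view)
```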

Finally, I am well aware of Chris Ferdinandi and his excellent exposition on Lean Web thinking and its three associated principles. Here’s how I think his principles may map to those above:

Embrace the Platform: This relates to evaluating resources late. Can you exchange data as RDF (e.g. serialized as JSON-LD) over HTTP? Can you exchange logic for inference and validation as RDF data as well, via the RDFS/OWL and SHACL standards of the Web platform? Can you exchange logic for presentation as HTML (templates) and CSS? If your front-end requires operational processes, can that be done using vanilla JavaScript?
Small and Modular: This relates to minding value stream multiplicity and looping. There is a lot of opportunity to eliminate waste and reuse functionality (especially functionality provided by the platform!).
The Web is for Everyone: This relates to evaluating resources late (why prematurely optimize for applications and consumption use cases and thus exclude potential stakeholders?) and pulling (empower people by encouraging them to pull rather than telling them to pick up whatever is pushed).

J. P. Womack and D. T. Jones, Lean thinking: banish waste and create wealth in your corporation. New York, NY: Simon & Schuster, 1996. ↩︎
M. Poppendieck and T. Poppendieck, Lean software development: an agile toolkit. Boston: Addison-Wesley, 2003. ↩︎
A. J. Perlis, “Special Feature: Epigrams on programming,” SIGPLAN Not., vol. 17, no. 9, pp. 7–13, Sep. 1982, doi: 10.1145/947955.1083808. Online at http://www.cs.yale.edu/homes/perlis-alan/quotes.html. ↩︎

Hallucinating Datasets Across Epochal Time

2022-06-09T10:46:13-04:00

“Dataset” is a derived notion, a psychological construct, where “versions” of the dataset are a succession of values that we perceive to be causally related. “Dataset” is a side effect.

Consider Rich Hickey’s epochal time model, which I have written about previously:

Identity is a derived notion, a collecting of values and calling each value a “state”. A state is just a labeling of a value for an identity at a point in “time”. The succession of states is the identity. Identity is a side effect of choosing a timeline of value succession.

Consider drawing a dotted line on the figure above that encompasses all of the immutable values (boxes) and all of the ovals (pure functions). This may be considered the functional phase of a process, where data is transformed (accreted, reduced, or reshaped), separate from the operational (i.e., pull or push) phases of that process.

With this perspective, a process’s functional phase also suggests labeling its succession of values. Each value may be called “data”. That is, a value becomes data when it is an input or output for a process. Depending on where the value is in the topology of the process, it may be considered “raw data” or “derived data” with respect to that process.

What, then, may we call the succession of data values for the timeline of a process – what is the “identity” here, where successive values are “states”? In the OpenLineage specification, the name for this identity is “dataset”. In the Marquez reference implementation of OpenLineage, a “dataset version” is a read-only, immutable version of a dataset, i.e. an immutable value in the sense of the epochal time model.

Thus, “dataset versions” are the states, and “datasets” the identities, for the succession of values associated with the functional phase of a process. To the extent that an immutable digital object – a value – is useful in the functional phase of one or more processes, it is useful to identify it as a “dataset version”. To the extent that a succession of such values, which we perceive to be causally related via a process, is useful in whole or in part to various timelines of various processes, it is useful to identify this succession as a “dataset”.
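Here is a minimal sketch of that distinction in code. It is not the OpenLineage or Marquez API; the class names, the dataset name, and the byte strings are all hypothetical. A version is an immutable value, and the dataset is just a name attached to a succession of such values.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetVersion:
    """A state: an immutable value, identified by a digest of its content."""

    content: bytes

    @property
    def digest(self):
        return hashlib.sha256(self.content).hexdigest()


@dataclass
class Dataset:
    """An identity: a name plus the succession of values observed for it."""

    name: str
    versions: list = field(default_factory=list)

    def record(self, content):
        """Append a new immutable value to this dataset's timeline."""
        version = DatasetVersion(content)
        self.versions.append(version)
        return version

    @property
    def current(self):
        return self.versions[-1]


ds = Dataset("daily_counts")
v1 = ds.record(b"a,1\n")
v2 = ds.record(b"a,1\nb,2\n")
print(ds.name, len(ds.versions), ds.current.digest[:12])
```

Recording a new value never alters an earlier one; only the dataset's notion of "current" moves along the timeline.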

¬ consistent ⇒ ¬ valid ⇒ ¬ accurate

2022-06-03T09:17:13-04:00

If it’s not consistent, it can’t be valid.

If it’s not valid, it can’t be accurate.

If it’s not accurate, who cares if it’s timely?

No amount of tooling, people, or process will transform invalid, inconsistent, inaccurate, untimely data into powerful insights, products, and applications

– https://twitter.com/sarahcat21/status/1532077087250452480

Data tools, people, and titles are ibuprofen so that the stakeholders don’t feel the pain of difficult data. But the pain remains if the source data isn’t addressed.

– https://twitter.com/cavorax/status/1532054828192608257

W3C data recommendations -- there are many!

2022-06-01T09:17:13-04:00

The World Wide Web Consortium (W3C) publishes a range of specifications and guidelines which help move web standards forward.

However, even when restricting scope to the Latest version of specifications with the status Recommendation and with the tag Data, there are currently 77 of them: https://www.w3.org/TR/?tag=data&status=REC&version=latest!

I read through the listing, and here I try to categorize and present a subset of the specifications that I think are most relevant to scientific data management:

description representations, i.e. formal ways to define and communicate data, metadata, and queries:
- Resource Description Framework (RDF)
- SPARQL Protocol and RDF Query Language (SPARQL)
description metamodels, i.e. formal ways to define and communicate models:
- Shapes Constraint Language (SHACL)
- Relational Database to RDF Mapping Language (R2RML)
- RDF Schema
- Web Ontology Language (OWL)
- Rule Interchange Format (RIF)
description models, i.e. models that may be applied directly or may serve as umbrellas for more specialized models:
- Data Catalog Vocabulary (DCAT)
- Provenance Data Model (PROV)
- Simple Knowledge Organization System (SKOS)
- CSV for the Web (CSVW)
- RDF Data Cube Vocabulary
- Organization Ontology
- Open Digital Rights Language (ODRL)

I have left out specifications for serialization, i.e. the text-based appearance of things when viewing/editing them and their formats as files on disk.

Still, 14 specifications is a lot! I’ve tried to list them out in each category in order of roughly decreasing “bang for your learning buck” for typical use cases I’ve encountered.

I’d love to hear from you which, if any, of the specifications above you’ve found useful and/or which you would like to know better (or at all!).

Data Stacks for FAIR

2022-05-30T14:39:34-04:00

I noticed a pattern at the top of each case study listed by Stemma.ai, which provides data catalog software as a service based on the open-source Amundsen code. Each case study’s so-called “Data Stack” comprises up to four distinct categories of functionality – Data Catalog, Data Warehouse, ETL, and Business Intelligence.

The “Data Stack” for each case study:

Case	Data Catalog	Data Warehouse	ETL	Business Intelligence
Lyft	Amundsen	Presto	Apache Airflow	Mode, Apache Superset
Convoy	Stemma	Snowflake	dbt, Apache Airflow	Tableau, Metabase
iRobot	Stemma	Amazon Athena	(blank)	Mode
ING	Amundsen	Trino (formerly, Presto SQL)	(blank)	Apache Superset

These categories struck me in relation to the FAIR Principles¹:

A Data Catalog is about making data Findable.
A Data Warehouse is about making data Accessible.
An ETL platform, aka a Data Orchestration² platform, is about making data Interoperable.
A Business Intelligence (BI) tool is about making data Reusable, aka Repurposeable.

It’s encouraging to see high-level alignment between the FAIR Principles and a conceptualization of useful enterprise data systems in the corporate world.

M. D. Wilkinson et al., “The FAIR Guiding Principles for scientific data management and stewardship,” Sci Data, vol. 3, no. 1, p. 160018, Mar. 2016, doi: 10/bdd4. ↩︎
Although a term I think may be more apt here than Data Orchestration, which has an imperative tone, is Data Reconciliation, which has a declarative tone – see e.g. S. Ryza, “Introducing Software-Defined Assets”, Dagster Blog, Mar. 2022. https://dagster.io/blog/software-defined-assets (accessed May 31, 2022). ↩︎

A Sign Helps You Use It as Though It Were an X

2022-05-29T15:58:15-04:00

Suppose an alien architect has invented a radically new way to go from one room to another…We would never recognize it as a door…All its physical details are wrong. No matter: just superimpose on its exterior some…sign that can remind us of its use. Clothe it in a rectangular shape, or add to it a push-plate lettered EXIT in red and white, and every visitor from the planet Earth will know…just what that pseudoportal’s purpose is.

…There are no doors inside our minds, only connections among our signs.¹

For evolvable data exchange, you need to be able to continually add qualified references galore so that participants can reason by analogy – i.e., each new thing resembles something known before.

This is FAIR principle I3, which depends on I1 and I2 for robustness.

M. Minsky, The Society of Mind. New York: Simon and Schuster, 1986, p. 57. ↩︎

FAIR Principle R1.1: (Meta)data are released with a clear and accessible data usage license

2022-05-29T15:33:35-04:00

I’ve been recording introductions to each of the 15 FAIR Principles and releasing them as episodes of my Machine-Centric Science podcast (https://podcast.polyneme.xyz/).

I just released the 13th one, featuring an overview of various data and code licenses. Listen here.

Full transcript below (but also linked to via the episode landing page):

======

Hello, and welcome to Machine-Centric Science. My name is Donny Winston, and I’m here to talk about the FAIR principles in practice for scientists who want to compound their impacts, not their errors. Today, we’re going to be talking about the 13th of the 15 FAIR principles, R1.1: metadata and data are released with a clear and accessible data usage license.

The license may be different for a data resource and the metadata that describes it. This has implications for indexing of the metadata and findability as well as ultimately using the data. It highlights the need to separate and permalink the data and the metadata.

By default, resources cannot be legally used without clarity in licensing. And furthermore, a license that cannot be found by an agent, a computational agent, is effectively the same as no license at all in a world of automated search and discovery.

There are lots of options in the world of licensing. I will go over the Creative Commons suite of data licenses, and I’ll also go over some code licenses, and relations between them.

Starting from most open, with Creative Commons, there’s CC0, no rights reserved.

After that, we have CC BY – by Attribution. This license lets others distribute, remix, adapt, and build upon your work, even commercially, as long as they credit you for the original creation. It’s the most accommodating of licenses offered that still require attribution.

Beyond that, there’s the CC BY SA – attribution and share-alike. This license lets others use your work, even for commercial purposes, as long as they credit you. And also they need to license their new creations under identical terms. So all new works based on the work will carry the same license. The attribution share-alike is the license used by Wikipedia.

Closing up things a bit, we have attribution-no-derivatives, CC BY ND. This license lets others reuse the work for any purpose, including commercially. However, it cannot be shared with others in adapted form, so you can’t make any changes.

Closing things up a bit more there’s attribution, but noncommercial, so you can use the stuff, but non-commercially. You can provide derivative works, but the derivative work that’s distributed has to be non-commercial.

Further down the line there’s attribution non-commercial share-alike. This lets others remix, adapt, and build upon your work non-commercially, as long as they credit you and license their new creations under identical terms.

And finally, attribution, non-commercial, no-derivatives just allows people to download the work and share it with others as long as they credit you, but they can’t change the work in any way or use it commercially.

So, those are the Creative Commons licenses typically used for data. Then there are the code licenses, and these aren’t quite along the same spectrum of open to closed. Rather, the spectrum is more going from maximizing user freedom to maximizing redistributor freedom.

The most user-free license that I’ve encountered recently that’s in popular use is the Server Side Public License that’s in use by, say, MongoDB. And this is akin to the Creative Commons attribution share-alike license, but with additional sharing. So if someone’s offering this software as a managed service, they have to supply the source code and they also have to supply the source code for all of the service tooling that’s helping them to provide that service, like managed backups, et cetera. So, it goes even beyond the actual code. So it really makes sure that whoever is using the software really can reproduce the use. And so that maximizes that.

A little bit less than that, keeping to the domain of the actual code itself is the Affero GPL, the AGPL. That’s more akin to CC attribution share-alike, where even if you’re distributing the code through a service online, through a managed service, and you’re not actually distributing the source code directly, you still have to supply the source code to the users.

Okay, going down, there’s then the GPL, the GNU General Public License. Here, if you’re offering the software as a service, you don’t have to include source modifications – it’s only if you’re actually distributing the source code.

Then, we have some licenses that are more compartmentalized to the actual portions of the code that’s being reused. So there’s the Lesser GPL and the Mozilla Public License, which are, again, like Creative Commons with attribution and share-alike, but if those licensed components are combined with other software, the user does not have the right to have the source code for all the other components that are necessary to use the system. They only have the right to modifications made to your component that is under the Mozilla Public License or Lesser GPL. But there may be other parts of the software system, as distributed, that are protected, are proprietary.

So this kind of gives a bit more freedom to the redistributor if they have proprietary code that they want to mix in, or pair with, rather, the open software. So it kind of makes the user have a little bit less insight into the total software of the system, but they still have insight into your component that you’ve released under LGPL or Mozilla Public License, MPL.

There is the Business Source License. This is kind of akin to a Creative Commons by attribution, noncommercial license that typically reverts to a by-attribution, commercial-okay license. So, Business Source License, like Sentry’s monitoring service, has an Additional Use Grant, which says, you can use this however you want as long as you do not offer commercially a managed service. So only we, the company that makes this software, can offer a managed service where we offer this to third parties. But as a user, you can do whatever you want internally. You just can’t also be a company that sells this software as a service to other companies.

So again, this offers a lot of user freedom, but has a bit more emphasis on redistributor freedom to clamp down on use. And Elasticsearch’s Elastic License is similar in the sense that a user can sort of do whatever they want with it, but they’re restricted from redistributing it commercially as a service.

Then, going down more towards freedom of the redistributor, we have things like the Apache 2.0 license, which is more like a Creative Commons with attribution, and also adds in some share-alike for contributions. So by default, anyone contributing to an Apache 2 codebase also grants their contributions to be distributed under the same license.

And then, the most so-called permissive licenses are things like the BSD license, Berkeley Software Distribution, and the MIT license, and those are more akin to just Creative Commons attribution. So there’s really no other restrictions or conditions on use, about things that are analogous to share-alike or non-commercial or non-derivative or you can’t offer this as a managed service, or if you do, you need to offer all of the code for your associated tooling for the service. It kind of places the most freedom on someone who has the software and is wanting to redistribute it or repurpose it in some way. So those are generally going to be the most compatible options with everything else.

One other license I wanted to bring up a bit just in the context of this podcast and science – there’s a fun license called the CRAPL. And it’s a quote unquote academic-strength open-source license. I wouldn’t necessarily recommend it, but I want to bring it in here just so that I can compare it to some of the other licenses that I’ve mentioned.

In the software world, I would say that the CRAPL is similar to a Business Source License with no Change License – a business source license, normally after a certain period, like two or three years, will turn into a much more permissive license, so it will become actually Open in the sense of the Open Source Initiative. And the CRAPL has the Additional Use Grant, in the context of a Business Source License, that it is only for validating published claims and validating pre-publication claims. And furthermore, the use grant allows one to publish those claims on the conditions that (1) the original author is notified of the use and claims prior to submission, and (2) that those modifications are released under the CRAPL when the supporting claims are publicly released, say, in a publication.

So, this is also kind of like the Creative Commons by attribution, non-commercial, share-alike, with the additional condition that a good faith attempt is made to notify the original author prior to public claims and distribution of your modifications.

So in summary, there are a bunch of widely used licenses. Popular licenses in the data world are the Creative Commons suite. Popular licenses in the code world are things like Apache 2.0, BSD, MIT, Mozilla Public License, the GPL – Lesser, Affero – and also newer ones that extend the others to noncommercial restrictions, like the Business Source License and the Server Side Public License, or, rather, conditions on commercial use.

If your data and metadata are not covered clearly and accessibly by one of these licenses, or if there are additional restrictions, if those aren’t clearly and accessibly provided, then reuse of your data is going to be jeopardized.

That’ll be it for today. I’m Donny Winston and I hope you join me again next time for Machine-Centric Science.

Sending Signal-Signs

2022-05-29T15:33:35-04:00

Sending signal-signs¹
to steer engines of compute,
the wheel does no work.

M. Minsky, The Society of Mind. New York: Simon and Schuster, 1986, p. 56. ↩︎

The Persistence of Identity

2022-05-24T14:07:33-04:00

What is that strange possession that stays the same throughout its life?¹

Can we recollect how things appeared to us before we learned to link new meanings to those things?

What is this body of changelessness in spite of change?

Perhaps the purview of a thing’s persistence is its predictable pathways of provenance:²

the typical effects of its typical activities,³
its body of influenc(ed/ing) entities⁴ whose meanings change only slowly, and
whichever of its agents⁵ change the least as its life proceeds.

Data does not have intrinsic meaning:

The semantics of our data are defined by the effects it produces when passed into our functions. These effects should be predictable whenever possible, but data cannot prevent itself from being interpreted in surprising ways.⁶

An identifier is an association between a string of data and an object.⁷ The semantics of our identifiers are then defined by the effects produced by interpreters that believe records bearing witness to these associations.

A layer of indirection separates what something does from how it does it. Similarly, an identifier separates what something is from how it is.

What are some tools for predictability in indirection?

referential transparency: The semantics of purely functional code will remain the same if we replace every expression with its result, e.g. “1 + 1” with “2” (for typical senses of “1”, “+”, and “2”!).
invariant relations: The semantics of a data structure’s interface – its abstract representation, its exposed behavior – will remain the same if we decide to change its concrete representation – its internal model – as long as we enforce appropriate invariant relations on that concrete representation.⁸
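
Both tools can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `plus`, `ListSet`, `SortedListSet`, and `exercise` are hypothetical, and the two classes are just one way to show a single abstract set behind two concrete representations, each enforcing its own invariant.

```python
import bisect


def plus(a: int, b: int) -> int:
    """A pure function: same inputs, same output, no side effects."""
    return a + b


# Referential transparency: an expression and its result are interchangeable.
assert plus(1, 1) * 3 == 2 * 3


class ListSet:
    """Concrete representation: a list. Invariant: no duplicates."""

    def __init__(self) -> None:
        self._items: list[int] = []

    def add(self, x: int) -> None:
        if x not in self._items:  # enforce the invariant on every write
            self._items.append(x)

    def contains(self, x: int) -> bool:
        return x in self._items

    def size(self) -> int:
        return len(self._items)


class SortedListSet:
    """Concrete representation: a list. Invariant: sorted, no duplicates."""

    def __init__(self) -> None:
        self._items: list[int] = []

    def add(self, x: int) -> None:
        i = bisect.bisect_left(self._items, x)
        if i == len(self._items) or self._items[i] != x:
            self._items.insert(i, x)  # keeps the list sorted and duplicate-free

    def contains(self, x: int) -> bool:
        i = bisect.bisect_left(self._items, x)
        return i < len(self._items) and self._items[i] == x

    def size(self) -> int:
        return len(self._items)


def exercise(s) -> tuple:
    """Use only the abstract interface, never the internal list."""
    for x in (3, 1, 3, 2):
        s.add(x)
    return (s.size(), s.contains(2), s.contains(5))


# Same exposed behavior, different internal models.
assert exercise(ListSet()) == exercise(SortedListSet())
```

A caller that sticks to `add`, `contains`, and `size` cannot tell which representation it was handed, which is the sense in which the interface's semantics persist.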

Effects are the currency of meaning, yet their causes and conditions are ever fleeting:

…everything in the world is the result of a vast concurrence of causes and conditions, and everything disappears as these causes and conditions change and pass away.

– Buddha⁹

“All models are wrong, but some are useful.” An identity is ultimately a model, an abstract description that hides certain details while illuminating others, that can yield useful predictions when it provides adequate explanations relating primitive phenomena to one another and to more complex phenomena. Go forth and identify.

M. Minsky, The Society of Mind. New York: Simon and Schuster, 1986, p. 54. ↩︎
https://www.w3.org/TR/prov-o/ ↩︎
https://www.w3.org/TR/prov-o/#Activity ↩︎
https://www.w3.org/TR/prov-o/#Entity ↩︎
https://www.w3.org/TR/prov-o/#Agent ↩︎
Z. Tellman, Elements of Clojure. Monee, IL: Lulu.com, 2019. ↩︎
J. A. Kunze, “The ARK Identifier Scheme (v.34).” Internet Engineering Task Force (IETF), Jan. 2022. [Online]. Available: https://datatracker.ietf.org/doc/draft-kunze-ark/ ↩︎
C. A. R. Hoare, “Proof of correctness of data representations,” Acta Informatica, vol. 1, no. 4, pp. 271–281, 1972, doi: 10.1007/BF00289507. ↩︎
B. D. Kyōkai, The teachings of Buddha, 1. ed. New Delhi: Sterling Publishers, 2004. ↩︎

Why Is There Not Just One Metadata Format for All Kinds of Research Data?

2022-05-13T09:21:10-07:00

Why is there not just one metadata format for all kinds of research / data?

– asked on fairdataforum.org

Metadata modeling and formatting are separate concerns. It is reasonable that different scientific domains and studies within domains may have widely varying modeling concerns. Controlled vocabulary terms, validity constraints, and other metadata elements will surely vary and evolve over time.

What’s not as obvious is why different scientific domains and studies within domains would have different formatting concerns. Different software applications and tools may have their preferred metadata formats for operational convenience. Thus, as some software gains prominence in a specific domain, its preferred format may be adopted by other tools in the ecosystem for ease of exchange and integration.

For there to be a single metadata format that is universally adopted for metadata exchange — that is, a format that a given software tool may convert to a preferred internal format for convenience of use by the tool — that format would need to be able to communicate the model being used as well. Thus, the format would need to host a language for defining models.

There have been some efforts at this. One that has gained some recognition in the FAIR data community is the Semantic Web set of standards. Specifically, it combines the Resource Description Framework (RDF) base model, a handful of standardized plain-text exchange formats such as JSON-LD, and RDF-expressed modeling languages such as RDFS (RDF Schema), OWL (Web Ontology Language), and SHACL (Shapes Constraint Language). Together, these are one effort towards a universal “meta-model” for defining and exchanging metadata models along with the metadata itself, in plain-text formats that both humans and machines can interpret unambiguously, if only to convert metadata to preferred internal modeling languages and formats.

A Phase Space of Reproducibility

2022-05-09T12:29:56-04:00

What does “reproducible science” encompass?

A tug-of-war

Here is one decomposition into “repeatability”, “reproducibility”, and “replicability”:¹

“Repeatability (Same team, same experimental setup)…Reproducibility (Different team, same experimental setup)…Replicability (Different team, different experimental setup)…”

Here is a conflicting account of the relationship between “replicability” and “reproducibility”:²

“…even if we can replicate the results of a paper, slightly altering the experimental setup could have dramatically different results. For these reasons, we don’t want to consider the authors code, as this could be a source of bias. We want to focus on the question of reproducibility, without wading into the murky waters of replication.”

It seems that ACM’s “replicability” is Edward Raff’s “reproducibility”, and vice versa. And the colloquial phrase “repeat after me” is at odds with ACM’s “repeatability”.

A trip to the dictionary

Merriam-Webster³ defines reproduction as

the process by which plants and animals give rise to offspring and which fundamentally consists of the segregation of a portion of the parental body by a sexual or an asexual process and its subsequent growth and differentiation into a new individual.

and reports the first known use, in the above sense, as circa 1640.

So, it seems to me that reproduction subsumes replication — replication is a sub-type of reproduction where, subjective to an observer, the new individual is a replica — an indistinguishable image — of the parental body.

Repetition seems like the kind of reproduction where no material portion of a parental body is involved. “Repeat after me” is about a second agent observing the first agent and reproducing the output of the first agent without material reuse of a portion of the first agent’s material output.

Repetition is about following the same method but without using a seed from the original performance.

Reproduction of result, i.e. growth of a new individual, can occur / be attempted with or without repetition of method. With or without repeated method, the growth of a new individual from the seed may or may not be successful.

Repetition of methods, replication of results

Repetition seems method-focused (same activities) whereas replication seems result-focused (same outcomes).

A reproduction can be perceived as more or less a repetition of the original production activities, and can orthogonally be perceived as more or less a replication of the original production outcomes.

Reproduction = Representation + Repetition + Replication

In terms of the W3C Provenance ontology’s⁴ core types of Agent, Activity, and Entity (diagram at top reproduced here):

Reproducibility is the space. One axis is repeatability, i.e. “activity-dependent reproducibility”. Another axis is replicability, i.e. “entity-dependent reproducibility”. The final axis is “agent-dependent reproducibility”. For scientific reproducibility, we really (really) would like to ignore the Agent axis — sure, agent/representative identity is correlated with entity resourcing and activity resourcing capacity and skill in the real world, but we’d rather not consider it an independent axis.

Thus, we consider a scientific process reproducible in part by its repeatability (reproduction of activities) and its replicability (reproduction of entities – artifacts, results, outcomes). This seems subjective, but “up and to the right” in the figure above is what we seem to seek.

How repeatable is reproduction? How replicable is it?

Engineering emphasizes modeling for prediction, whereas science emphasizes modeling for explanation. Thus, while repeatability (same activities) may not be valued for outcome-focused reproducibility in engineering, repeatability is valued for activity-focused reproducibility in science, that is, for explainability in terms of causes and conditions.

On independent reproduction

While we can consciously seek repeatability and replicability in reproduction attempts, we also typically value so-called “independent” reproduction, where it seems the investigating agent(s) were either not aware of original activities or of original starter/intermediate/output entities, or both, and yet reproduced them anyhow.

The significance of independent reproduction is not so much for validity of reproduction as it is for assigning credit to discovery, but still, independent reproduction is often perceived as more valid due to a perceived reduction in risk of dishonest reporting.

“Artifact Review and Badging - Current.” Association for Computing Machinery. https://www.acm.org/publications/policies/artifact-review-and-badging-current (accessed May 09, 2022). ↩︎
E. Raff, “Quantifying independently reproducible machine learning,” The Gradient, Feb. 2020, [Online]. Available: https://thegradient.pub/independently-reproducible-machine-learning/ ↩︎
“Definition of REPRODUCTION.” https://www.merriam-webster.com/dictionary/reproduction (accessed May 09, 2022). ↩︎
“PROV-DM: The PROV Data Model.” W3C Recommendation, April 30, 2013. https://www.w3.org/TR/prov-dm/ (accessed May 09, 2022). ↩︎

Let Traits Accrete

2022-05-06T07:20:02-04:00

How can it be that complex, dynamic objects can be described by short and simple strings and words? We often seek:¹

Selectivity – Our images are often falsely clear. We may think of an object’s “personality” in terms of that which we can easily describe. We may set aside the rest for now as though it simply weren’t there.
Style – To avoid making decisions we consider unimportant for now, we may develop policies that become systematic traits.
Predictability – It’s hard to maintain fruitful exchange without trust, so we may try to conform to expectations. To the extent we frame our images of producer/consumer systems in terms of traits, we teach our data to behave in accordance with those same traits.
Self-Reliance – Imagined traits can, over time, make themselves actual because we must be able to predict outcomes of the use of our own data. This prediction becomes easier the more we simplify our models.

We need to be able to trust our own data, logic, and presentation resources. One way to accomplish this is to think of these resources in terms of traits, and then proceed to train those dynamic resources to behave according to those immediate images.

Still, just as a personality is merely the surface of a person, a schema is merely the surface of a dynamic digital object. What we call traits, properties, etc. are only the regularities we manage to perceive and deem worthy of systematizing at present.

We may not be able to “pin down” the traits of our digital resources because there are many processes and policies that don’t yet show themselves directly in elicited behavior but that work behind the scenes and that may only become important to name and systematize later.

M. Minsky, The Society of Mind. New York: Simon and Schuster, 1986, p. 53. ↩︎

Keying Into Fashion and Style for Knowledge Arrangement

2022-05-05T09:29:55-04:00

We often have sound practical reasons for making choices that have no reasons by themselves but have effects on larger scales.

Familiar styles make it easier for us to recognize and classify the things we see. For example, we may choose furniture according to systematic styles or fashions.

We protect ourselves from distractions by adopting uniform styles. For example, if every object in a room were interesting in itself, our furniture might occupy our minds too much.

Societies need rules that make no sense for individuals. For example, it makes no difference whether a single car drives on the left or on the right, but it makes all the difference when there are many cars!

It can save a lot of mental work if one makes each arbitrary choice the way one did before. The more difficult the decision, the more this policy can save. And there’s a paradox:¹

The more equally attractive two alternatives seem, the harder it can be to choose between them – no matter that, to the same degree, the choice can only matter less.

Thus, it is helpful to take recourse in rules of style when we’re fairly sure that further thought will just waste time. We should not abandon reasoning recklessly, but often, ordinary reasons cancel out, so it makes sense to use forms that lie beneath the surface of our thoughts – style, fashion, art – our “taste”.

When it comes to exchanging knowledge representations, you’re inviting your recipients into rooms you’ve curated – consider whether your arrangement of “furniture” there is adding value or distracting.

M. Minsky, The Society of Mind. New York: Simon and Schuster, 1986, p. 52. ↩︎

Identity and Concurrency

2022-05-04T15:13:15-04:00

Regarding a resource – dataset, model, tool, standard, agent, etc. – as a single thing can be helpful: in allocating physical space, in dealing with privacy and responsibility, in de-confusing mental activity.¹

Are human mental processes actually clean “streams of consciousness”, or is narrative a tool used to “straighten things out”, to simplify the representation of what happened? Is there really a single pipeline of ideas that flow through a mind?

And is a straightened-out story faithful to “raw” observation, or was a schema employed, a design pattern, an archetype, such as The Hero’s Journey?

In computing, modeling a workflow as a single process, a sequence of actions, is easier for us mentally than modeling it as multiple concurrent processes.

And yet it is often operationally helpful to support concurrency via resource properties like immutability, idempotence, etc., even if we later explain “what happened” – if we later communicate activity provenance to fellow human beings – as a single process, a story, a linear narrative.
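
As a rough sketch of that last point, the Python below records immutable events through an idempotent operation from several threads, then reads them back as one linear story. The `Event` and `Log` names are hypothetical, and the lock-guarded dictionary is just one simple way to make the example deterministic, not a recommended provenance store.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from threading import Lock


@dataclass(frozen=True)  # immutable: an event cannot change after creation
class Event:
    step: int
    what: str


class Log:
    def __init__(self) -> None:
        self._events: dict[int, Event] = {}
        self._lock = Lock()

    def record(self, event: Event) -> None:
        """Idempotent: recording the same event again leaves the log unchanged."""
        with self._lock:
            self._events.setdefault(event.step, event)

    def narrative(self) -> list[str]:
        """Explain "what happened" as a single ordered sequence."""
        with self._lock:
            return [self._events[k].what for k in sorted(self._events)]


events = [Event(1, "collect"), Event(2, "clean"), Event(3, "analyze")]
log = Log()

# Submit every event twice, out of order, to mimic retries and unordered arrival.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(log.record, list(reversed(events)) * 2))

story = log.narrative()
```

Because the events are immutable and recording is idempotent, arrival order and duplicate deliveries do not affect the final log, and the story can still be told step by step.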

M. Minsky, The Society of Mind. New York: Simon and Schuster, 1986, p. 51. ↩︎

"We Need to Dockerize and Distribute Robert"

2022-04-29T10:51:48-04:00

It does not help for you to think that inside yourself lies someone else who does your work. This notion of “homunculus” – a little person inside each self – leads only to a paradox since, then, that inner Self requires yet another movie screen inside itself, on which to project what it has seen!¹

"The remote-control self" [1].

A thing with no parts provides nothing that we can use as pieces of explanation…Why are we tempted to embrace the strange idea that what we do is done by Someone Else – that is, our Self? Because so much of what our minds do is hidden from the parts of us that are involved with verbal consciousness.¹

A Sidney Harris classic.

Design is taking things apart in order to be able to put them back together.² You must design the digital resources you archive and disseminate, so that you don’t “need to dockerize and distribute Robert” (overheard in a Slack room), which, of course, you can’t.

M. Minsky, The Society of Mind. New York: Simon and Schuster, 1986, p. 50. ↩︎ ↩︎
R. Hickey, “Design, Composition and Performance,” presented at the QCon, San Francisco, Nov. 2013. Transcript. ↩︎

Ensure That Provenance Bottoms Out

2022-04-28T11:33:01-04:00

Some questions may be pursued circularly, where for example you cannot find a final cause – you must ask, What caused that cause? Or you cannot find an ultimate goal – Then what purpose does that serve? Such loops can waste our time.¹

It is a form of self-control to establish ways to bottom out, to employ base cases to stop recursion. When a child repeatedly asks Why?, adults may employ Just because!

Cultures establish ways to deal with the need for bottoming out such as branding with shame or taboo, cloaking in awe or mystery, and consensus. Cultures evolve institutions that adopt specific answers to circular questions and establish authority-schemes to enforce these beliefs.

One could complain that such establishments substitute dogma for reason and truth. But in exchange, they spare whole populations from wasting time in fruitless reason loops. Rather, minds can more productively work on problems that can be solved.

When annotating a digital research object with lifecycle provenance metadata, including conceptual provenance relating to hypotheses and study design, it is reasonable to “bottom out” to a current consensus view, a milestone along the path of Kuhnian² paradigm shifts stretching to the past and to the unknown future.

M. Minsky, The Society of Mind. New York: Simon and Schuster, 1986, p. 49. ↩︎
T. S. Kuhn, The structure of scientific revolutions. 1962. ↩︎

"Straightening Out" Circular Causality

2022-04-27T11:31:34-04:00

We often seek to “straighten out” a maze-like, loop-containing situation. We try to find a “path” through “causal” explanations that go in only one direction. Why?

There are countless different types of networks that contain loops. But all networks that contain no loops are basically the same: each has the form of a simple chain.

Any directed acyclic graph (DAG) can be linearized, i.e. topologically sorted. And we can apply the same types of reasoning to everything we can represent in terms of chains of causes and effects. We can proceed from start to end without any need for a novel thought.
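
For a concrete feel, here is a minimal sketch using Python's standard-library `graphlib` module (Python 3.9+). The `depends_on` graph is a made-up provenance example, and `linearize` and `has_loop` are hypothetical helper names.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical provenance graph: each key depends on the nodes in its set.
depends_on = {
    "figure": {"analysis"},
    "analysis": {"clean_data", "model"},
    "clean_data": {"raw_data"},
    "model": {"raw_data"},
    "raw_data": set(),
}


def linearize(graph: dict) -> list:
    """Return one chain in which every node comes after all of its dependencies."""
    return list(TopologicalSorter(graph).static_order())


def has_loop(graph: dict) -> bool:
    """A graph with a loop cannot be straightened out into such a chain."""
    try:
        linearize(graph)
        return False
    except CycleError:
        return True


order = linearize(depends_on)
```

The relative order of `clean_data` and `model` is not fixed by the graph; any chain that respects the dependencies counts as a valid linearization.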

But frequently, to construct such a path, we have to ignore important interactions and dependencies that run in other directions.

In loopy situations, one may find success in shifting from “causal” learning to “clausal” learning. If data values are annotated with dependencies, e.g. labeled with external provenance, with justifications for data-processing decisions, etc., then dependency-directed backtracking may help to path-find by avoiding sets of premises that support previously discovered contradictions.¹

In this way, annotating data with provenance metadata can formally help to “straighten things out”.

C. Hanson and G. J. Sussman, Software design for flexibility: how to avoid programming yourself into a corner. Cambridge: The MIT Press, 2021. ↩︎

Data Ventriloquy

2022-04-25T11:41:48-04:00

Punch and Judy, to their audience:

Our puppet strings are hard to see,
So we perceive ourselves as free,
Convinced that no mere objects could
Behave in terms of bad or good.

To you, we mannikins seem less
than live, because our consciousness
is that of dummies, made to sit
on laps of gods and mouth their wit;

Are you, our transcendental gods,
likewise dangled from your rods,
and need, to show spontaneous charm,
some higher god’s inserted arm?

We seem to form a nested set,
with each the next one’s marionette,
who, if you asked him, would insist
that he’s the last ventriloquist.

– Theodore Melnechuk

Who’s the last ventriloquist when it comes to a dataset? You pull data and accrete it with other data, reshape it, and reduce it, and ultimately make it dance and speak via structured representation, software action, narrative exposition, etc.

Can someone further contribute to the life journey of that processed data, repurposing your representation, inserting their arm to make an adjustment with a tool of their choice?

Crossing the Inter-Lab Chasm

2022-04-24T11:39:00-04:00

Without enduring self-ideals, our [research] would lack coherence. As individuals, we’d never be able to trust ourselves to carry out our [protocols]. In a social group, no one person would be able to trust the others. A working society must evolve mechanisms that stabilize ideals – and many of the social principles that each of us regards as personal are really “long-term memories” in which our cultures store what they have learned across the centuries.

Your electronic lab notebook (ELN) and/or laboratory¹ information management system (LIMS)² embodies ideals in various implicit or explicit forms: data formats and validators, software transformation functions and tests, planning/recording/reporting document templates, etc.

Are these ideals siloed in your ELN/LIMS, or are they shared? If they are shared, how, and is the sharing robust?

whether “controlled physical / wet”, “field”, or “computer” laboratory ↩︎
and you do have one, whether personal (and perhaps embarrassing), project-wide, lab-group-wide, institute-wide, etc. ↩︎

Place Oriented Publishing Versus Value Oriented

2022-04-23T11:37:22-04:00

In place-oriented publishing, as in place-oriented programming, you allocate places to push things, and you pull from places. Places steward values. “Where” something is, is important. Did you publish to a reputable place?

In value-oriented publishing, as in value-oriented programming, you pass around values, or references to values, not to places. You deal directly in values. Dereferencing services steward values. “What” something is, is important, not “where” something is. Did you publish something valued as reputable by other resources – peer reviews, other publications, etc. – that reference it and in turn are valued as reputable (cf. PageRank)?

Slot Long Range Plans Into Ecosystems

2022-04-22T11:36:21-04:00

A principled system has predictable relationships between its modules, whereas an adaptable system has sparse and flexible relationships between its modules.

In a selfconscious culture, design and construction is a specialized task, taught in schools using abstract principles, whereas in an unselfconscious culture, design and construction is taught using direct demonstration and reflects the constraints and variation of an environment.¹

The structures of a selfconscious culture are principled; they are not meant to change. If the environment changes, the structure is hardened against the change rather than adapting to it. Some principled structures, like skyscrapers or stadiums, can hold thousands of people.

The structures of an unselfconscious culture are adaptable; they reflect the present needs of the inhabitants. Examples include igloos – there is no “architect” – each person builds their own home. If an igloo grows too warm, someone can poke a hole in the wall. When it grows too cold, the hole can be filled in. Such structures tend to be only large enough to hold a single family.

How can we be principled, having long-range plans, and also be adaptable? How can we balance what we want to be versus what we want now?

An interface pulled in many directions is intrinsically stable, but an interface pulled in a single direction tends to shift – the interface itself will become vestigial. For example, the mitochondrion is now just another interdependent part of a principled whole – it is no longer an independent organism.

It is the ecosystem, not the organism, that adapts to change. An organism may disappear, its niche filled by something else. Roles are fungible because organisms consume and emit the same resources; they share a common interface.

Consider Overton windows: lawmakers often position proposed legislation (principled components) with respect to an observed ecosystem of discourse (currently-stable interface).

We cannot have a system that is wholly principled. We can have a collection of principled components, built to be discarded, slotted into / separated by interfaces that can last only given a rich ecosystem of alternatives.

C. Alexander, Notes on the synthesis of form, 1964. ↩︎

Schemes for Indirect Control

2022-04-20T10:55:57-04:00

There are two fundamental approaches to indirect control in code.¹

In an open approach, we can change the behavior of dereferencing code by conveying different values. The decision-making mechanism must be unordered. Typically, this is implemented in code using a data structure with a distinct set of keys, i.e. a lookup table.

In a closed approach, we can only change the decision-making process by changing the underlying code. A conditional, e.g. a statement that uses an if/elif/else or match/case form in Python, is closed. It is ordered, and if predicates aren’t disjoint, order matters.

While an open table conveys values, a conditional decides based on values. For a table to be useful, it must avoid conflicts. Conditionals avoid conflicts by making explicit, fixed decisions. An open approach must avoid conflicts in a dynamic way.
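
The contrast can be sketched in Python as follows. The function names, the `HANDLERS` table, and the format labels are all hypothetical, chosen only to show one conditional whose predicates overlap and one table with distinct keys.

```python
def describe_closed(fmt: str) -> str:
    """Closed: supporting a new format means editing this function."""
    if fmt.startswith("json"):
        return "JSON family"
    elif fmt == "json-ld":  # never reached: the earlier predicate already matches
        return "JSON-LD"
    else:
        return "unknown"


# Open: behavior lives in a value. Distinct keys mean no two entries conflict.
HANDLERS = {"json": "JSON family", "json-ld": "JSON-LD"}


def describe_open(fmt: str, table: dict = HANDLERS) -> str:
    """The lookup is unordered; pass a different table to change behavior."""
    return table.get(fmt, "unknown")


# Extending the open approach needs no change to describe_open itself.
extended = {**HANDLERS, "turtle": "Turtle"}
```

Here `describe_closed("json-ld")` returns "JSON family" because the predicates are not disjoint and the broader one comes first, whereas `describe_open("json-ld")` returns "JSON-LD", and `describe_open("turtle", extended)` picks up a format the original table never mentioned.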

Consider all the kinds of tricks we use to try to force ourselves to work when we’re tired or distracted:

Willpower: Tell yourself, “Don’t give in to that,” or, “Keep on trying.”
Activity: Move around. Exercise. Inhale. Shout.
Expression: Set jaw. Stiffen upper lip. Furrow brow.
Chemistry: Take coffee, amphetamines, or other brain-affecting drugs.
Emotion: “If I win, there’s much to gain, but more to lose if I fail!”
Attachment: Imagine admiration if you succeed – or disapproval if you fail – especially from those to whom you are attached.

So many schemes for self-control! How do we choose which ones to use? There isn’t any easy way. Self-discipline takes years to learn; it grows inside us stage by stage.

Z. Tellman, Elements of Clojure. Monee, IL: Lulu.com, 2019. ↩︎

Directness Is Dangerous

2022-04-19T13:17:02-04:00

If self-control were easy, we might end up accomplishing nothing at all. Extinction would be swift indeed for species that could simply switch off hunger or pain. Instead, there must be checks and balances.

Imagine if any one process could seize and hold control over all the rest – we wouldn’t complete many tasks. So, for processes to exploit each other’s skills, roundabout pathways have to be discovered.

Fantasies can provide missing paths. You may not be able to make yourself angry simply by deciding to be angry, but you can still imagine objects or situations that make you angry, that, in effect, arouse your Anger and, for example, its tendency to counter Sleep as a self-control “trick” to continue Work.

For conscious schemes for self-control to work, e.g. in which we offer rewards to ourselves, incentives need to be discovered – our processes’ dispositions must be learned. This may involve bargaining and deception. But self-incentive tricks often don’t work because, again, directness is dangerous.

Indirection is a hallowed design technique in computer programming. Code and processes can be made more robust through separation between what and how, when “how does this work?” is best answered, “it depends.”

The Conservative Self

2022-04-18T22:00:01-04:00

To understand what we call the Self, we first must see what Selves are for. One function of the Self is to keep us from changing too rapidly...If we changed our minds too recklessly, we could never know what we might want next. We’d never get much done because we could never depend on ourselves.

Consider Hickey’s epochal time model:

What is the function here of Identity (i.e., the Self)? What makes a succession of states more than just a sequence of values?

Consider identity = id + entity. That is, an identity is a unique instance (with a primary key id) of an entity of a certain type.

What makes something an entity of a certain type? It must satisfy an entity spec, i.e. maintain model invariants, including but not limited to the syntactic schema.

Can something have multiple “id-entities”? Sure. Something can be a Study with ID 123 as well as an Activity with ID 234 if (a) each value over time passes validation as both a Study and an Activity, and (b) each value over time resolves to the same unique individual within each of the Study and Activity model abstractions – i.e., as a function of the data attributes that these models choose to consider in order to distinguish individuals, each value is a “state” of “the same” individual.

# expression of a non-unique name assumption
study:123 owl:sameAs activity:234 .

To achieve conservation of id-entity, to successfully associate an id-entity – an identity – with a sequence of values, to proclaim that sequence to be a succession of states, is to feel confident that certain model invariants are being maintained across the values’ history, that change is not reckless, that one can depend on something.
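
One way to make this checkable is sketched below in Python. Everything here is an assumption for illustration: the `is_study` and `is_activity` specs, the attribute names, and the choice of which attributes each model uses to distinguish individuals.

```python
from typing import Callable, Hashable, Mapping, Sequence

Value = Mapping[str, object]


def is_study(v: Value) -> bool:
    """Hypothetical Study spec: has a protocol and a non-empty title."""
    return isinstance(v.get("protocol"), str) and bool(v.get("title"))


def is_activity(v: Value) -> bool:
    """Hypothetical Activity spec: has a start year and does not end before it."""
    started, ended = v.get("started"), v.get("ended")
    return isinstance(started, int) and (ended is None or ended >= started)


def conserves_identity(
    history: Sequence[Value],
    spec: Callable[[Value], bool],
    key: Callable[[Value], Hashable],
) -> bool:
    """True if every value passes the spec and all resolve to one individual."""
    return (
        len(history) > 0
        and all(spec(v) for v in history)
        and len({key(v) for v in history}) == 1
    )


history = [
    {"protocol": "P-7", "title": "Pilot", "started": 2021, "ended": None},
    {"protocol": "P-7", "title": "Pilot (revised)", "started": 2021, "ended": 2022},
]

# The same sequence of values is a succession of states under both models.
as_study = conserves_identity(history, is_study, key=lambda v: v["protocol"])
as_activity = conserves_identity(history, is_activity, key=lambda v: v["started"])

# A reckless change: the start year moves, so Activity sees a new individual.
reckless = history + [{"protocol": "P-7", "title": "Pilot", "started": 2023, "ended": None}]
still_activity = conserves_identity(reckless, is_activity, key=lambda v: v["started"])
```

In this sketch the revised title does not disturb either identity, while the changed start year breaks the Activity identity and leaves the Study identity intact.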

Core Versus Crust

2022-04-18T16:18:06-04:00

The art of a great painting is not in any one idea, nor in a multitude of separate tricks for placing all those pigment spots, but in the great network of relationships among its parts.

Similarly, the data and code that comprise a digital resource are by themselves as valueless as aimless, scattered daubs of paint. What counts is what we make of them, in operational and functional phases of processes in systems.

The value of a FAIR resource lies not in some small, precious core, but in its vast, constructed crust.

A fierce belief in conceptual cores – in spirits, souls, or essences – may insinuate a helplessness to improve. To seek virtue in such cores may be as wrongly aimed a search as seeking art in canvas cloths by scraping off the painter’s work.

One Self or Many

2022-04-15T10:22:47-04:00

Is a self a centralized entity? Is it a society that includes both images of what is (“data”) and ideals about what ought to be (“schema”)?

Sometimes we feel decentralized or dispersed, as though we were made of many different parts with different tendencies: “One part of me wants this, another part wants that. I must get better control of myself.” We sense feelings of disunity, conflicting motives, compulsions, internal tensions, and dissensions. We carry on negotiations in our head.

We may also have a single view at times: “I think, I want, I feel. It’s me, myself, who thinks my thoughts. It’s not some nameless crowd or cloud of selfless parts.” And the times we feel most reasonably unified can be just the times that others see us as the most confused.

In an epochal time model, a “self” – an identity – is a succession of images – of states; observers/processors may interpret the state-image data via a succession of contexts – of ideals / schema-versions.

Recent renewed interest in domain-team-driven distributed data-product governance, i.e. “data mesh”, may too be an expression of one-self-or-many tension.

What Functions Do Ideas About Selves Serve?

2022-04-13T08:32:51-04:00

One must not mistake defining things for knowing what they are. You can know what a tiger is without defining it. You may define a tiger, yet know scarcely anything about it.

“Self” is a term used to talk about a sense of identity. Instead of asking, “What are selves?” we can ask, instead, “What are our ideas about Selves?” – and then we can ask, “What psychological functions do those ideas serve?”

Our ideas about our Selves include beliefs about what we are – both what we are capable of doing and what we may be disposed to do. We may refer to such beliefs as self-images, as opposed to self-ideals, that is, ideas about what we’d like to be or about what we ought to be.

When dealing with digital resources – datasets, models, workflows, schema – there are subtle semiotics at play in representing and communicating these selves and their identities:¹

There are real things that occupy a given domain and scope of inquiry that are, unfortunately, neither understandable nor transmittable as fully correct messages (in the Shannon information sense).

Consider a dynamic digital object that represents the total information theoretic potential of – that is, all that one might say about – a real object. In representing our dynamic objects, we can only convey them as somewhat incomplete immediate objects – there is information loss.

So too is there loss in how these immediate objects are pointed to or signified – as something iconic like an image, described in words, etc. Our signification, our message, is imperfect.

And there is information loss at the response level by the interpreter that must decode the message that had to be encoded.

Finally, this may be the case not just for real things, but for ideal things - not just what is, but what ought to be.

M. K. Bergman, A Knowledge Representation Practionary: Guidelines Based on Charles Sanders Peirce. Springer International Publishing, 2018. doi:10.1007/978-3-319-98092-8. ↩︎

Appearing Opposed, Related Goals

2022-04-12T16:02:49-04:00

Pain can simplify point of view. When you’re in pain, it’s hard to think of anything else.

Pleasure too can simplify point of view. You may feel that nothing is more important than finding a way to make that pleasure last.

We think of pain and pleasure as opposites - pleasure makes us draw its object near, whereas pain impels us to reject its object. They are also similar – they both distract, making rival goals seem small.

Fear and courage each do best by knowing both. Whether on offense or defense, you seek to guess the opponent’s plan.

Sometimes, one of two seeming opposites is nothing but the absence of the other: sound and silence, light and darkness, interest and unconcern.

In appearing opposed, two things may serve related goals, or may engage selfsame agencies.

In contexts of data and code, opposition may signal an opportunity for structural sharing, for reuse.

Destructive Acts Serving Constructive Goals

2022-04-11T15:19:36-04:00

Let’s say that the urges of the Play process compete with those of other processes, like Sleep:

If Sleep wrests control, then perhaps a Wrecker-process urge, previously constrained and now freed from Play’s constraint, need only persist for one more kick to gain the satisfaction of a final crash:

This destructiveness may seem senseless, but it may serve to communicate frustration at the loss of a goal, and to serve constructive goals by leaving fewer problems to be solved – the kick may leave a mess “outside”, yet it may tidy the process orchestration.

It isn’t true that when Sleep starts, Play must quit and all its agents have to cease. A mind can “go to bed, yet still build towers in its head.”

Hierarchies, Heterarchies, and Agent Memory

2022-04-10T10:34:01-04:00

In a hierarchy, each agent acts on behalf of at most one other agent:

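from rdflib import Graph
from rdflib.namespace import RDF, PROV

def hierarchy(graph):
    """Return True if every prov:Agent acts on behalf of at most one other agent."""
    return all(
        len(set(graph.objects(agent, PROV.actedOnBehalfOf))) <= 1
        for agent in graph.subjects(RDF.type, PROV.Agent)
    )

# http://example.org/ is a placeholder namespace for the example agents.
hierarchy(Graph().parse(format="turtle", data="""
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix :     <http://example.org/> .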

:doc
    a prov:Agent;
.

:sleepy
    a prov:Agent;
    prov:actedOnBehalfOf :doc;
.

:sneezy
    a prov:Agent;
    prov:actedOnBehalfOf :doc;
.

""")) # True

Hierarchies do not always work. Sometimes, for example, agents need to use each other’s skills:

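def heterarchy(graph):
    """Return True if some agent acts on behalf of more than one other agent."""
    return not hierarchy(graph)

# http://example.org/ is a placeholder namespace for the example agents.
heterarchy(Graph().parse(format="turtle", data="""
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix :     <http://example.org/> .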

:doc
    a prov:Agent;
.

:sleepy
    a prov:Agent;
    prov:actedOnBehalfOf :sneezy, :doc;
.

:sneezy
    a prov:Agent;
    prov:actedOnBehalfOf :sleepy, :doc;
.

""")) # True

In heterarchies, per-agent working memory becomes critical: an agent must keep track of what next to do in a job A if it starts a job B before A is done. In hierarchies, priority and preemption can straightforwardly waterfall down from the top supervisor.
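
The bookkeeping described above can be sketched in a few lines. This is only an illustration: the `Job` and `Agent` classes, the `suspended` list, and the `interrupt_at` parameter are hypothetical names invented here, not part of PROV or rdflib. The point is that the agent itself must remember where job A stopped before it starts job B.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    steps: list          # steps to perform, in order
    next_step: int = 0   # index of the step to run next


@dataclass
class Agent:
    name: str
    suspended: list = field(default_factory=list)  # working memory: jobs on hold
    log: list = field(default_factory=list)        # record of steps performed

    def run(self, job, interrupt_at=None, interrupting_job=None):
        """Run `job` step by step, optionally starting another job partway through."""
        while job.next_step < len(job.steps):
            if interrupt_at is not None and job.next_step == interrupt_at:
                # Remember where job A stopped before starting job B.
                self.suspended.append(job)
                interrupt_at = None
                self.run(interrupting_job)
                # Resume A exactly where it left off.
                job = self.suspended.pop()
            self.log.append(f"{job.name}:{job.steps[job.next_step]}")
            job.next_step += 1


sleepy = Agent("sleepy")
sleepy.run(
    Job("A", ["a1", "a2", "a3"]),
    interrupt_at=1,
    interrupting_job=Job("B", ["b1", "b2"]),
)
print(sleepy.log)
```

In a strict hierarchy this memory could live with the supervisor instead; here each agent has to carry its own.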

Agents and Hierarchies

2022-04-08T10:39:46-04:00

Designing any society, be it human or computational, involves decisions like these:

Which agents choose which others to do what jobs?
Who will decide which jobs are done at all?
Who decides what efforts to expend?
How will conflicts be settled?

Furthermore, roles even in a hierarchy are always relative. To Builder, Add is a subordinate, but to Find, Add is a boss. As for yourself, which sorts of thoughts concern you most – the orders you are supposed to take or those you’re supposed to give?

The Principle of Noncompromise

2022-04-07T09:52:04-04:00

The longer an internal conflict persists among an agent’s subordinates, the weaker becomes that agent’s status among its own competitors. If such internal problems aren’t settled soon, other agents will take control and the agents formerly involved will be “dismissed.”

Whenever several agents have to compete for the same resources, they are likely to get into conflicts.

Those agents’ superiors, too, may be under competitive pressure and likely to grow weak themselves whenever their subordinates are slow in achieving their goals, no matter whether because of conflicts between them or because of individual incompetence.

However, an agency that has “lost control” can continue to work inside itself – and thus become prepared to seize a later opportunity.

Must every “mind” contain some topmost center of control? Not necessarily. We sometimes settle conflicts by appealing to superiors, but other conflicts never end and never cease to trouble us.

Good human supervisors plan ahead to avoid conflicts in the first place, and – when they can’t – they try to settle quarrels locally before appealing to superiors. But tiny mental/computational agents simply cannot know enough to be able to negotiate with one another or to find effective ways to adjust to each other’s interference. Only larger agencies could be resourceful enough to do such things, to become versatile enough to negotiate by offering support for their subordinates’ goals.

“Please, Wrecker, wait a moment more till Builder adds just one more block: it’s worth it for a louder crash!”

Migrating Conflicts Between Agents to Higher Levels

2022-04-06T16:46:31-04:00

Many children not only like to build, they also like to knock things down – to hear the complicated noises and watch so many things move at once.

Let’s imagine a sibling agent to Builder called Wrecker, whose specialty is knocking things down:

Suppose Wrecker gets aroused, but there’s nothing in sight to smash. Then Wrecker will have to get some help – by putting Builder to work, for example.

But what if, at some later time, Wrecker considers the tower to be high enough to smash, while Builder wants to make it taller still? Who could settle that dispute?

Is the decision left to Wrecker, who activated Builder in the first place? What if both were activated by a higher-level agent, Play-with-Blocks? What if that agent in turn was activated by a Play agent, who may be in conflict with Eat and Sleep?

A child’s play is not an isolated thing. It always happens in the context of other real-life concerns. Whatever we may choose to do, there are always other things we’d also like to do.

In single-thread, synchronous computer programming, prolonged conflict may be avoided. A function A may call another function B in its body and wait for B to return. If B encounters trouble, it can raise an exception for A to catch. The chain of command for conflict resolution can be clear.
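
As a minimal sketch of that chain of command, assume two hypothetical functions `a` and `b` and an invented exception class `BlockTooHeavy`. `a` calls `b`, waits for it, and is the only place where `b`'s trouble gets resolved:

```python
class BlockTooHeavy(Exception):
    """Hypothetical error raised when B cannot finish its task."""


def b(weight):
    # B does its work, or signals trouble to whoever called it.
    if weight > 10:
        raise BlockTooHeavy(f"cannot lift a block of weight {weight}")
    return f"lifted block of weight {weight}"


def a(weight):
    # A calls B and waits for it; A alone decides what to do if B fails.
    try:
        return b(weight)
    except BlockTooHeavy as err:
        return f"A handled the problem: {err}"


print(a(5))
print(a(20))
```

There is never a moment when `a` and `b` are both active and in disagreement, which is why the conflict cannot drag on.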

In the case of asynchronous, independent agents like Builder and Wrecker, what is the effect of prolonged conflict? Perhaps the conflict tends to weaken their mutual superior, Play-with-Blocks. In turn, this could reduce Play-with-Blocks’s ability to suppress its rivals. Next, if that conflict isn’t settled soon, it could weaken Play at the next-highest level. Then, Eat or Sleep might seize control.

Data, data everywhere, but not a value to validate…

Are People Machines?

2022-04-05T14:43:25-04:00

Are people machines?

“Everyone knows that machines can behave only in lifeless, mechanical ways.”

This objection seems reasonable. A person ought to feel offended at being likened to any trivial machine. But it seems to me that the word “machine” is getting to be out of date.

We ought to recognize that we’re still in an early era of machines, with virtually no idea of what they may become. What if some visitor from outer space had come a billion years ago to judge the fate of earthly life from watching clumps of cells that hadn’t even learned to crawl?

Our first intuitions about computers came from experiences with machines of the 1940s, which contained only thousands of parts. But a human brain contains billions of cells, each one complicated by itself and connected to many thousands of others.

Present-day computers represent an intermediate degree of complexity. And yet, we continue to use old words as though there had been no change at all. We need to adapt our attitudes to phenomena that work on scales never before conceived. Does the term “machine” take us far enough?

Rhetoric won’t settle anything. In trying to understand what the vast mechanisms of the human brain may do, we can find self-respect in knowing what wonderful machines we are (and what a FAIR, internetworked “electronic brain” could be).

GUIs and APIs Are Both Human Interfaces

2022-04-05T11:18:11-04:00

GUIs and APIs are both human interfaces. They both frame perspectives on data/operation service offerings so that human beings can navigate and consume them. The human being in the case of APIs is the application programmer, a subset of users. GUIs are applications, so it is natural to expect an API’s capabilities to be a superset of a corresponding GUI’s — application programmers program the GUI using the API.

Your resource service interface is not necessarily understandable – discoverable, crawl-able – by machines. APIs are generally not machine-actionable interfaces.

Nor is it necessarily wise that a given API be made machine-actionable. This would result in a two-audience problem. With two different target audiences, humans and machines, how could an API serve both well?

I used to think that GUIs were for humans and APIs were for machines. I now have a SICP-esque perspective on APIs: they “must be written for people to read, and only incidentally for machines to execute.”¹

H. Abelson, G. J. Sussman, and J. Sussman, Structure and Interpretation of Computer Programs, 2nd ed. MIT Press, 2002. ↩︎

Easy Things Are Hard

2022-04-04T13:30:15-04:00

In general, we’re least aware of what our minds do best. It’s mainly when systems start to fail that we engage the special agencies involved with what we call “consciousness.”

Accordingly, we’re more aware of simple processes that don’t work well than of complex ones that work flawlessly.

This phenomenon helps to explain the poor performance of many so-called expert systems in the 1980s. There were attempts to fully rationalize human expertise as calculative rules. The effect was often to regress an expert’s knowing how to a novice practitioner’s knowing that.

Skill acquisition in unstructured domains moves not towards abstract rules, but rather from abstract rules to particular cases. And “the distinction between education, a process aimed at drawing out the abilities of the student, and training, in which the student is learning to negotiate a structured domain, is crucial.”

This may help shed light on much of the recent mixed success of “unexplainable” neural-network-based decision systems.

Holes and Parts

2022-04-03T11:20:34-04:00

What keeps a mouse contained in a box?

It is the way a box prevents motion in all directions. Each board bars escape in a certain direction. The left side keeps the mouse from going left, the right from going right, the top keeps it from leaping out, and so on.

The secret of a box is simply in how the boards are arranged to prevent motion in all directions. That’s what containing means.

It’s silly to expect any separate board by itself to contain any containment, even though each contributes to the containing. It is like the cards of a straight flush in poker; only the full hand has any value at all.

The same applies to words like life and mind. It is foolish to use these words for describing the smallest components of living things because these words were invented to describe how larger assemblies interact. Like boxing-in, words like living and thinking are useful for describing phenomena that result from certain combinations of relationships.

None of the 15 FAIR principles¹ contain FAIR. A digital resource will not become “more FAIR” when it adheres to one rather than none of the principles.

However, just like life has gradually lost much of its mystery – at least for modern biologists, because they understand so many of the important interactions among the chemicals in cells – FAIR can be demystified by understanding how the components of a well-made FAIR resource interact to facilitate reuse and repurposing.²

M. D. Wilkinson et al., “The FAIR Guiding Principles for scientific data management and stewardship,” Sci Data, vol. 3, no. 1, p. 160018, Mar. 2016, doi: 10/bdd4. ↩︎
I’ve been going over the FAIR principles one by one on my podcast. Each such episode has averaged about five minutes. ↩︎

Parts and Wholes

2022-04-01T11:19:24-04:00

We’re often told that certain wholes are “more than the sum of their parts.” We hear this expressed with reverent words like “holistic” and “gestalt,” whose academic tones suggest that they refer to clear and definite ideas. But I suspect the actual function of such terms is to anesthetize a sense of ignorance. We say “gestalt” when things combine to act in ways we can’t explain, “holistic” when we’re caught off guard by unexpected happenings and realize we understand less than we thought we did.

What makes a tower more than separate blocks, or a wall more than a set of many bricks? Every block/brick is held in place by its neighbors and gravity. Why is a chain more than its various links? To explain why chain-links cannot come apart, we can demonstrate how each would get in its neighbors’ way.

In graphical diagrams of such physical situations, the edges drawn between nodes are – implicitly or explicitly – labeled, qualified relations. An arrow is not a mystery – it is, for example, gravitational force.

Sometimes, giving names to things can help by leading us to focus on some mystery. It’s harmful, though, when naming leads the mind to think that names alone bring meaning close.

With Linked Data, all edges relating parts are labeled, and those labels are things, not strings. Such discipline can help us to not fool ourselves.
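
To make "things, not strings" concrete, here is a small standard-library sketch rather than a real RDF toolkit. The `IRI` class, the `http://example.org/` namespace, and the `supportedBy` relation are assumptions made up for illustration; only the `rdfs:label` IRI is a real vocabulary term. Because the edge label is itself an identified thing, statements can be made about it like about anything else:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IRI:
    """An identifier for a thing, as opposed to a plain string label."""
    value: str


EX = "http://example.org/"  # placeholder namespace for illustration
RDFS_LABEL = IRI("http://www.w3.org/2000/01/rdf-schema#label")

block_a, block_b = IRI(EX + "blockA"), IRI(EX + "blockB")
supported_by = IRI(EX + "supportedBy")  # hypothetical relation between parts

triples = {
    (block_b, supported_by, block_a),               # an edge between two parts
    (supported_by, RDFS_LABEL, "is supported by"),  # the edge label, itself described
}


def describe(thing):
    """Return every (predicate, object) pair stated about `thing`."""
    return {(p, o) for s, p, o in triples if s == thing}


print(describe(supported_by))
```

A bare string such as "supported by" on an arrow could not be looked up or described this way.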

Novelists and Reductionists

2022-03-31T11:18:14-04:00

Some like to focus on the new. They like to invent theories.

Some are adamant about reducing to what has come before. This has worked remarkably well for the core of physics.

These inclinations are not incompatible given some kind of “leveling”, with discipline about connections. Standing on the shoulders of giants and all that.

Much of apparent “novelty” may be reducible to the structured annotation and (re-)configuration of core mechanism. Like how various organisms’ genetic inheritances have been modded over millennia.

Linked Data may be framed as a way to get novelists and reductionists to sit at the table of FAIR together.

Components and Connections

2022-03-30T15:05:42-04:00

An agent like Builder is not merely a collection of parts like Find, Get, Put, and all the rest. Builder would not work at all unless those agents were linked to one another by a suitable network of interconnections:

Could you predict what Builder does from knowing just that left-hand list above? Of course not. First, we must know how each separate part works. Second, we must know how each part interacts with those to which it is connected. And third, we have to understand how all these local interactions combine to accomplish what that system does – as seen from the outside.

There is lots of prior art for understanding combinations of component interactions, whether as expression trees or wiring diagrams. Computer programming has traditionally emphasized the former, but note how Move has two “parents” in the diagram above.

I leave you with the intro to the last chapter of Software Design for Flexibility¹:

Decades of programming experience have taken a toll on our collective imagination. We come from a culture of scarcity, where computation and memory were expensive, and concurrency was difficult to arrange and control. This is no longer true. But our languages, our algorithms, and our architectural ideas are based on those assumptions. Our languages are basically sequential and directional – even functional languages assume that computation is organized around values percolating up through expression trees. Multidirectional constraints are hard to express in functional languages.

Escaping the Von Neumann straitjacket

The propagator model of computation provides one avenue of escape. The propagator model is built on the idea that the basic computational elements are propagators, autonomous independent machines interconnected by shared cells through which they communicate. Each propagator machine continuously examines the cells it is connected to, and adds information to some cells based on computations it can make from information it can get from others. Cells accumulate information and propagators produce information.

Since the propagator infrastructure is based on propagation of data through interconnected independent machines, propagator structures are better expressed as wiring diagrams than as expression trees. In such a system partial results are useful, even though they are not complete. For example, the usual way to compute a square root is by successive refinement using Heron’s method. In traditional programming, the result of a square root computation is not available to subsequent computations until the required error tolerance is achieved. By contrast, in an analog electrical circuit that performed the same function, the partial results could be used by the next stages as first approximations to their computations. This is not an analog/digital problem—it is organizational. In a propagator mechanism the partial results of a digital process can be made available without waiting for the final result.
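
The square-root example in the passage above can be illustrated with a Python generator. This is not a propagator network, and it is not code from the book; it is only a sketch of the narrower point that each partial result of Heron's method can be handed to a consumer as soon as it exists. The function name `heron_sqrt` and the starting guess of 1.0 are choices made here.

```python
from itertools import islice


def heron_sqrt(x, guess=1.0):
    """Yield successive approximations of sqrt(x) by Heron's method."""
    while True:
        yield guess
        # Average the guess with x / guess to get a better guess.
        guess = (guess + x / guess) / 2


# A downstream consumer can use each partial result as a first approximation,
# instead of waiting for an answer within some error tolerance.
approximations = list(islice(heron_sqrt(2.0), 6))
for step, value in enumerate(approximations):
    print(step, value)
```

The consumer, not the producer, decides when an approximation is good enough to act on.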

C. Hanson and G. J. Sussman, Software design for flexibility: how to avoid programming yourself into a corner. Cambridge: The MIT Press, 2021. ↩︎

Wholes and Parts, in FAIR and Mind

2022-03-29T08:43:41-04:00

It is the nature of the mind that makes individuals kin, and the differences in the shape, form, or manner of the material atoms out of whose intricate relationships that mind is built are altogether trivial.

– Isaac Asimov

It is the nature of FAIR that can make digital resources kin – interoperable in fugues of machine action. The differences in the schema, serialization, or domain-specificity of the digital datoms out of whose intricate relationships any given resource is built are altogether trivial.

Agents and Agencies

2022-03-28T23:42:18-04:00

We want to explain complicated things as a combination of simpler things. You must be prepared to feel a certain sense of loss. When we break things down to their smallest parts, they may each seem as dry as dust at first, as though some essence has been lost.

Where does the “knowing-how-to-build” of a Builder agent reside? It is not in any part, so it is not enough to explain what each separate agent does. We must understand how parts are interrelated – how groups of agents can accomplish things.

Seen by itself, as an agent, Builder is just a simple process that turns other agents on and off. Seen from outside, as an agency, Builder does whatever all its sub-agents accomplish, using one another’s help:

https://files.polyneme.xyz/dropshare/agents-and-agencies-xWtuG28LbO.png

Builder seems to lead a double life. As agency, it seems to know its job. As agent, it cannot know anything at all.

And knowing how is not the same as knowing that. If while performing an activity expertly you find yourself consciously reflecting on what you are doing and the rules for doing it, chances are you will experience a severe degradation of performance – you fall victim to “knowing that” as it interrupts and replaces your “knowing how”.

Hermans¹ highlights three kinds of confusion when reading code: lack of knowledge, lack of information, and lack of processing power. These also reflect a gradual descent from know-how focus to know-that focus, from seeing agencies to tracing the wiring-together and actions of individual agents.

So it may be with a society of FAIR digital resources – agents may be distributed, and yet agency may be coherent.

F. Hermans, The Programmer’s brain: what every programmer needs to know about cognition. Shelter Island, NY: Manning, 2021. ↩︎

Common Sense

2022-03-27T09:32:59-04:00

We found a way to make a tower builder out of parts. But Builder is really far from done.

For example, how could Find determine which blocks are still available for use? It would have to “understand” the scene in terms of what it is trying to do.

We’ll need theories both about what it means to understand and about how a machine could have a goal.

Consider all the practical judgments that an actual Builder would have to make. It would have to decide whether there are enough blocks to accomplish its goal and whether they are strong and wide enough to support the others that will be placed on them.

By the time we are adults, we regard all of this to be simple “common sense”. But that deceptive pair of words conceals almost countless different skills.

Common sense is not a simple thing. Instead, it is an immense society of hard-earned practical ideas – of multitudes of life-learned rules and exceptions, dispositions and tendencies, balances and checks.

Dreyfus calls the ability to intuitively respond to patterns without decomposing them into component features “holistic discrimination and association”¹:

When things are proceeding normally, experts don’t solve problems and don’t make decisions; they do what normally works.

As each new group of skills matures, we build more layers on top of them. As time goes on, the layers below become increasingly remote until, when we try to speak of them in later life, we find ourselves with little more to say than “I don’t know.”

H. L. Dreyfus and S. E. Dreyfus, Mind over machine: the power of human intuition and expertise in the era of the computer. New York: The Free Press, 1988. ↩︎

The World of Blocks

2022-03-25T13:40:35-04:00

Imagine a child playing with blocks. Imagine the child’s mind contains a host of smaller minds - “agents”. Imagine an agent called Builder in control. Builder makes towers from blocks:

Builder is a simple agent. It needs help from several other agents to choose a place to start the tower, add a new block to the tower, and decide when it is high enough:

But doesn’t Add have a big job as well, too big for a single, simple agent? First Add must Find a new block, then the hand must Get that block and Put it on the tower top:

Why break things into such small parts? Because minds, like towers, are made that way – except they’re composed of processes instead of blocks. Scientific workflows are also made that way.
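
The decomposition described above can be sketched as plain functions, each doing one small job. The function names mirror the agents in the text; the list-based pile and tower, the rule of picking the first available block, and the fixed target height are assumptions made only for illustration.

```python
def find(available):
    """Find: pick a block that is still available (here, simply the first)."""
    return available[0]


def get(block, available):
    """Get: take the chosen block out of the pile."""
    available.remove(block)
    return block


def put(block, tower):
    """Put: place the block on top of the tower."""
    tower.append(block)


def add(available, tower):
    """Add: Find a block, Get it, then Put it on the tower."""
    put(get(find(available), available), tower)


def builder(available, height):
    """Builder: start a tower, Add blocks, and stop when it is high enough."""
    tower = []
    while len(tower) < height and available:
        add(available, tower)
    return tower


pile = ["red", "green", "blue", "yellow"]
tower = builder(pile, 3)
print(tower, pile)
```

Note that `builder` contains no block-handling logic of its own; it only turns `add` on until the tower is done.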

Does making stacks of blocks seem insignificant? You probably didn’t always feel that way. You may have spent joyful weeks in early childhood on building stacks of blocks. As grown-ups, we all know how to do such things, but how did we learn to do them?

This forgetfulness, this amnesia of infancy, applies to expert scientific practice as well. All our wonderful methods weren’t always inside our minds – they began and grew, like we begin and grow computer-simulation programs and data analysis pipelines.

The Society of Method

2022-03-24T13:00:48-04:00

For a method, a protocol, thought about and done by you, what’s a “you”? What kinds of smaller entities may cooperate to carry out a procedure? Try this: pick up a cup of tea. Imagine that

your GRASPING agents want to keep hold of the cup,
your BALANCING agents want to keep the tea from spilling out,
your THIRST agents want you to drink the tea, and
your MOVING agents want to get the cup to your lips.

If each does its own job, the really big job will get done by all of them together: drinking tea.

We’re always doing several things at once, like planning and walking and talking. These processes involve more machinery than anyone can understand at once.

In the next few notes, we’ll focus on one ordinary activity - making things with children’s building blocks. In doing this, we’ll try to imitate how Galileo and Newton learned so much by studying the simplest kinds of pendulums and weights, mirrors and prisms.

By focusing a microscope on simple objects, we hope to open up a great and unexpected universe, for the same reason that so many biologists nowadays devote more attention to tiny germs and viruses than to magnificent lions and tigers. The work of artificial-intelligence researchers with children’s blocks can be the prism and the pendulum for studying scientific procedure.

In science, one can learn the most by studying what seems the least.

The Research Process and the Bits

2022-03-23T14:34:09-04:00

How could solid-seeming computer hardware support such a ghostly thing as research progress? The world of scientific inquiry and the world of bits appear too far apart to interact in any way.

A few centuries ago it seemed impossible to explain Life - living things appeared to be so different from anything else. Last century, von Neumann helped show how cell-machines could reproduce. Watson and Crick helped show how cells make copies of their hereditary code.

How does scientific process work? Gödel and Turing helped reveal the range of what machines could be made to do. McCulloch and Pitts in the 1940s began to show how machines might be made to see, reason, and remember.

We are still far from being able to create machines that do all of the scientific work that people do. But we also need better theories about how knowledge is acquired and applied, how data is interpreted, etc. Tiny machines, “agents”, could be “particles” helpful in constructing such theories and operationalizing them.
