Interview with Shreyas Cholia
This week on Machine-Centric Science, I interviewed Shreyas Cholia, currently at the Lawrence Berkeley National Laboratory in Berkeley, California.
Topics we discussed included data lifecycles, edge computing for data firehoses, provenance, standards, broad versus detailed domain vocabularies, scope for common APIs, and identifier leveling.
Quotable Quotes
“Maybe what that really means is that this publication step, so to speak, just needs to be pushed further upstream.”
“Maybe it’s just conceptualizing the data lifecycle as being not so much a linear thing as much as it is just a bunch of different steps that could be applied to the data at different stages, and really any of those steps could happen at any time.”
“There’s a little bit of a disconnect right now…each domain tends to have a lot of detail that gets obscured by these high-level specifications…we’re seeing some interesting friction…things that evolved from different spaces, it’s interesting to see how they’re trying to come together now.”
“The holy grail is…everyone can look at everything and everyone can talk to each other…in this dataset, that’s what this column means and that’s what this field means and that’s how I can compare these two things.”
“There’s a lot more to harmonization than just making sure things are in the same unit.”
“The driving force here is more about machine readability and machine interpretability of the data.”
“That one’s tricky…it’s a little bit of a moving target in terms of where you see scientific value occurring.”
“So much of what matters is at the metadata level…If that’s different for different domains, which it will be, having the ‘one API to rule them all’ doesn’t really make a lot of sense.”
“At the highest level, DOIs are great…there are, though, a lot of identifiers that are kind of not ‘DOI-level’ identifiers…more low-level for tracking and provenance…down to the level of the individual datum…a row in a spreadsheet, or a single JSON object.”
“It’s never too late to start thinking about coming together and trying to standardize your data…Please also spend a lot of time seeing what’s out there and trying to work with existing standards and trying to be a part of the broader ecosystem rather than doing your own thing.”
Sharing is caring!
If you enjoyed this episode, please consider sharing it with a few friends who might find it useful. Thanks!