Data Lifecycles

September 17, 2020

I was asked about a positioning statement for my business, to describe who I work with and what outcome I help them with: something like “Helping X with Y”. I gave this:

Helping researchers with data discovery, integration, and dissemination.

The response I got was “Great start. What type of researchers would be an ideal fit?” I thought I had already narrowed my audience sufficiently: I’m not helping “people”, I’m helping researchers. But I can go further, not to limit whom I work with, but to focus my marketing effort for the time being. Also, “research” is pretty inclusive. There are marketing researchers, investment researchers, sociology researchers, etc.

An ideal fit for me is materials researchers. I have a lot of experience in this domain, from small-batch experimental processing and characterization in clean-room environments to high-throughput simulations and analyses in supercomputing centers. The challenges and applications excite me, and I like the people.

Okay, so “Helping materials researchers with data discovery, integration, and dissemination”. That last part seems like a mouthful. Can I focus again, and pick one of the three as most important for now? I thought about this, and it was hard to pick one because they seem so related. How so? And can this lead to a cogent synthesis – a “one thing”?

Discovery, integration, and dissemination of data happen in sequence for a given study. There may be some iterative looping back, of course, but you first identify any existing data you can use, in addition to whatever data you intend to collect deliberately by experiment or simulation. Next, you must ensure the discovered data is interoperable with your own data, and integrate the two for analysis. Finally, you disseminate the data, hopefully as more than just text and figures in an article.

These stages differ in urgency within a given study. Data discovery is of medium urgency – it doesn’t really block the later steps. If you skimp on it, perhaps you don’t take as much advantage of previous work as you’d like to. You’d like it done well, because it enriches the quality of the study, but you don’t want it to hold up everything else; it has some urgency, yet you also feel like you can loop back later.

Data integration is of high urgency – it blocks progress. You get stuck on analysis if you can’t combine the data you need in a suitable way. You can’t disseminate results because you can’t produce them.

Data dissemination is of low urgency. Was the manuscript accepted? Great. Gotta get going on the next study. Making the final data easily findable, accessible, interoperable, and reusable by others is a nice-to-have, but it doesn’t block the study. The study’s “done”.

Here’s the thing: Great data dissemination makes that data discoverable and integrable. An ideal discovery process is one that facilitates continuous integration. And an ideal integration process is one that makes the outputs of analyses easy to reuse and by extension to disseminate. Data has a lifecycle, from discovery, to integration, to dissemination, to re-discovery – by others in the research community, by collaborators, or by someone else in your lab – or you – six months or more from now.

I help materials researchers with data lifecycle management. Yes, it needs unpacking, but it’s a single, whole outcome, and I hope it makes sense to you.