Data Reduction for Science

Earlier this week, I wrote:

In sharing scientific research data, the goal is often to provide data reductions to the extent possible without loss – the output is, in a strong sense, equivalent to the input. Any further reduction that may be necessary to support decisions and policy can be done as fusion beyond unification.

As luck would have it, the U.S. Department of Energy (DOE) posted a funding opportunity announcement (FOA) yesterday on Data Reduction for Science:

Scientific observations, experiments, and simulations are producing data at a rate beyond our capacity to store, analyze, stream, and archive. This data almost always contains redundancies and trivialities that hide the important information of interest to scientists. Of necessity, many research groups have already begun reducing the size of their data sets…These efforts should be expanded to include mathematical rigor to ensure that scientifically-relevant constraints on quantities of interest are satisfied, to be integrated into scientific workflows, and to be implemented in a manner that inspires trust that the desired information is preserved.

Efforts to identify and deal with this issue go back decades, with cute acronyms for such data like ROT (Redundant, Obsolete, and Trivial), WORN (Write Once, Read Never), and WORSE (Write Once, Read Seldom if Ever). However, the DOE FOA highlights that it is not enough to assign research data to separate storage tiers (“hot”, “cold”, etc.): we must seek to avoid storing ROT data in the first place.
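
To make the FOA's call for "mathematical rigor" a little more concrete, here is a minimal sketch of a reduction that comes with a checkable guarantee. It is only an illustration under my own assumptions, not anything the FOA prescribes: the function names, the choice of uniform quantization, and the choice of the mean as the "quantity of interest" are all hypothetical.

```python
import random
import statistics


def reduce_with_bound(values, max_abs_error):
    """Quantize floats to integer bin indices (hypothetical helper).

    Bins are 2 * max_abs_error wide, so rounding to the nearest bin
    moves each value by at most max_abs_error.
    """
    step = 2.0 * max_abs_error
    codes = [round(v / step) for v in values]
    return codes, step


def reconstruct(codes, step):
    """Map integer bin indices back to representative float values."""
    return [c * step for c in codes]


def constraints_hold(original, restored, max_abs_error, slack=1e-12):
    """Check the pointwise bound and one quantity of interest (the mean).

    The small slack absorbs floating-point rounding in the arithmetic.
    """
    pointwise = max(abs(a - b) for a, b in zip(original, restored))
    mean_shift = abs(statistics.fmean(original) - statistics.fmean(restored))
    limit = max_abs_error + slack
    return pointwise <= limit and mean_shift <= limit


# Made-up sample data standing in for real measurements.
random.seed(0)
values = [random.gauss(20.0, 5.0) for _ in range(1000)]

codes, step = reduce_with_bound(values, max_abs_error=0.01)
restored = reconstruct(codes, step)
ok = constraints_hold(values, restored, max_abs_error=0.01)
```

The point of the sketch is the last step: the reduction is accepted only after a check that the stated constraint on the quantity of interest holds, which is one way a reduction could "inspire trust that the desired information is preserved."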
