Are You Just Creating Another Data Silo?

You have a data-intensive research problem. Custom software will help you solve it. Code is written. A dataset is collected to feed the code. Did you just create another silo?

What would it mean to be data-centric instead: a single data platform, with applications built on top of it that come and go?

In your group, who decides what kind of database to use? If you hand a student a task and ask them to go solve it, they’re going to pick whichever tool is easiest to use. Off they go, and after some time the task is done. They’ve also created a new silo: a dataset or application that stands apart from everything else. They didn’t think about how the new application would work with every other application relevant to your group. That was not their role. They were given a specific problem to solve. They’ll publish an article, mint DOIs for the dataset and codebase, and that’s that.

Are you perpetuating a culture of creating a new silo for every problem? At the core of the Semantic Web is the notion of caring deeply about identity, the identity of objects. For intelligent data integration, you have to be sure that you give the same thing the same name, and you need clear mechanisms to relate things. You can’t ask group members to make these decisions on their own.
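
To make the identity point concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the project names, the local sample names, the measurements, and the example.org identifier are made up for illustration. It assumes the group has agreed on one shared identifier per object and keeps an explicit mapping from each silo's local name to that identifier.

```python
# Hypothetical shared identifier, agreed on by the whole group.
SAMPLE = "https://example.org/id/sample/42"

# The "clear mechanism to relate things": each silo's local name
# is mapped explicitly to the shared identifier.
local_to_shared = {
    ("project_a", "S-42"): SAMPLE,
    ("project_b", "sample_42b"): SAMPLE,
}

# Two made-up datasets that describe the same sample under different names.
project_a_records = [{"name": "S-42", "mass_g": 1.25}]
project_b_records = [{"name": "sample_42b", "color": "blue"}]


def integrate(datasets):
    """Merge records from several datasets, keyed by shared identifier."""
    merged = {}
    for source, records in datasets.items():
        for record in records:
            shared = local_to_shared[(source, record["name"])]
            entry = merged.setdefault(shared, {})
            for key, value in record.items():
                if key != "name":
                    # Prefix each field with its source to keep provenance.
                    entry[f"{source}.{key}"] = value
    return merged


merged = integrate({
    "project_a": project_a_records,
    "project_b": project_b_records,
})
```

Because both local names resolve to the same identifier, the two records collapse into one entry. Without the mapping, nothing in the data says that "S-42" and "sample_42b" are the same object, and that is exactly the decision you can't leave to each group member.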

This post was adapted from a note sent to my email list on Scientific Data Unification.