Data Stacks for FAIR | Donny Winston

I noticed a pattern at the top of each case study listed by Stemma.ai, which provides data catalog software as a service based on the open-source Amundsen code. Each case study’s so-called “Data Stack” comprises up to four distinct categories of functionality – Data Catalog, Data Warehouse, ETL, and Business Intelligence.

The “Data Stack” for each case study:

Case	Data Catalog	Data Warehouse	ETL	Business Intelligence
Lyft	Amundsen	Presto	Apache Airflow	Mode, Apache Superset
Convoy	Stemma	Snowflake	dbt, Apache Airflow	Tableau, Metabase
iRobot	Stemma	Amazon Athena	(blank)	Mode
ING	Amundsen	Trino (formerly Presto SQL)	(blank)	Apache Superset

These categories struck me in relation to the FAIR Principles¹:

A Data Catalog is about making data Findable.
A Data Warehouse is about making data Accessible.
An ETL platform, aka a Data Orchestration² platform, is about making data Interoperable.
A Business Intelligence (BI) tool is about making data Reusable, aka Repurposeable.
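
As a rough illustration, the mapping above and the case-study table can be written down as plain data. This is only a sketch: the names `FAIR_BY_CATEGORY`, `DATA_STACKS`, and `unlisted_principles` are hypothetical, and an empty list stands for a "(blank)" cell, which means only that the case study lists no tool in that category.

```python
# Hypothetical names throughout; this is an illustrative sketch, not part of any tool.

# The category-to-principle mapping proposed in the list above.
FAIR_BY_CATEGORY = {
    "Data Catalog": "Findable",
    "Data Warehouse": "Accessible",
    "ETL": "Interoperable",
    "Business Intelligence": "Reusable",
}

# The "Data Stack" of each case study; an empty list stands for "(blank)".
DATA_STACKS = {
    "Lyft": {
        "Data Catalog": ["Amundsen"],
        "Data Warehouse": ["Presto"],
        "ETL": ["Apache Airflow"],
        "Business Intelligence": ["Mode", "Apache Superset"],
    },
    "Convoy": {
        "Data Catalog": ["Stemma"],
        "Data Warehouse": ["Snowflake"],
        "ETL": ["dbt", "Apache Airflow"],
        "Business Intelligence": ["Tableau", "Metabase"],
    },
    "iRobot": {
        "Data Catalog": ["Stemma"],
        "Data Warehouse": ["Amazon Athena"],
        "ETL": [],
        "Business Intelligence": ["Mode"],
    },
    "ING": {
        "Data Catalog": ["Amundsen"],
        "Data Warehouse": ["Trino"],
        "ETL": [],
        "Business Intelligence": ["Apache Superset"],
    },
}


def unlisted_principles(stack):
    """Return the FAIR principles for which the stack lists no tool."""
    return [
        principle
        for category, principle in FAIR_BY_CATEGORY.items()
        if not stack.get(category)
    ]


for case, stack in DATA_STACKS.items():
    print(case, unlisted_principles(stack))
```

Read this way, the two "(blank)" ETL cells show up as case studies with no tool listed under Interoperable. That says nothing about whether those organizations actually run an ETL platform.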

It’s encouraging to see high-level alignment between the FAIR Principles and a conceptualization of useful enterprise data systems in the corporate world.

References

1. M. D. Wilkinson et al., “The FAIR Guiding Principles for scientific data management and stewardship,” Sci Data, vol. 3, no. 1, p. 160018, Mar. 2016, doi: 10/bdd4.
2. I think a more apt term here than Data Orchestration, which has an imperative tone, may be Data Reconciliation, which has a declarative tone. See, e.g., S. Ryza, “Introducing Software-Defined Assets,” Dagster Blog, Mar. 2022. https://dagster.io/blog/software-defined-assets (accessed May 31, 2022).