Data Stacks for FAIR
I noticed a pattern at the top of each case study listed by Stemma.ai, which provides data catalog software as a service based on the open-source Amundsen code. Each case study’s so-called “Data Stack” comprises up to four distinct categories of functionality – Data Catalog, Data Warehouse, ETL, and Business Intelligence.
The “Data Stack” for each case study:
Case | Data Catalog | Data Warehouse | ETL | Business Intelligence |
---|---|---|---|---|
Lyft | Amundsen | Presto | Apache Airflow | Mode, Apache Superset |
Convoy | Stemma | Snowflake | dbt, Apache Airflow | Tableau, Metabase |
iRobot | Stemma | Amazon Athena | (blank) | Mode |
ING | Amundsen | Trino (formerly PrestoSQL) | (blank) | Apache Superset |
These categories struck me in relation to the FAIR Principles [1]:
- A Data Catalog is about making data Findable.
- A Data Warehouse is about making data Accessible.
- An ETL platform, aka a Data Orchestration [2] platform, is about making data Interoperable.
- A Business Intelligence (BI) tool is about making data Reusable, aka Repurposeable.
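To make the correspondence concrete, here is a minimal Python sketch that encodes the table and the category-to-principle mapping above as plain dictionaries. The names `CATEGORY_TO_FAIR`, `DATA_STACKS`, and `fair_coverage` are hypothetical, chosen only for illustration, and an empty list stands in for a "(blank)" cell. This is just one way to write the idea down, not anything Stemma or Amundsen provides.

```python
# Mapping proposed in this post: stack category -> FAIR principle.
CATEGORY_TO_FAIR = {
    "Data Catalog": "Findable",
    "Data Warehouse": "Accessible",
    "ETL": "Interoperable",
    "Business Intelligence": "Reusable",
}

# Tools per case study, transcribed from the table above.
# An empty list marks a "(blank)" cell.
DATA_STACKS = {
    "Lyft": {
        "Data Catalog": ["Amundsen"],
        "Data Warehouse": ["Presto"],
        "ETL": ["Apache Airflow"],
        "Business Intelligence": ["Mode", "Apache Superset"],
    },
    "Convoy": {
        "Data Catalog": ["Stemma"],
        "Data Warehouse": ["Snowflake"],
        "ETL": ["dbt", "Apache Airflow"],
        "Business Intelligence": ["Tableau", "Metabase"],
    },
    "iRobot": {
        "Data Catalog": ["Stemma"],
        "Data Warehouse": ["Amazon Athena"],
        "ETL": [],
        "Business Intelligence": ["Mode"],
    },
    "ING": {
        "Data Catalog": ["Amundsen"],
        "Data Warehouse": ["Trino"],
        "ETL": [],
        "Business Intelligence": ["Apache Superset"],
    },
}


def fair_coverage(stack):
    """Return the FAIR principles for which the stack lists at least one tool."""
    return [
        principle
        for category, principle in CATEGORY_TO_FAIR.items()
        if stack.get(category)
    ]


if __name__ == "__main__":
    for case, stack in DATA_STACKS.items():
        print(case, fair_coverage(stack))
```

Under this reading, a blank ETL cell simply means the case study lists no tool for the Interoperable slot. It does not imply the company's data is not interoperable.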
It’s encouraging to see high-level alignment between the FAIR Principles and a conceptualization of useful enterprise data systems in the corporate world.
References
1. M. D. Wilkinson et al., “The FAIR Guiding Principles for scientific data management and stewardship,” Sci Data, vol. 3, no. 1, p. 160018, Mar. 2016, doi: 10/bdd4.
2. A term I think may be more apt here than Data Orchestration, which has an imperative tone, is Data Reconciliation, which has a declarative tone. See, e.g., S. Ryza, “Introducing Software-Defined Assets,” Dagster Blog, Mar. 2022. https://dagster.io/blog/software-defined-assets (accessed May 31, 2022).