Data Stacks for FAIR

I noticed a pattern at the top of each case study listed by Stemma.ai, which offers a software-as-a-service data catalog built on the open-source Amundsen code. Each case study’s so-called “Data Stack” comprises up to four distinct categories of functionality: Data Catalog, Data Warehouse, ETL, and Business Intelligence.

The “Data Stack” for each case study:

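| Case   | Data Catalog | Data Warehouse              | ETL                 | Business Intelligence |
|--------|--------------|-----------------------------|---------------------|-----------------------|
| Lyft   | Amundsen     | Presto                      | Apache Airflow      | Mode, Apache Superset |
| Convoy | Stemma       | Snowflake                   | dbt, Apache Airflow | Tableau, Metabase     |
| iRobot | Stemma       | Amazon Athena               | (blank)             | Mode                  |
| ING    | Amundsen     | Trino (formerly Presto SQL) | (blank)             | Apache Superset       |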
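
To make the "up to four categories" observation concrete, here is a minimal Python sketch that transcribes the table above into a dictionary and checks which categories each case study fills. The names `DATA_STACKS`, `filled_categories`, and `tools_by_category` are hypothetical, chosen only for illustration; they are not part of Stemma or Amundsen.

```python
# The four "Data Stack" categories named in the case studies.
CATEGORIES = ("Data Catalog", "Data Warehouse", "ETL", "Business Intelligence")

# Transcribed from the table above; an empty list stands for a "(blank)" cell.
DATA_STACKS = {
    "Lyft": {
        "Data Catalog": ["Amundsen"],
        "Data Warehouse": ["Presto"],
        "ETL": ["Apache Airflow"],
        "Business Intelligence": ["Mode", "Apache Superset"],
    },
    "Convoy": {
        "Data Catalog": ["Stemma"],
        "Data Warehouse": ["Snowflake"],
        "ETL": ["dbt", "Apache Airflow"],
        "Business Intelligence": ["Tableau", "Metabase"],
    },
    "iRobot": {
        "Data Catalog": ["Stemma"],
        "Data Warehouse": ["Amazon Athena"],
        "ETL": [],
        "Business Intelligence": ["Mode"],
    },
    "ING": {
        "Data Catalog": ["Amundsen"],
        "Data Warehouse": ["Trino"],
        "ETL": [],
        "Business Intelligence": ["Apache Superset"],
    },
}


def filled_categories(stack):
    """Return the categories (in canonical order) that list at least one tool."""
    return [category for category in CATEGORIES if stack.get(category)]


def tools_by_category(stacks):
    """Collect the distinct tools seen in each category across all case studies."""
    result = {category: set() for category in CATEGORIES}
    for stack in stacks.values():
        for category in CATEGORIES:
            result[category].update(stack.get(category, []))
    return result


if __name__ == "__main__":
    for case, stack in DATA_STACKS.items():
        print(case, "fills", len(filled_categories(stack)), "of", len(CATEGORIES))
    for category, tools in tools_by_category(DATA_STACKS).items():
        print(category, "->", sorted(tools))
```

Run as a script, it prints how many of the four categories each case study fills, followed by the distinct tools listed under each category.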

These categories struck me in relation to the FAIR Principles (Findable, Accessible, Interoperable, Reusable) [1].

It’s encouraging to see high-level alignment between the FAIR Principles and a conceptualization of useful enterprise data systems in the corporate world.


References

  1. M. D. Wilkinson et al., “The FAIR Guiding Principles for scientific data management and stewardship,” Sci Data, vol. 3, no. 1, p. 160018, Mar. 2016, doi: 10/bdd4.

  2. A term I think may be more apt here than Data Orchestration, which has an imperative tone, is Data Reconciliation, which has a declarative tone. See, e.g., S. Ryza, “Introducing Software-Defined Assets,” Dagster Blog, Mar. 2022. https://dagster.io/blog/software-defined-assets (accessed May 31, 2022).