Request Your Cake and Eat It Too

I was alerted to a great discussion[1] about unifying versus partitioning data models. That is, you have some data powering some part of your system, and you need to decide how to structure that data and how to associate validation logic and behavior – like calculating derived properties or triggering other system actions – with that bundle of data as it evolves over the course of some ongoing or completing process.

The concrete example given is an application with a /cakes endpoint. You might POST a JSON object to it with {name,ingredients,inventor_id} attributes, and get back a JSON object with {name,ingredients,id,status} attributes. When you create a cake, the cake doesn’t yet have an id in the system, and it doesn’t have a system status, e.g. “baking”. Conversely, some attributes may be relevant in the request but not in the response, e.g. a system ID for the cake’s inventor.

Approaches to data modeling here include (1) having e.g. a CreateCakeRequest model for the POST body and a CreateCakeResponse model for the response, and (2) having one Cake model that gets passed around and updated. The former approach is one of partitioning and hand-off. The latter is one of unifying the state space into a single data structure.
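
To make the contrast concrete, here’s a minimal Pydantic sketch of both approaches. The model names come from the example above; everything else about the fields is my own assumption:

from typing import Optional
from pydantic import BaseModel

# Approach (1): partitioned models with an explicit hand-off.
class CreateCakeRequest(BaseModel):
    name: str
    ingredients: list[str]
    inventor_id: str

class CreateCakeResponse(BaseModel):
    name: str
    ingredients: list[str]
    id: str
    status: str

# Approach (2): one unified model; fields that don't exist yet are optional.
class Cake(BaseModel):
    name: str
    ingredients: list[str]
    inventor_id: Optional[str] = None
    id: Optional[str] = None
    status: Optional[str] = None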

There are two other approaches I’ve seen and implemented that I guess technically take approach (2), but also dispatch to approach (1) “under the hood”. One tries to be principled and explicit about the under-the-hood partitioning, and the other just kinda wings it, emphasizing flexibility.

The principled, explicit-partitions approach is exemplified by statecharts, where your data model is a unified “machine”. Below I show a little Python pseudocode that reflects XState’s JSON representation of a model. There is a privileged data attribute called state that the machine runtime switches on in order to respond to incoming events/requests. In this way, the “one” model dispatches to finitely many little models, all of which have access to the shared context structure.

create_machine({
    "id": "cake-machine",
    "initial": "emptiness",
    "context": {
        "inventor_id": None,
        "ingredients": [],
        "name": None,
        "cake_id": None,
    },
    "states": {
        "emptiness": {
            "on": {
                "CREATE": {
                    "target": "baking",
                    "actions": assign({
                        "inventor_id": "set_inventor_id",
                        "name": "set_name",
                        "ingredients": "set_ingredients",
                        "id": "set_id",
                    }),
                }
            }
        },
        "baking": {
            "on": {
               1800: {  # 30 minutes (30 * 60 seconds)
                   # ... check on cake ...
               } 
            }
        }
    }
}, {
    "actions": {
        "set_inventor_id": lambda context, event: event.inventor_id,
        "set_name": lambda context, event: event.name,
        "set_ingredients": lambda context, event: event.ingredients,
        "set_id": lambda _c, _e: generate_cake_id(database),
    }
})
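
Continuing the pseudocode, driving this machine might look like the following, where interpret is an assumed stand-in for an XState-style runtime hosting the machine:

# Hypothetical usage, assuming an XState-style interpret()/send() runtime:
service = interpret(cake_machine).start()  # machine enters "emptiness"
service.send({
    "type": "CREATE",
    "inventor_id": "inventor-42",
    "name": "basque cheesecake",
    "ingredients": ["eggs", "cream cheese", "sugar"],
})  # machine assigns the context fields and transitions to "baking"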

A similar explicit-partitions approach is customary for Elm applications, where a unified model is fed along with a message, i.e. an event/request, to a unified update function that again uses case analysis to switch on the appropriate subspace of logic. In statecharts, the switch is on the state attribute, whereas for Elm logic, the switch can be two-staged on the message type or model variant, as both may be custom tagged-union types that are simple to switch on.
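
Elm’s Model and Msg types don’t translate literally to Python, but here’s a rough analogue of that shape, using dataclasses as tagged unions and structural pattern matching (the cake-flavored details are mine):

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Cake:  # the unified model
    name: str | None = None
    ingredients: tuple[str, ...] = ()
    status: str = "nonexistent"

# Messages form a tagged union: one class per variant.
@dataclass(frozen=True)
class Create:
    name: str
    ingredients: tuple[str, ...]

@dataclass(frozen=True)
class CheckOven:
    pass

Msg = Create | CheckOven

def update(msg: Msg, model: Cake) -> Cake:
    # One unified update function; case analysis switches on the
    # message variant and, where it matters, the model's status.
    match (msg, model):
        case (Create(name=n, ingredients=i), Cake(status="nonexistent")):
            return replace(model, name=n, ingredients=i, status="baking")
        case (CheckOven(), Cake(status="baking")):
            return replace(model, status="done")
        case _:
            return model  # a message that doesn't apply in this state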

A less partition-oriented, more flexible approach is exemplified by models that each cast/conform input data to their own little worlds yet get persisted to the same underlying unified data structure. I took this approach recently in implementing a superset of the Global Alliance for Genomics and Health’s (GA4GH) Data Repository Service (DRS) API in Python using Pydantic and FastAPI.

In the GA4GH DRS API, a DrsObject is the central resource. It tells you about some bytes you can access somehow, like via a URL. Since I added endpoints to create such objects and to tag them with “types” that are outside the scope of the current API spec, I needed different models that present different interfaces to the same underlying data, like a bunch of people all having different perspectives while standing around an elephant. Pydantic helped because, by default, you can pass extra fields to a model constructor and it simply ignores them. And FastAPI helped because you can register Pydantic models as the request and response models of each endpoint, which means I can yield the whole enchilada from an endpoint function and FastAPI will feed it to the registered Pydantic response model to be slimmed down / conformed for the client as per the OpenAPI contract.
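
A toy illustration of those two behaviors together – none of this is the DRS code; the cake endpoint and field names are made up:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class CakeOut(BaseModel):  # the client-facing contract
    name: str
    status: str

# Extras are ignored by the constructor under Pydantic's default config:
CakeOut(name="torte", status="baking", internal_note="shh")  # extras dropped

@app.get("/cakes/{cake_id}", response_model=CakeOut)
def get_cake(cake_id: str):
    # Return the whole enchilada; FastAPI feeds it through CakeOut,
    # which drops internal_note and serves only the declared fields.
    return {"name": "torte", "status": "baking", "internal_note": "shh"}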

Concretely, I have DrsObjectBase, DrsObjectIn(DrsObjectBase), and DrsObject(DrsObjectIn) Pydantic models (code here) that power a family of FastAPI /objects endpoints (code here), and complementary /object_types models (code) and endpoints (code). All of the coordinated models (including validation via Pydantic @*validator-decorated methods) and representational state transfer facilitated by the API are backed by a MongoDB collection that stores one JSON object per DrsObject, including an embedded array of types that is shielded from the read-only, spec-adhering DRS API endpoints.
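
The layering looks roughly like this – a schematic sketch, not the linked code, with only a small, assumed subset of the fields:

from pydantic import BaseModel, validator

class DrsObjectBase(BaseModel):
    # fields shared by input and output representations
    name: str
    size: int

class DrsObjectIn(DrsObjectBase):
    # what a client may POST; the "types" extension lives here
    types: list[str] = []

    @validator("types")
    def known_types_only(cls, v):
        # stand-in for the real validation logic
        return v

class DrsObject(DrsObjectIn):
    # what gets stored and served; system-assigned fields appear here
    id: str
    self_uri: str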

Would I be better off with a partition-oriented data modeling approach for the DRS API implementation? I’m not sure. In any case, it would need to be done “under the hood”, e.g. with a _state attribute or similar on the underlying MongoDB document in my case.


References

  1. Topic on the Recurse Center’s Zulip realm. Private to Recursers, but if this post interests you, consider applying for a batch! Ping me if you have any questions.