The "20 Queries" Heuristic

A scientific database cannot be everything to everyone. Jim Gray came up with the “20 queries” heuristic: what are the 20 most important questions the researchers want the data system to answer?[1]

Five questions are not enough to see a broader pattern, and 100 questions would dilute focus. Also, if queries are ranked by importance, the information each one adds likely falls off along a “long tail” distribution, so the cumulative information grows roughly logarithmically with the number of queries. Thus, you’re not likely to get five times more information from collecting 100 queries rather than 20.
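To make the diminishing-returns argument concrete, here is a minimal sketch. It assumes, purely for illustration, that the k-th most important query contributes information proportional to 1/k. That model and the function name `cumulative_information` are my assumptions, not something taken from Gray's work.

```python
# Assumption (not from the source): the k-th most important query
# contributes information proportional to 1/k, a simple long-tail model.

def cumulative_information(n_queries: int) -> float:
    """Total information from the top n queries under the 1/rank model."""
    return sum(1.0 / rank for rank in range(1, n_queries + 1))

info_20 = cumulative_information(20)
info_100 = cumulative_information(100)
gain = info_100 / info_20

print(f"20 queries:  {info_20:.2f}")   # 3.60
print(f"100 queries: {info_100:.2f}")  # 5.19
print(f"gain from 5x more queries: {gain:.2f}x")  # 1.44x
```

Under this assumed model, collecting five times as many queries yields only about 1.4 times the information. A different long-tail model would give different numbers, but the same qualitative picture.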

The goal of this step in a design process is to bridge the semantic gap between the vocabulary used in the scientific domain and the schema of the database, and to help domain scientists and database engineers discuss design trade-offs that result in performance trade-offs.

Example Exercise

I went through this exercise for an existing database, that of the Materials Project (MP).[2] MP has a discussion forum[3] that has been active for a few years. Tens of threads have over a thousand views each. I went through the threads, sorted by descending view count, to pick out 20 queries posed.
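The selection procedure can be sketched roughly as follows. The `Thread` record, its field names, and the placeholder data are hypothetical. In practice I read each thread and wrote down its queries by hand, so the `queries` field stands in for that manual step.

```python
from dataclasses import dataclass, field

@dataclass
class Thread:
    # Hypothetical record for one forum thread; field names are assumptions.
    title: str
    views: int
    queries: list[str] = field(default_factory=list)  # noted by hand after reading

def collect_queries(threads: list[Thread], target: int = 20) -> list[str]:
    """Walk threads from most to least viewed, gathering queries until
    at least `target` have been collected."""
    collected: list[str] = []
    for thread in sorted(threads, key=lambda t: t.views, reverse=True):
        collected.extend(thread.queries)
        if len(collected) >= target:
            break
    return collected

# Placeholder data only; these are not real forum threads or view counts.
threads = [
    Thread("placeholder thread A", 300, ["query a1"]),
    Thread("placeholder thread B", 4200, ["query b1", "query b2"]),
    Thread("placeholder thread C", 1500, ["query c1"]),
]
picked = collect_queries(threads, target=3)
print(picked)  # ['query b1', 'query b2', 'query c1']
```

Because a whole thread's queries are added at once, the result can overshoot the target slightly.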

Many of the queries I found are residual queries, that is, queries that post authors were not able to answer clearly using MP’s existing interface. However, some were straightforward to answer via the existing system, as evidenced by replies to those posts. In either case, the queries I picked were both sought by users of the system and viewed by many other active or potential users.

I focused on queries about obtaining data rather than about the methodology of data collection. Obtaining data includes obtaining metadata about an understood methodology, e.g. parameter values supplied to programmatic methods.

I collected approximately 20 queries; I stopped myself at 24. Some I formatted as questions, others as stated desires or intentions. I then tried to identify clusters and thus intended query patterns. I identified six clusters[4], and I removed one query as a duplicate of another. Here are my results:

This is only one step in a design process, but I hope this example helps you better understand the “20 queries” heuristic and how you might apply it to your work in designing data systems appropriate for domain specialists.

  1. Szalay and Blakeley, “Gray’s Laws: Database-centric Computing in Science”, in The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009. ↩︎

  2. ↩︎

  3. ↩︎

  4. I aim to chunk items recursively in groups of at most 5±2, a hedge on Miller’s 7±2. ↩︎