TPC-DS Query 5
- Pull Request
- Play with the query and results using the Fiddle.
Query 5 was somewhat challenging to implement, and initially required some functionality that didn’t yet exist in Malloy. As luck would have it, the necessary feature was literally in the works as I was asking for it, and landed a couple of days later.
The query computes the total profit (sum of sales) and total loss (sum of returns), aggregated at three different levels of granularity: overall, by channel category (store, web, or catalog), and by channel ID (e.g., catalog page, store ID, or web site). It does this by liberal use of the UNION ALL
operator, first unioning sales and returns for each of the channel categories, then again unioning the combined outputs of each category to create an overall result.
Malloy doesn’t currently have a native UNION ALL
construct. It does however provide the ability to construct a Source based off of a SQL query, which is a useful escape hatch in cases like this where a specific piece of functionality isn’t available. I started out writing a fairly large SQL block to perform the UNION ALL
between each of the store
, web
, and catalog
tables, but the web
subquery (lines 75-77 in the SQL query) contains a JOIN
clause to get the web site for a particular return. It felt a bit like cheating to include this join in a SQL block, since in my mind, any joins should be expressed as part of the semantic model. Lucky for me, the Malloy team recently released what they called the “SQL Turducken”, or the ability to embed a reference to a Malloy query inside of a SQL block. This would allow me to define the join between web_returns
and web_sales
in a Malloy source, and then perform the UNION ALL
within the SQL block. Each of catalog
, web
and store
has a set of sources defined like the following:
I UNION
them together to get all_sales
and all_returns
, combining each of the channel categories into a single table for sales and single table for returns:
I used one final Turducken to create a single source that combines data from the sales
and returns
sources, and defined aggregation queries for each of the two levels of granularity, channel_category
and channel_id
The nested aggregations in the Malloy query are much nicer than the equivalent SQL GROUP BY ROLLUP
output. I wrote about the problems with GROUP BY ROLLUP
here. In a nutshell, because the value column (e.g., sales
or profit
in the query above) of a GROUP BY ROLLUP
result contains values at different levels of granularity, the output cannot be used by downstream queries unless filters are applied to restrict the output to a single level of granularity. Malloy avoids this by using struct
data types in nested columns to embed sub-tables. This allows Malloy to respect the hierarchical relationship of data, and most crucially, preserves the composability of resultsets. The output of a nested Malloy query can be used freely by downstream queries or transformations without having to worry about hidden traps.
Thanks for tuning in! As always, any feedback, comments, or questions are welcome. Drop me a line on the Malloy Slack, username @carl.