Framing Aggregates with WINDOW Functions
Welcome! This session assumes an intermediate understanding and usage of SQL
WINDOW functions in SQL allows to change the lens of the same data we are querying by adding context to the values being used. By design, window functions help you aggregate numbers or metadata so that each data row also contains general information related to the attributes
What we're doing today
By the end of today, you will have become familiar with two window functions in SQL that allow you to group and aggregate values
You will have the chance to follow along
Let's Get Into It
Loading and Previewing Data
Before we do anything to our data, let's see and understand it! By the end of this, we will know what our data table looks like, what each column represents, and what the smallest grain is i.e. what combination of attributes makes a row unique
Notice we have 135,369 rows in our table. This is going to be important in understanding the difference between a regular aggregation query and a window function aggregation query. Let's begin!
Now, our functions
Let's try it!
ORDER BY - How would we track the volume of passengers changes each day by departure airport?
There are three parts of grouping to consider. We want
First, the typical grouping usage would look something like this
Remember the number of rows? In an aggregation, we lose grains -> we've gone from 135,369 rows to 220 rows and can't select the details of each passenger and route on each day. Keep this fact in your back pocket for now. Let's stay at this aggregate table and assume we got this as our starting dataset - just a count of departure totals.
Now what if we wanted to get the change over time? In the above case, you may create a new column after downloading this. Maybe in Excel with a pivot table. But what if we could get the cumulative number of customers each day with SQL?! Here come our window functions!
What did we just do?
You could show more information in this example by adding the destination field to your selection. This does not change the math, it extends the scope of your row. Remember - we are not hiding information with windows, we are improving our vantage point.
Putting it all together
Let's combine the two previous steps into one. We now want to start from our original table of 135,369 rows, and select two values along with each row
Notice: you don't have to use a GROUP BY function with the aggregation above because you are not reducing rows (we are still at 135,369 rows). You are adding more information to each row. Hence the window opening 😆 That means, if you were looking at any single passenger's itinerary on a given day, you will also be able to see how many passengers total left that same passenger's departure airport (regardless of destination) on that day, and how many have departed so far up to and including that day.
‼️ Once again, be mindful to add your fields in the order you want them to be partitioned and ordered.
Now let's select one departure so we can see how the count changes with each day to make sure we got what we wanted
We can see for each of our 20 days in October, the count for each day's departure from Detroit, and the running total as the days increment!
Let's try this again with the route (departure - destination combination)
The PARTITION BY clause is extendable by adding more variables in the order with which you want to partition AND order. Now we can group by not just the departure, but also the destination AND keep accumulating by the day. This is a smaller slice to aggregate, but again - we are not changing the number of rows, we are just adding more information to it.
Our selection gives us a result table named as running_totals_by_route which shows us how many people traveled from one airport to another on a day, and up to a given (inclusive) day. Below, we select just one route to see that I'm not fibbing
We did it!! In the 20 days of data we have, we can see the count of passengers each day, and the incrementing sum as days pass on this route
How can we make this more robust?
While our queries above work great, this is just a sample dataset. You may want to account for robustness and edge cases when you move from a proof of concept to production queries.
Let's improve this for a wider time series
The below changes demonstrate adding a month-year dimension to account for a bigger dataset, so that our grouping and cumulative sums are accurate and restart at each new month date <cue Bone Thugs N Harmony - 1st of the month>
Further Learning
I hope you have enjoyed learning about this window function. To productionize your data queries, you may use the WINDOW function to define a specific window through which to apply aggregations for your rows. This function works with MySQL databases and in BigQuery. There are other ways to aggregate information for both numerical and string values. This was an introduction, so here are some more resources to expand your learning!
Resources
For further learning, here are some of my favorite resources. Happy adventures in SQL 🥰