Framing Aggregates with WINDOW Functions

Welcome! This session assumes an intermediate understanding of SQL and some experience using it.

Window functions in SQL let us change the lens on the data we are querying by adding context to the values being used. By design, window functions help you aggregate numbers or metadata so that each data row also contains general information related to its attributes.

What we're doing today

By the end of today, you will have become familiar with two aspects of window functions in SQL that allow you to group and aggregate values:

OVER, PARTITION BY, ORDER BY

Aggregation of numerical values

You will have the chance to follow along

Go to deepnote.com

Follow this video to create a project: https://www.youtube.com/watch?v=w55jI5NPw6M

Go into the sql-training-session-202310 directory of my GitHub training repo and download the following files: window_functions_session.ipynb and airport_data.csv

Upload the two files into the Deepnote project you created

Open the notebook and press Run Notebook on the top right section to load your virtual environment

Let's Get Into It

Loading and Previewing Data

Before we do anything to our data, let's see and understand it! By the end of this, we will know what our data table looks like, what each column represents, and what the smallest grain is, i.e. what combination of attributes makes a row unique.

We have a CSV file of airport travel for the month of October 2023 in local US airports

We will assume that our data rows are unique and that the departure and destination airports in a row are never the same, i.e. a passenger can't depart from and arrive at the same airport.

-- checking for a missing value in dates, you can do this with other fields as well
SELECT
    SUM(IF(travel_date IS NULL, 1, 0)) AS travel_date_nulls
FROM 'airport_data.csv'
;

-- Select data so we can save it as a variable called airport_traffic
SELECT *
FROM 'airport_data.csv'
;
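The two assumptions above (unique rows, and departure never equal to destination) can be checked rather than taken on faith. Here is a minimal sketch, assuming Python's built-in sqlite3 as a stand-in for the notebook's SQL engine and a tiny invented table in place of airport_data.csv; the passenger names and rows are hypothetical.

```python
import sqlite3

# Hypothetical miniature of the airport table; all values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE airport_traffic ("
    "travel_date TEXT, passenger_name TEXT, "
    "departure_airport TEXT, destination_airport TEXT)"
)
conn.executemany(
    "INSERT INTO airport_traffic VALUES (?, ?, ?, ?)",
    [
        ("2023-10-01", "Ada", "DET", "MSP"),
        ("2023-10-01", "Ben", "DET", "DCA"),
        ("2023-10-02", "Ada", "DCA", "DET"),
        ("2023-10-02", "Cy", "DCA", "MSP"),
    ],
)

# Any combination that appears more than once breaks the assumed grain.
duplicate_grains = conn.execute(
    """
    SELECT travel_date, passenger_name, departure_airport, destination_airport,
           COUNT(*) AS row_ct
    FROM airport_traffic
    GROUP BY travel_date, passenger_name, departure_airport, destination_airport
    HAVING COUNT(*) > 1
    """
).fetchall()

# Rows where departure equals destination violate the second assumption.
same_airport_rows = conn.execute(
    "SELECT COUNT(*) FROM airport_traffic "
    "WHERE departure_airport = destination_airport"
).fetchone()[0]

print(duplicate_grains, same_airport_rows)
```

If either check returns something on the real file, the row counts later in the session would need a second look.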

Notice we have 135,369 rows in our table. This is going to be important in understanding the difference between a regular aggregation query and a window function aggregation query. Let's begin!

Now, our functions

OVER - attached to an aggregate in your SELECT, it defines the window for that column: you can group rows using PARTITION BY, and order them using ORDER BY

Let's try it!

ORDER BY - How would we track how the volume of passengers changes each day by departure airport?

There are three parts of grouping to consider. We want:

Count of passengers = we need to use the COUNT function

For each day = means we need to group our count by day

By departure airport = means we also need a grouping that breaks out departure too

First, the typical grouping usage would look something like this

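-- in the notebook, this result is saved as daily_passenger_departures (the table the next query reads from)
SELECT
    travel_date,
    departure_airport,
    -- our aggregation function
    COUNT(passenger_name) AS passenger_ct
FROM airport_traffic
GROUP BY 1, 2
ORDER BY 1, 2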

Remember the number of rows? In an aggregation, we lose grains -> we've gone from 135,369 rows to 220 rows and can't select the details of each passenger and route on each day. Keep this fact in your back pocket for now. Let's stay at this aggregate table and assume we got this as our starting dataset - just a count of departure totals.
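To see the row loss on something small enough to eyeball, here is a minimal sketch, assuming Python's built-in sqlite3 as a stand-in for the notebook's SQL engine. The six passengers are invented; only the column names come from the session.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE airport_traffic ("
    "travel_date TEXT, passenger_name TEXT, departure_airport TEXT)"
)
# Hypothetical passengers: six detail rows across two days and two airports.
rows = [
    ("2023-10-01", "Ada", "DET"),
    ("2023-10-01", "Ben", "DET"),
    ("2023-10-01", "Cy", "DCA"),
    ("2023-10-02", "Dee", "DET"),
    ("2023-10-02", "Eli", "DCA"),
    ("2023-10-02", "Fay", "DCA"),
]
conn.executemany("INSERT INTO airport_traffic VALUES (?, ?, ?)", rows)

# GROUP BY keeps one row per (day, airport) and drops the passenger detail.
grouped = conn.execute(
    """
    SELECT travel_date, departure_airport, COUNT(passenger_name) AS passenger_ct
    FROM airport_traffic
    GROUP BY travel_date, departure_airport
    ORDER BY travel_date, departure_airport
    """
).fetchall()

print(len(rows), "rows in ->", len(grouped), "rows out")
for row in grouped:
    print(row)
```

Six rows go in and four come out: one per day and airport, with no passenger names left to select.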

Now what if we wanted to get the change over time? In the above case, you may create a new column after downloading this. Maybe in Excel with a pivot table. But what if we could get the cumulative number of passengers each day with SQL?! Here come our window functions!

WITH cumulative_sum AS (
    SELECT
        travel_date,
        departure_airport,
        passenger_ct AS passengers_today,
        -- our aggregation function with grouping by departure
        SUM(passenger_ct)
            -- partition by defines the groups of things we want to sum together
            -- order by indicates the sequence of how we sum things up. Default is ASC ordering
            OVER(PARTITION BY departure_airport ORDER BY travel_date) AS passenger_running_total
    FROM daily_passenger_departures
)
SELECT *
FROM cumulative_sum

What did we just do?

We added a new column that cumulatively adds up the total departures out of every airport on our list with each passing day. The preview here shows you that our data is ordered by date on the first column, and with each additional day, the total increases

We still have 220 rows! Unlike last time, we have gone from 220 to ... 220 😂 We added more information without compromising on what we had before (again, I asked you to assume the aggregate table as our starting point for a moment). So you could say, we opened a window and now ... on the same plane of view, we can see more things!!

The PARTITION BY clause sets up the grouping of our sum i.e. we are grouping total departures by airport

The ORDER BY clause allows us to apply the sum to each additional day in ascending order. Ascending order is implicit here: it is the SQL default, and to reverse it you would add DESC after the ordering field

Note: using ORDER BY alone doesn't give you the sum by departure. It would accumulate each day's passengers across all airports, regardless of where they departed from. Therefore we can't skip the partitioning part!
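The difference between the two windows is easiest to see side by side. This is a minimal sketch, assuming Python's built-in sqlite3 as a stand-in for the notebook's SQL engine and a hypothetical four-row version of daily_passenger_departures; the column aliases running_all_airports and running_by_airport are made up for the comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_passenger_departures ("
    "travel_date TEXT, departure_airport TEXT, passenger_ct INTEGER)"
)
# Hypothetical daily counts for two airports over two days.
conn.executemany(
    "INSERT INTO daily_passenger_departures VALUES (?, ?, ?)",
    [
        ("2023-10-01", "DCA", 1),
        ("2023-10-01", "DET", 2),
        ("2023-10-02", "DCA", 2),
        ("2023-10-02", "DET", 1),
    ],
)

comparison = conn.execute(
    """
    SELECT
        travel_date,
        departure_airport,
        passenger_ct,
        -- ORDER BY alone: one running total across every airport
        SUM(passenger_ct) OVER (ORDER BY travel_date) AS running_all_airports,
        -- PARTITION BY + ORDER BY: the running total restarts for each airport
        SUM(passenger_ct) OVER (
            PARTITION BY departure_airport ORDER BY travel_date
        ) AS running_by_airport
    FROM daily_passenger_departures
    ORDER BY departure_airport, travel_date
    """
).fetchall()

for row in comparison:
    print(row)
```

Without the partition, both airports share the same totals (3 after day one, 6 after day two); with it, each airport accumulates only its own passengers.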

You could show more information in this example by adding the destination field to your selection. This does not change the math, it extends the scope of your row. Remember - we are not hiding information with windows, we are improving our vantage point.

Putting it all together

Let's combine the two previous steps into one. We now want to start from our original table of 135,369 rows, and select two values along with each row

A count of the departure airport passengers on each day

A cumulative sum of the departure airport passengers with each passing day

-- first query where we count our groupings by departure and for the day using window function
SELECT
    travel_date,
    passenger_name,
    passenger_age,
    departure_airport,
    destination_airport,
    -- include the grouping for passenger count by day for the departure
    COUNT(1) OVER(PARTITION BY travel_date, departure_airport ORDER BY travel_date) AS passenger_departures_ct,
    -- include the cumulative grouping to increment the passenger count by day
    SUM(1) OVER(PARTITION BY departure_airport ORDER BY travel_date) AS passenger_departures_running_ct
FROM airport_traffic

Notice: you don't have to use a GROUP BY clause with the aggregation above because you are not reducing rows (we are still at 135,369 rows). You are adding more information to each row. Hence the window opening 😆 That means, if you were looking at any single passenger's itinerary on a given day, you will also be able to see how many passengers in total left that same passenger's departure airport (regardless of destination) on that day, and how many have departed so far up to and including that day.
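One detail worth seeing on a small scale: passengers who share a travel date are ties in the ORDER BY, so they all receive the same running total (the day's full count is added at once). Here is a minimal sketch, assuming Python's built-in sqlite3 as a stand-in for the notebook's SQL engine and six invented passengers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE airport_traffic ("
    "travel_date TEXT, passenger_name TEXT, departure_airport TEXT)"
)
# Hypothetical passengers; only the column names come from the session.
conn.executemany(
    "INSERT INTO airport_traffic VALUES (?, ?, ?)",
    [
        ("2023-10-01", "Ada", "DET"),
        ("2023-10-01", "Ben", "DET"),
        ("2023-10-01", "Cy", "DCA"),
        ("2023-10-02", "Dee", "DET"),
        ("2023-10-02", "Eli", "DCA"),
        ("2023-10-02", "Fay", "DCA"),
    ],
)

running_totals = conn.execute(
    """
    SELECT
        travel_date,
        passenger_name,
        departure_airport,
        -- passengers leaving this airport on this day
        COUNT(1) OVER (
            PARTITION BY travel_date, departure_airport ORDER BY travel_date
        ) AS passenger_departures_ct,
        -- passengers who have left this airport up to and including this day
        SUM(1) OVER (
            PARTITION BY departure_airport ORDER BY travel_date
        ) AS passenger_departures_running_ct
    FROM airport_traffic
    ORDER BY departure_airport, travel_date, passenger_name
    """
).fetchall()

for row in running_totals:
    print(row)
```

All six passenger rows survive, and Ada and Ben (same day, same airport) carry identical counts. That is also why selecting DISTINCT date, airport, and the two counts collapses neatly to one row per day.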

‼️ Once again, be mindful to add your fields in the order you want them to be partitioned and ordered.

Now let's select one departure so we can see how the count changes with each day to make sure we got what we wanted

Notice how because I saved my query result above as a variable called running_totals_by_departure, I can now use it as my table selection (thank you, Deepnote ✨)

-- you can check to see your work for a specific airport
SELECT
    -- window functions are not meant to aggregate rows to the point where you don't see the details of them
    -- it is meant for you to get the aggregate value within the smallest grain
    -- note: distinct in production is discouraged, this is to show an example of one departure locale with the running totals
    DISTINCT
    travel_date,
    departure_airport,
    passenger_departures_ct,
    passenger_departures_running_ct
FROM running_totals_by_departure
WHERE departure_airport = 'DET'

We can see for each of our 20 days in October, the count for each day's departure from Detroit, and the running total as the days increment!

Let's try this again with the route (departure - destination combination)

The PARTITION BY clause is extendable by adding more variables in the order with which you want to partition AND order. Now we can group by not just the departure, but also the destination AND keep accumulating by the day. This is a smaller slice to aggregate, but again - we are not changing the number of rows, we are just adding more information to it.

-- first query where we count our groupings by departure x destination
SELECT
    travel_date,
    passenger_name,
    passenger_age,
    departure_airport,
    destination_airport,
    -- include the grouping for passenger count by day for the departure x destination
    COUNT(1) OVER(PARTITION BY travel_date, departure_airport, destination_airport ORDER BY travel_date) AS passenger_route_today_ct,
    -- include the cumulative grouping to increment the passenger count by day
    SUM(1) OVER(PARTITION BY departure_airport, destination_airport ORDER BY travel_date) AS passenger_route_running_month_ct
FROM airport_traffic

Our selection gives us a result table named as running_totals_by_route which shows us how many people traveled from one airport to another on a day, and up to a given (inclusive) day. Below, we select just one route to see that I'm not fibbing

-- check for this cumulative sum and daily count for one route
SELECT
    -- note: distinct in production is discouraged, this is to show an example of one departure/destination locale with the running totals
    DISTINCT
    travel_date,
    departure_airport,
    destination_airport,
    passenger_route_today_ct,
    passenger_route_running_month_ct
FROM running_totals_by_route
WHERE departure_airport = 'DCA'
    AND destination_airport = 'MSP'

We did it!! In the 20 days of data we have, we can see the count of passengers each day, and the incrementing sum as days pass on this route

How can we make this more robust?

While our queries above work great, this is just a sample dataset. You may want to account for robustness and edge cases when you move from a proof of concept to production queries.

Dates: our data only has 20 days worth of data for October 2023, so our running month total doesn't have to check the date

Naming is important - fields and subqueries should have names that other members of your team (and more importantly, your data users) understand

You could have a WINDOW defined for each selection so you can create table indices and views from the same source table

Let's improve this for a wider time series

The below changes demonstrate adding a month-year dimension to account for a bigger dataset, so that our grouping and cumulative sums are accurate and restart at each new month date <cue Bone Thugs N Harmony - 1st of the month>

-- let's account for the date. what if we want the monthly running totals and we have more than October?
WITH travel_month_dataset AS (
    -- same query as before; let's put this in a CTE
    -- first query we set up our travel month variable which we will use in the groupings in our window function
    SELECT
        travel_date,
        -- adding a new dimension that has month and year only for use in partitioning later
        EXTRACT(YEAR FROM travel_date) || '-' || EXTRACT(MONTH FROM travel_date) AS travel_month,
        passenger_name,
        departure_airport,
        destination_airport
    FROM airport_traffic
),
cumulative_route_ct AS (
    SELECT
        *,
        -- include the grouping for passenger count by day, month, and the departure x destination
        COUNT(1) OVER(
            PARTITION BY travel_date, travel_month, departure_airport, destination_airport
            ORDER BY travel_date, travel_month
        ) AS passenger_routes_ct,
        -- include the cumulative grouping to increment the passenger count by day and month
        SUM(1) OVER(
            PARTITION BY travel_month, departure_airport, destination_airport
            ORDER BY travel_date, travel_month
        ) AS passenger_routes_running_ct
    FROM travel_month_dataset
)
-- select every row; in the notebook, this result is saved as inter_month_route_running_totals (the table the route check reads from)
SELECT *
FROM cumulative_route_ct

-- to check for one route, use the query below
-- it will look the same as before, but if we had multiple months, the cumulative values would reset each month
SELECT DISTINCT
    travel_date,
    departure_airport,
    destination_airport,
    passenger_routes_ct,
    passenger_routes_running_ct
FROM inter_month_route_running_totals
WHERE departure_airport = 'DCA'
    AND destination_airport = 'MSP'
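Since our file only covers October, the monthly reset can't be observed on it. Here is a minimal sketch that spans a month boundary, assuming Python's built-in sqlite3 as a stand-in for the notebook's SQL engine. Two things are swapped in and are not the session's own code: the November rows are invented, and SQLite's strftime('%Y-%m', ...) replaces the EXTRACT expression to build travel_month.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE airport_traffic ("
    "travel_date TEXT, passenger_name TEXT, "
    "departure_airport TEXT, destination_airport TEXT)"
)
# Hypothetical DCA -> MSP passengers on either side of the month boundary.
conn.executemany(
    "INSERT INTO airport_traffic VALUES (?, ?, 'DCA', 'MSP')",
    [
        ("2023-10-30", "Ada"),
        ("2023-10-30", "Ben"),
        ("2023-10-31", "Cy"),
        ("2023-11-01", "Dee"),
        ("2023-11-02", "Eli"),
        ("2023-11-02", "Fay"),
    ],
)

monthly_running = conn.execute(
    """
    WITH travel_month_dataset AS (
        SELECT
            travel_date,
            -- SQLite stand-in for the year-month dimension
            strftime('%Y-%m', travel_date) AS travel_month,
            departure_airport,
            destination_airport
        FROM airport_traffic
    )
    SELECT DISTINCT
        travel_date,
        travel_month,
        -- travel_month in the partition makes the total restart each month
        SUM(1) OVER (
            PARTITION BY travel_month, departure_airport, destination_airport
            ORDER BY travel_date
        ) AS passenger_routes_running_ct
    FROM travel_month_dataset
    ORDER BY travel_date
    """
).fetchall()

for row in monthly_running:
    print(row)
```

October closes at 3 passengers on the route, and the count starts over at 1 on November 1st instead of continuing to 4.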

Further Learning

I hope you have enjoyed learning about window functions. To productionize your data queries, you may use the named WINDOW clause to define a specific window once and apply several aggregations through it for your rows. This clause is supported in MySQL (8.0 and later) and in BigQuery. There are other ways to aggregate information for both numerical and string values. This was an introduction, so here are some more resources to expand your learning!

-- Question asked during live session: How would one use the WINDOW function in practice?
-- pseudo code example of what a WINDOW function usage would look like
SELECT
    col_1,
    col_2,
    SUM(col_3) OVER(item_window) AS col_3,
    FIRST_VALUE(col_5) OVER(item_window) AS col_5
FROM tbl
WINDOW item_window AS (PARTITION BY col_1, col_2 ORDER BY col_4_ts DESC)
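For a version of the pseudo code you can actually run, here is a minimal sketch, assuming Python's built-in sqlite3 (which also accepts a named WINDOW clause) as a stand-in for MySQL or BigQuery. The window name airport_window, the alias first_day_passenger_ct, and the four rows are hypothetical; the table shape mirrors daily_passenger_departures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_passenger_departures ("
    "travel_date TEXT, departure_airport TEXT, passenger_ct INTEGER)"
)
# Hypothetical daily counts for two airports over two days.
conn.executemany(
    "INSERT INTO daily_passenger_departures VALUES (?, ?, ?)",
    [
        ("2023-10-01", "DCA", 1),
        ("2023-10-01", "DET", 2),
        ("2023-10-02", "DCA", 2),
        ("2023-10-02", "DET", 1),
    ],
)

named_window_rows = conn.execute(
    """
    SELECT
        travel_date,
        departure_airport,
        -- both columns reuse the single window defined at the bottom
        SUM(passenger_ct) OVER airport_window AS passenger_running_total,
        FIRST_VALUE(passenger_ct) OVER airport_window AS first_day_passenger_ct
    FROM daily_passenger_departures
    WINDOW airport_window AS (PARTITION BY departure_airport ORDER BY travel_date)
    ORDER BY departure_airport, travel_date
    """
).fetchall()

for row in named_window_rows:
    print(row)
```

Defining the window once keeps the partitioning and ordering in a single place, so two columns can't quietly drift apart when one of them is edited later.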

Resources

For further learning, here are some of my favorite resources. Happy adventures in SQL 🥰

Mode Analytics SQL: https://mode.com/sql-tutorial/

MySQL Document Reference: https://dev.mysql.com/doc/refman/8.0/en/window-functions-usage.html

BigQuery Document Reference for navigation (string and numerical value aggregation): https://cloud.google.com/bigquery/docs/reference/standard-sql/navigation_functions

BigQuery Document Reference for WINDOW function syntax: https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls#syntax
