Tutorial: SQL Analytic Functions



SQL analytic ("window") functions compute a value for each row of input based on other rows, collectively known as a "window". The window is defined using OVER and can be refined using PARTITION BY, ORDER BY, and ROWS/RANGE BETWEEN. The value can be from a numbering function, navigational function, or an aggregate analytic function.

To learn more, check out my Youtube video. For a text reference, BigQuery's article "Analytic function concepts in Standard SQL" is a great resource. Below, you can find the text referenced in the Youtube video, as well as some bonus content not in the video.

Types of analytic functions (naming comes from the BigQuery article):
  • Numbering functions - assign a number to the given row relative to other rows in the partition. e.g. RANK(), ROW_NUMBER(), DENSE_RANK().
  • Navigational functions - locate a row relative to the given row. e.g. LEAD(other row's column), LAG, FIRST_VALUE, LAST_VALUE.
  • Aggregate analytic functions - aggregate on the given row's window. e.g. SUM(), COUNT().

Window keywords:
  • OVER - run the function over a window defined as:
    • PARTITION BY - divide the input into sets of related rows based on shared column value(s). Only include neighbors of the given row's set in the window.
    • ORDER BY - order the rows in the set.
    • ROWS/RANGE BETWEEN - adjust the size of the window.
      • UNBOUNDED - go to the very beginning/end
      • PRECEDING - before the given row
      • FOLLOWING - after the given row
      • ROWS - number of rows
      • RANGE - amount from the column listed in ORDER BY

Youtube Text Companion

To run the queries for yourself, see "Testing Setup" below.

Base Table

For the following examples, we will use this table. I've provided the corresponding SQL code below.

Name Age Bucket Weight
Alex Adult 190
Brooke Adult 132
Casey Child 87
Danny Child 55
Ellen Child 32
Fanny Child 55
CREATE TEMP TABLE Weights AS (
  SELECT 'Alex' as name, 'adult' as age_bucket, 'boy' as gender, 190 as weight
  UNION ALL SELECT 'Brooke', 'adult', 'girl', 132
  UNION ALL SELECT 'Casey', 'child', 'boy', 87
  UNION ALL SELECT 'Danny', 'child', 'girl', 55
  UNION ALL SELECT 'Ellen', 'child', 'boy', 32
  UNION ALL SELECT 'Fanny', 'child', 'boy', 55
);

Example 1: Numbering Functions (Rank)

Numbering functions assign a number to each row within its partition.

In the example below, we partition the input into two sets ("Adult" rows and "Child" rows), then order the rows by weight within each set. Finally, we call the numbering function on each row, informed by the window the row belongs to.

RANK() skips numbers if there are ties. DENSE_RANK() does not. ROW_NUMBER() simply assigns a number to each row in ascending order regardless of ties.

Window sizing keywords (ROWS/RANGE BETWEEN) cannot be used for numbering functions.

An analytic function call can be treated as a value. You can give it an alias. After FROM, you can ORDER BY and LIMIT the resulting rows.

NameAge BucketWeightRANK()ROW_NUMBER()DENSE_RANK()
AlexAdult190111
BrookeAdult132222
CaseyChild87111
DannyChild55222
EllenChild32443
FannyChild55232

SELECT
  *,
  RANK() OVER (PARTITION BY age_bucket ORDER BY weight DESC),
  ROW_NUMBER() OVER (PARTITION BY age_bucket ORDER BY weight DESC),
  DENSE_RANK() OVER (PARTITION BY age_bucket ORDER BY weight DESC)
FROM Weights
ORDER BY name;

Example 2: Aggregate Analytic Functions (Cumulative Sum)

A cumulative sum at each row is the row's weight and the previous rows. After partitioning by age bucket, we can express this "cumulative" property by defining a window from the start to the current row, then computing the sum.

Note how we order by weight descending before applying SUM() on the window. Casey is first at 87. Danny's window includes Casey and Danny himself, so SUM(Casey, Danny) is 87+55. Ellen is last, so the sum of Ellen's window is 87+55+55+32.

NameAge BucketWeightCumulative Sum
AlexAdult190190
BrookeAdult132322
CaseyChild8787
DannyChild55142
EllenChild32229
FannyChild55197

SELECT
  *,
  SUM(weight) OVER (
    PARTITION BY age_bucket
    ORDER BY weight DESC
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  )
FROM Weights;
Alternatively, you can write ROWS UNBOUNDED PRECEDING, which is shorthand for ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
SELECT
  *,
  SUM(weight) OVER (
    PARTITION BY age_bucket
    ORDER BY weight DESC
    ROWS UNBOUNDED PRECEDING
  )
FROM Weights;

Example 3: Rows Between (Moving Average)

Here, we explore ROWS BETWEEN further. Also note that PARTITION BY is optional if you don't want to split your input into separate sets. Imagine the pain of using joins and group by to get the same information!

NameAge BucketWeightMoving Average No Partition
AlexAdult190161
BrookeAdult132136.3
CaseyChild8791.3
DannyChild5558
EllenChild3247.3
FannyChild5543.5
SELECT
  *,
  AVG(weight) OVER (
    ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
  )
FROM Weights;

Example 4: Range Between

Instead of counting physical rows, you can define a window by a particular numeric value using RANGE BETWEEN. Danny's window is [55-30, 55+25] = [25, 80]. Ellen and Fanny are in that window.

NameAge BucketWeight# within 30
AlexAdult1901
BrookeAdult1321
CaseyChild871
DannyChild553
EllenChild323
FannyChild553
SELECT
  *,
  COUNT(1) OVER (
    PARTITION BY age_bucket
    ORDER BY weight
    RANGE BETWEEN 30 PRECEDING AND 25 FOLLOWING
  )
FROM Weights;

Example 5: Navigation Function (Lead)

What if you want to examine the value of a particular neighbor in a row's window? That's where navigation functions come in. The four basic ones are FIRST_VALUE, LAST_VALUE, LEAD, and LAG. These functions require you to provide ORDER BY.

In this example, we partition by age bucket and order each set by weight ascending and choose LEAD. LEAD gives you the next row in the window. After you call lead, you should provide the column to extract from that row. We extract name here.

The order of the adult set is [Brooke: 132, Alex: 190]. Brooke's lead is Alex. Alex does not have a lead, because Alex is ordered last in the set.

NameAge BucketWeightHeavier in Age Bucket
AlexAdult190---
BrookeAdult132Alex
CaseyChild87---
DannyChild55Fanny
EllenChild32Danny
FannyChild55Casey
SELECT
  *,
  LEAD(name) OVER (
    PARTITION BY age_bucket
    ORDER BY weight
  )
FROM Weights;

Bonus Content (not in video)

Analytic Functions vs. Group By

GROUP BY defines groups, then computes functions on those groups. The end result is one row per group.

On the other hand, analytic functions compute a value for each individual row in the input. Of course, analytic functions are more meaningful when you define a window.
SELECT
  age_bucket,
  SUM(weight)
FROM Weights
GROUP BY 1;

Age BucketWeight
Adult322
Child229
SELECT
  name,
  age_bucket,
  SUM(weight) OVER (PARTITION BY age_bucket) AS total
FROM Weights;
NameAge BucketTotal
AlexAdult322
BrookeAdult322
CaseyChild229
DannyChild229
EllenChild229
FannyChild229

Analytic Functions and the WHERE Clause

WHERE applies before analytic function are computed. Compare the following:
SELECT
  name,
  age_bucket,
  SUM(weight) OVER (PARTITION BY age_bucket) AS total
FROM Weights;
NameAge BucketTotal
AlexAdult322
BrookeAdult322
CaseyChild229
DannyChild229
EllenChild229
FannyChild229

SELECT
  name,
  age_bucket,
  SUM(weight) OVER (PARTITION BY age_bucket) AS total
FROM Weights
WHERE name LIKE '%y';
NameAge BucketTotal
CaseyChild197
DannyChild197
FannyChild197

Analytic Functions and Subqueries

Analytic functions can be called with a FROM clause that is a subquery. We can rewrite the query in the previous section as:
SELECT
  name,
  age_bucket,
  SUM(weight) OVER (PARTITION BY age_bucket) AS total
FROM (SELECT * FROM Weights WHERE name LIKE '%y');
NameAge BucketTotal
CaseyChild197
DannyChild197
FannyChild197

Real-World Application: CTR Below Your Feature

Let's say you're the head editor at Best News Corp. Your website is a simple list of stories, and you want to see how well your fashion stories are performing relative to other stories. One useful measure is to see whether users click on anything below the highest fashon story on the page.

SELECT
  category,
  RANK() OVER (PARTITION BY category ORDER BY position) AS category_rank,
  SUM(interaction) OVER (
      ORDER BY position
      ROWS BETWEEN 1 FOLLOWING AND UNBOUND FOLLOWING
  ) > 0 AS clickedSomethingBelow
FROM page_views
WHERE category = 'Fashion' AND category_rank = 1;

Conclusion

Thank you for making it this far. Again, there are many great write-ups of SQL analytic functions elsewhere, but these examples should help you understand them better. See you in the next tutorial!

Testing Setup

I use Google Cloud's BigQuery to run SQL queries. BigQuery offers some public datasets, but you can also load hardcoded tables, such as the example below, and execute queries on them for free.



Comments

Popular posts from this blog

Learning About My Brain From Typing Z-A

Lessons from Teaching My SAT Reading Series at schoolhouse.world