How to find the top / bottom "N" rows in each group?
This question is more commonly seen that you think:
- Out of a list of twitter users, find their N recent most tweets
- Out of all our salespeople, find the top N performing ones by closed business in each region
- Out of all the cities, find the top N cities in each country by population
Conside a table cities with three columns country, city and population:
United States | New York | 8175133 |
United States | Los Angeles | 3792621 |
United States | Chicago | 2695598 |
France | Paris | 2181000 |
France | Marseille | 808000 |
France | Lyon | 422000 |
United Kingdom | London | 7825300 |
United Kingdom | Birmingham | 1016800 |
United Kingdom | Leeds | 770800 |
Top N without Grouping
We have 9 cities in our table, let's select the top 3 without grouping.
United States | New York | 8175133 |
United Kingdom | London | 7825300 |
United States | Los Angeles | 3792621 |
Top N with Grouping with row_number[]
Fortunately, we have access to the window function row_number[] that numbers the current row within its partition [group], counting at 1.
When we don't specify any partition:
United States | New York | 8175133 | 1 |
United Kingdom | London | 7825300 | 2 |
United States | Los Angeles | 3792621 | 3 |
United States | Chicago | 2695598 | 4 |
France | Paris | 2181000 | 5 |
United Kingdom | Birmingham | 1016800 | 6 |
France | Marseille | 808000 | 7 |
United Kingdom | Leeds | 770800 | 8 |
France | Lyon | 422000 | 9 |
When we create a partition for each country:
select country, city, population, row_number[] over [partition by country order by population desc] as country_rank from cities;France | Paris | 2181000 | 1 |
France | Marseille | 808000 | 2 |
France | Lyon | 422000 | 3 |
United Kingdom | London | 7825300 | 1 |
United Kingdom | Birmingham | 1016800 | 2 |
United Kingdom | Leeds | 770800 | 3 |
United States | New York | 8175133 | 1 |
United States | Los Angeles | 3792621 | 2 |
United States | Chicago | 2695598 | 3 |
Almost there!
Now to select the top 2 cities from each country, we wrap the above query in another query and filter the row on the country_rank column.
select * from [ select country, city, population, row_number[] over [partition by country order by population desc] as country_rank from cities] ranks where country_rank