In my data travels there are a couple SQL functions which have come in so singularly handily that I've decided to dedicate a blog post to them. Those functions are RANK and DENSE_RANK. In this post I'm going to talk about one particular class of use cases where these functions can come to the rescue.
You can find a detailed definition of these functions all over the place (here, for example: https://docs.microsoft.com/en-us/sql/t-sql/functions/dense-rank-transact-sql?view=sql-server-ver15). Here I'll just give a brief summary.
Usage of DENSE_RANK rank takes the following form:
DENSE_RANK ( ) OVER ( [ <partition_by_clause> ] < order_by_clause > )
Given a result set, a partition is applied which divides the results into distinct chunks. An order by clause then provides an ordering within each partition. The function's induced column of data displays a number indicating the relative position in the ordering within a given partition.
In my experience DENSE_RANK is more useful, but it's worth understanding both and their differences. When there are ties with respect to the ordering, both functions will produce a consistent single value for the same set of rows. When the "next" value in the orderings comes up, DENSE_RANK will give a value incremented by 1, while RANK will produce a value as though we've been incrementing throughout the "tied" rows. This is hard to grasp from a written explanation, so lets look at an example.
To see this in detail and understand the use case we're after, lets look at this example table (#Orders in later sample queries). In this table we have some sample sales data of orders to different customers. In this scenario, we're interested in measuring the number of days between various orders. A common need is to look for the maximum days between orders - i.e. the time between the first and last order for a customer.
In the query here we've joined two copies of the #Orders table dubbed ordStart and ordEnd, distinguished by the expressions ordStart.order_date < ordEnd.order_date. Watch out for a cartesian product type effect when doing this out in the wild, since we'll get a row for each pair of orders satisfying the ordStart.order_date < ordEnd.order_date inequality. To get the maximum date difference between orders for a given customer, we can use the datediff function as we've done here, and then use this query as a subquery that we summarize on, by customer, using the MAX function on the datediff(day, ordStart.order_date, ordEnd.order_date) 'days in between' expression. Super simple.
Now, suppose instead that you've been tasked with finding the number of days between the first or last (or even second, third, etc.) orders of specific types or categories of products. Remember that a product may be ordered multiple times by a given customer. All of a sudden, doing a simple MAX summary isn't going to be sufficient. This is where our handy rank functions come in.
Let's imagine a scenario where we want to size up the amount of time between the first order of a "small product" (let's say these are Spam, Coffee Beans, and Artificial Spleens) and the first order of a "big product" (let's call these Loafers, Personalized Greeting from Stanley Tucci, and Absolute Units, why not). Let's take our detail query from earlier and add some ranks to the output.
,rank() over (partition by ordStart.customer order by ordStart.order_id asc)'earlier rank'
,rank() over (partition by ordStart.customer order by ordEnd.order_id asc) 'later rank'
,dense_rank() over (partition by ordStart.customer order by ordStart.order_id asc) 'earlier dense rank'
,dense_rank() over (partition by ordStart.customer order by ordEnd.order_id asc) 'later dense rank'
If we take a look at the output for a specific customer - MacDolans for example - we can understand the difference between the RANK and DENSE_RANK functions. An order for Coffee Beans with order ID 432569 shows up in three rows as the earlier order. The RANK and DENSE_RANK of this column for this customer is 1 in all three rows. The difference comes in the very next row, where the DENSE_RANK for this order is 2 while the RANK is 4. RANK behaves like it's counting the number of rows in the ordering so far, while DENSE_RANK behaves like it's counting the number of distinct values.
There are a lot of rows we're not interested in for our scenario, so let's add the below to our where clause - this limits our results to just pairs where the earlier order is a small product and the later order is a big one.
and ordStart.product in ('Spam', 'Coffee Beans', 'Artificial Spleens') and ordEnd.product in ('Loafers', 'Personalized Greeting from Stanley Tucci', 'Absolute Units')
Finally, as we're looking for just the first small order to the first big order, we can just add one more line to get what we need.
where details.[earlier dense rank] = details.[later dense rank]
Now we've got pairs where both the earlier and later order are the nth small and big order respectively. In this example, the three of the pairs are cases where the earlier and later order are the first small and big order, with the Coffee Beans and Loafers pair of orders by MacDolans being a case where each was the second order in the desired category. If we wanted to just get the cases of the first order in each category, we could change our condition to restrict both DENSE_RANK values to be 1. If we were looking to measure the times between the last small & big order, we could summarize on the maximum DENSE_RANK of our results.
And that's it! Thanks for reading. SQL is included below for ease of copy & pasting in case you're interested.
--DROP TABLE #Orders
CREATE TABLE #Orders
(
order_id INT PRIMARY KEY,
product VARCHAR(50) NOT NULL,
customer VARCHAR(50) NOT NULL,
order_date DATETIME
)
INSERT INTO #Orders
Values
(124536, 'Personalized Greeting from Mark Strong', '9gag', '2012-05-29'),
(403698, 'Coffee Beans', '9gag', '2013-10-06'),
(624057, 'The Meaning of Liff', '9gag', '2020-01-14'),
(156247, 'Absolute Units', 'Business Corp', '2015-03-09'),
(156248, 'Personalized Greeting from Stanley Tucci', 'Business Corp', '2015-03-09'),
(659714, 'Spam', 'Business Corp', '2015-04-05'),
(659821, 'Loafers', 'Business Corp', '2018-01-17'),
(745215, 'Spam', 'Douglas', '2018-06-19'),
(589123, 'Spam', 'Douglas', '2018-02-19'),
(851736, 'Donkey Teeth', 'Douglas', '2018-12-14'),
(956841, 'Donkey Teeth', 'Douglas', '2019-10-11'),
(432569, 'Coffee Beans', 'MacDolans', '2013-07-05'),
(546289, 'Coffee Beans', 'MacDolans', '2014-06-01'),
(659421, 'Personalized Greeting from Stanley Tucci', 'MacDolans', '2019-04-23'),
(785496, 'Loafers', 'MacDolans', '2019-08-21'),
(145392, 'The Meaning of Liff', 'MacDondals', '2016-06-22'),
(659314, 'Absolute Units', 'MacDondals', '2017-08-08'),
(587263, 'Artificial Spleens', 'Margaret''s Company', '2013-09-10'),
(645103, 'Artificial Spleens', 'Margaret''s Company', '2015-09-20'),
(704823, 'Corn Chomps', 'Margaret''s Company', '2014-11-05'),
(985670, 'Spam', 'Margaret''s Company', '2016-02-28'),
(903214, 'Personalized Greeting from Stanley Tucci', 'Margaret''s Company', '2017-07-27');
select * from #Orders;
with details as
(selectordStart.order_id'earlier order'
,ordEnd.order_id'later order'
,ordStart.product'earlier product'
,ordEnd.product'later product'
,ordStart.customer
,ordStart.order_date'earlier order date'
,ordEnd.order_date'later order date'
,datediff(day, ordStart.order_date, ordEnd.order_date)'days in between'
from #Orders ordStart
inner join #Orders ordEnd on ordStart.customer = ordEnd.customer
where ordStart.order_date < ordEnd.order_date)
select
summary.customer
,summary.[max days in between orders]
,details.[earlier product]
,details.[later product]
,details.[earlier order date]
,details.[later order date]
from
(select
details.customer,
max([days in between])'max days in between orders'
from
details
group by
details.customer) summary
inner join details on (summary.customer = details.customer) and (summary.[max days in between orders] = details.[days in between])
order by summary.customer;
select * from
(select
ordStart.order_id'earlier order'
,ordEnd.order_id'later order'
,ordStart.product'earlier product'
,ordEnd.product'later product'
,ordStart.customer
,ordStart.order_date'earlier order date'
,ordEnd.order_date'later order date'
,rank() over (partition by ordStart.customer order by ordStart.order_id asc)'earlier rank'
,rank() over (partition by ordStart.customer order by ordEnd.order_id asc)'later rank'
,dense_rank() over (partition by ordStart.customer order by ordStart.order_id asc)'earlier dense rank'
,dense_rank() over (partition by ordStart.customer order by ordEnd.order_id asc)'later dense rank'
,datediff(day, ordStart.order_date, ordEnd.order_date)'days in between'
from #Orders ordStart
inner join #Orders ordEnd on ordStart.customer = ordEnd.customer
and ordStart.product in ('Spam', 'Coffee Beans', 'Artificial Spleens') and ordEnd.product in ('Loafers', 'Personalized Greeting from Stanley Tucci', 'Absolute Units')
where ordStart.order_date < ordEnd.order_date) details
where details.[earlier dense rank] = details.[later dense rank]
order by
details.customer
,details.[earlier rank]