Within-group aggregates with another wrinkle.

Recently when optimising an SQL query I needed to return a specific subset of records from a large table. The brilliant Common Queries Tree pointed me in the right direction, but the examples didn’t quite cover my specific case.

I needed the records with the maximal value of x and the minimal value of y for that x.

For example, suppose you had a table of second hand books and each book has a quality and price. There are many copies of each book but you’re only interested in the best examples of each book and at the lowest price for that quality.


CREATE TABLE Books( id int,
ISBN varchar(10),
Quality int,
price decimal(6,2)
);
INSERT INTO Books VALUES
( 1, '0618551050', 4, 5.00 ),
( 2, '0618551050', 4, 6.00 ),
( 3, '0618551050', 2, 3.00 ),
( 4, '0786884061', 1, 8.00 ),
( 5, '0786884061', 3, 8.00 );

Here we are only interested in records 1 and 5. The Common Queries Tree describes this as a within-group aggregates problem, and as it notes the correlated subquery is the most obvious solution but it was slow.

I started off with something like this
Select * from Books as b1
where b1.id = (SELECT id from Books where Books.ISBN = b1.ISBN order by Quality desc, Price limit 1);

But with almost a million records in the table, a select on each one was taking forever. After a bit of head scratching I managed to use a left self exclusion join to find the records I wanted.

select b1.* from Books as b1
left join Books as b2 on b1.ISBN = b2.ISBN and b1.Quality < b2.Quality left join Books as b3 on b1.ISBN = b3.ISBN and b1.Quality = b3.Quality and b1.Price > b3.Price
WHERE b2.id is NULL and b3.id is NULL;

Here the idea is to select as b1 the records from Books where its impossible to find a book, b2, that has a higher quality and impossible to find a book, b3 with the same quality as b1 but a lower price. If there was another book with a higher quality then b2.id would not be NULL so those records are excluded. Similarly if a book at the same quality had been found but with a lower price, b3.id would not be null.

This was much faster on my data set but it does produce a slightly different result when two items are indistinguishable.
insert into Books VALUES ( 6, '0786884061', 3, 8.00 );

With that my initial query would return either record 5 or 6 and the new query returns both. So group by b1.ISBN to restore the old behaviour. Once you’re happy, save your query into a view so you don’t have to look at it again.

Got a better solution ? If you can optimise my view you’ll make me very happy.

Alan.

This entry was posted in Geekery and tagged . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *