Improving performance by avoiding LIMIT in large data sets

This is a repost of http://www.king-foo.com/2015/09/improving-performance-by-avoiding-limit-in-large-data-sets/.

Whether you are using MySQL, MongoDB or another data store, you will have some kind of paging mechanism to scroll through your data set.

This mostly boils down to:

Order the data set in a way;
Skip the first X records (the offset);
Select the X following records (the limit);

SELECT * FROM users LIMIT 14500,500;

You get the impression that only a slice of the data set is used/returned, but on the server side, a lot of work has been done.

This works quite well for the beginning of the data set, but performance will quickly drop once you start to skip 100.000 rows and select the next 100 records. For paginating search results this is acceptable, because most users won’t get to page 6, but for iterating over a complete data set, this is disastrous.

The solution

Instead of using paging, you can implement some kind of scrolling or cursoring.

If you look at Twitter, they use a max_id to enable this and Elasticsearch has a scan and scroll mechanism for this.

Now you have two options to implement this for your local database:

Completely do the selection with a WHERE clause;
Or combine a WHERE clause with a LIMIT;

WHERE clause

The easiest way to scroll through your data set, is to call the MAX(id), and fetch pages until you reach that max id.

SELECT MAX(id) FROM users;
SELECT * FROM users WHERE id >= 0 AND id < 500;
...
SELECT * FROM users WHERE id >= 14500 AND id < 15000;
...

Using this approach, you can only stop fetching once you reach the max id, because some of the pages can be empty..

This approach does not need an ORDER clause, because you set the range you want back.

WHERE clause combined with LIMIT

This approach can be used to display pagination on a website, because you cannot have empty pages.

SELECT COUNT(*) FROM users; --- Optional, but needed if you want to show how many pages you have
SELECT * FROM users LIMIT 500 ORDER BY id; --- Last ID is here 543
SELECT * FROM users WHERE id > 543 ORDER BY id LIMIT 500 ORDER BY id; --- Last ID is here 2003
SELECT * FROM users WHERE id > 2003 ORDER BY id LIMIT 500;
...

The drawback here is that you have to sort the data set, but this is still much quicker that the OFFSET,LIMIT approach.