I believe this is a huge optimization for a heavily implemented Web 2.0 idea.
This article makes simple work of groups of queues. An example of this would be “the most recent 10 people to view an article,” so each article has a queue of up to 10 items in it. This method eliminates the need for multiple SQL statements or using TRIGGERS to check to see if the queue is full.
I bow down to Baron Schwartz, aka Xarpb, for his article on how to implement a queue in SQL:
http://www.xaprb.com/blog/2007/01/11/how-to-implement-a-queue-in-sql/
I am very excited because this also works for groups of objects, and we’re about to implement something at work that needs this idea. The idea of “the most recent x things” or “the top x things” is huge, especially in social networking, and probably one of the most often sought after features.
The biggest issue is that in order to display, say, the most recent posts, a query has to find the time of all the posts and only get the most recent 10. This can be made easy by the logic that the 10 most recent posts are the last 10 rows in the table. Any logic is also added, as in “the last 10 rows in the table viewable and for this guest/login.”
What if you want to track the last 10 people to view the post? Aha, this gets trickier. Convention would say that when a person views a post, have an SQL transaction that adds the information (person x viewed post y at time z and anyo other info, such as browser type, IP, etc) and if there are more than 10 entries for that post, delete the oldest ones until you have 10 entries. This transaction could be done via the application code or via triggers in MySQL 5.0 and up.
However, both those methods use multiple SQL queries, and in the case that an article has been viewed fewer than 10 times, the queries are unnecessary. And given each article has a different popularity — some are viewed lots more than others — running multiple queries ends up being a waste of cycles for articles whose last 10 viewers change infrequently.
These commands were tested on MySQL 4.1.19-standard-log. I use REPLACE INTO because it’s shorter than SELECT…ON DUPLICATE KEY UPDATE, and yes, those aren’t
Let’s say you have a New Year’s Resolution to eat 5 servings of fruits and 5 servings of vegetables per day. The only thing that changes from Baron’s example is that we add a group field (called ‘kind’). The “fruit” field was changed to “edible” and will still contain the name of the edible.
As Baron does, I will use a MySQL-specific command. However, he used SELECT...ON DUPLICATE KEY
and I will use REPLACE
, as it is smaller in syntax.
use test;
CREATE TABLE q (
id int NOT NULL,
modulo int NOT NULL,
kind char(1) NOT NULL,
food varchar(10) NOT NULL,
PRIMARY KEY(id,kind),
UNIQUE KEY(modulo,kind)
);
The basic statement is below — I’ve added AS
clauses to make the variables more clear. The modulus is, in this case, 5, but in the article case above would be 10. The “kind” is either “f” or “v”, these are your groups of queues. In this case they stand for “fruits” and “vegetables” but they might be numbers referring to articles. The “food” stands for the type of food eaten, but in the article scenario would represent the username or user id of the customer viewing the article.
REPLACE INTO q (id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1) AS id,
(COALESCE(MAX(id), -1) + 1) MOD 5 AS modulo,
'f' AS kind,
'apple' AS food
FROM q WHERE kind='f';
mysql> SELECT * FROM q order by kind,id;
id | modulo | kind | food |
0 | 0 | f | apple |
As expected, 1 "fruit" row.
mysql> REPLACE INTO q(id, modulo, kind, food)
-> SELECT
-> (COALESCE(MAX(id), -1) + 1),
-> (COALESCE(MAX(id), -1) + 1) MOD 5,
-> 'f',
-> 'orange'
-> FROM q WHERE kind='f';
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM q order by kind,id;
id | modulo | kind | food |
0 | 0 | f | apple |
1 | 1 | f | orange |
As expected, 2 "fruit" rows.
mysql> REPLACE INTO q(id, modulo, kind, food)
-> SELECT
-> (COALESCE(MAX(id), -1) + 1),
-> (COALESCE(MAX(id), -1) + 1) MOD 5,
-> 'v',
-> 'okra'
-> FROM q WHERE kind='v';
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM q order by kind,id;
id | modulo | kind | food |
0 | 0 | f | apple |
1 | 1 | f | orange |
0 | 0 | v | okra |
As expected, 2 "fruit" rows and 1 "vegetable" row. Now, let's quickly populate the fields so the "fruit" group reaches it's maximum of 5.
REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'v',
'squash'
FROM q WHERE kind='v';
REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'peach'
FROM q WHERE kind='f';
REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'cherries'
FROM q WHERE kind='f';
REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'pear'
FROM q WHERE kind='f';
REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'v',
'celery'
FROM q WHERE kind='v';
SELECT * FROM q order by kind,id;
id | modulo | kind | food |
0 | 0 | f | apple |
1 | 1 | f | orange |
2 | 2 | f | peach |
3 | 3 | f | cherries |
4 | 4 | f | pear |
0 | 0 | v | okra |
1 | 1 | v | squash |
2 | 2 | v | celery |
We have 5 values in the “fruit” group and 3 values in the “veggie” group. Now let’s see what happens when another fruit is added:
REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'banana'
FROM q WHERE kind='f';
Query OK, 2 rows affected (0.00 sec)
Records: 1 Duplicates: 1 Warnings: 0
Note that a duplicate has been found! This is because the modulo wrapped around. The id of “banana” is 5, and 5 modulo 5 = 0
– the same as 0 modulo 5
, which was the modulo value previously taken by “apple”. So “apple” is pushed off the end of the queue.
SELECT * FROM q order by kind,id;
id | modulo | kind | food |
1 | 1 | f | orange |
2 | 2 | f | peach |
3 | 3 | f | cherries |
4 | 4 | f | pear |
0 | 5 | f | banana |
0 | 0 | v | okra |
1 | 1 | v | squash |
2 | 2 | v | celery |
To find the current list of all fruits, with the most recent fruit first, run:
SELECT * FROM q WHERE kind='f' ORDER BY id DESC;
id | modulo | kind | food |
1 | 1 | f | orange |
2 | 2 | f | peach |
3 | 3 | f | cherries |
4 | 4 | f | pear |
0 | 5 | f | banana |
Let’s get back to the example of page views, though. We probably care about when the pages were viewed, so let’s add a timestamp:
ALTER TABLE q ADD COLUMN fed TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;
I ran the queries again, with some delays, so the timestamps wouldn’t all be the same.
SELECT * FROM q order by id,kind;
id | modulo | kind | food | fed |
1 | 1 | f | orange | 2007-01-15 14:48:25 |
2 | 2 | f | peach | 2007-01-15 14:48:28 |
3 | 3 | f | cherries | 2007-01-15 14:48:28 |
4 | 4 | f | pear | 2007-01-15 14:48:31 |
5 | 0 | f | banana | 2007-01-15 14:48:34 |
1 | 1 | v | squash | 2007-01-15 14:48:28 |
2 | 2 | v | celery | 2007-01-15 14:48:31 |
3 | 3 | v | beet | 2007-01-15 14:48:31 |
4 | 4 | v | spinach | 2007-01-15 14:48:34 |
5 | 0 | v | cucumber | 2007-01-15 14:48:34 |
Or, what the query would be in a real system — find all fruits eaten and sort by time, most recent first:
SELECT food,fed FROM q WHERE kind=’f’ ORDER BY fed DESC;
banana | 2007-01-15 14:48:34 |
pear | 2007-01-15 14:48:31 |
peach | 2007-01-15 14:48:28 |
cherries | 2007-01-15 14:48:28 |
orange | 2007-01-15 14:48:25 |
edit:
One edge case to be aware of — reaching the limit of the id field.
If your application does 100 of these per second, an UNSIGNED INT will last 1.36 years — not so great. You should use an UNSIGNED INT because you’re never going to have a negative number, and a signed int would only last just over 8 months if there were 100 REPLACEs per second.
(60 seconds/min, 60 min/hr, 24 hrs/day, 365 days/yr)
4294967295/60/60/24/365/100=1.3619251950
So, for scaling/known high traffic, use a BIGINT. However, in this case, do NOT use UNSIGNED, as all MySQL arithmetic is done with signed bigints or doubles. Not that it matters in this case; at 100 REPLACEs per second, you will wrap at 2.9 billion years:
9223372036854775807/60/60/24/365/100=2,924,712,086.7753601074
Let’s say your system does 10,000 of these REPLACEs per second, for eternity (our main database system, where we’re about to use this, average 6,000 qps, not all writes, but it’s a good figure to use for our own numbers) — move the decimal places a few spots over and you’re down to running out of numbers in 29 million years.
That’s an OK limit for us. 🙂