Top 10 MySQL Best Practices

So, O’Reilly’s ONLamp.com has published the “Top 10 MySQL Best Practices” at http://www.onlamp.com/pub/a/onlamp/2002/07/11/MySQLtips.html. Sadly, I find most “best practice” lists do not explain the “why” thoroughly enough for people to make their own decisions.

For instance, #3 is “Protect the MySQL installation directory from access by other users.” I was intrigued by what they would consider the “installation” directory. Reading the tip, they actually mean the data directory. They say nothing of the log directory, nor that InnoDB data files may live in different places than the standard MyISAM data directories.

They perpetuate a myth in #4, “Don’t store binary data in MySQL.” What they really mean is “don’t store large data in MySQL”, which they go into in the tip. While it’s true that there is very little benefit to having binary data in a database, they don’t go into what those benefits are. This means that people can’t make informed decisions, just “the best practice is this so I’m doing it.”

The benefit of putting binary data in MySQL is to be able to associate metadata and other data. For instance, “user 200 owns file 483”. If user 200 is gone from the system, how can you make sure file 483 is as well? There’s no referential integrity unless it’s in the database. While it’s true that in most cases people would rather sacrifice the referential integrity for things like faster database backups and easier partitioning of large data objects, I believe in giving people full disclosure so they can make their own informed decision.
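
To make that concrete, here is a minimal sketch (hypothetical table and column names, assuming InnoDB so that foreign keys are actually enforced) of how keeping the file in the database lets referential integrity do the cleanup for you:

CREATE TABLE users (
  user_id INT UNSIGNED NOT NULL PRIMARY KEY
) ENGINE=InnoDB;

CREATE TABLE user_files (
  file_id  INT UNSIGNED NOT NULL PRIMARY KEY,
  user_id  INT UNSIGNED NOT NULL,
  contents MEDIUMBLOB NOT NULL,
  FOREIGN KEY (user_id) REFERENCES users (user_id) ON DELETE CASCADE
) ENGINE=InnoDB;

-- Removing user 200 automatically removes every file that user owned:
DELETE FROM users WHERE user_id = 200;

If the files lived on a filesystem instead, that DELETE would have to be paired with application code or a cleanup job to remove the orphaned files.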

#5 is my biggest pet peeve. “Stick to ANSI SQL,” with the goal being to be able to migrate to a different platform without having to rewrite the code. Does anyone tell Oracle folks not to use PL/SQL features like collections? Nobody says “SQL is a declarative language and PL/SQL is procedural, therefore you should never use it.” How about telling SQL Server folks not to use Transact-SQL statements like WAITFOR? MATCH… AGAINST is not standard SQL, so I should never use it?
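
For what it’s worth, here is a hedged sketch (hypothetical table; it assumes a FULLTEXT index, which MyISAM supported at the time) of the kind of MySQL-specific feature this tip would have you give up:

CREATE TABLE articles (
  article_id INT UNSIGNED NOT NULL PRIMARY KEY,
  body TEXT NOT NULL,
  FULLTEXT KEY ft_body (body)
) ENGINE=MyISAM;

-- Not ANSI SQL, but there is no comparably simple standard way to do relevance-ranked text search:
SELECT article_id FROM articles WHERE MATCH(body) AGAINST ('replication');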

Now, of course, if you’re selling a product to be run on different database platforms, then sure, you want to be platform agnostic. But you’d know that from the start. And if you do have to migrate platforms, you’re going to have to do lots of work anyway, because every platform has its own third-party additions.

And why would *anyone* choose a specific database, and then *not* use those features? I think that it’s a good tip to stick to ANSI SQL if you *know* you want to, or if you have no idea about the DBMS you’re using.

If you want to see how this cripples MySQL, check out Visibone’s SQL chart at: http://www.visibone.com/sql/chart_1200.jpg — you can buy it here: http://sheeri.com/archives/104. I may post later on about my own personal MySQL Best Practices….

Making Queries 45-90 Times Faster!!

aka…..”when good queries go bad!”

So, today the developers were debugging why a script was running much longer than expected. They were inserting text into the database, and got to the point where they realized that doubling the amount of text doubled the time the queries took.

You see, they were doing similar text inserts over and over, instead of using connection pooling and/or batching them. Apparently the other DBA explained that it was a limitation of MySQL, but either the developers didn’t convey what they were doing well, or the DBA didn’t think to mention batching.

I ran a simple test on a test server. I used the command line to connect to a database server on the same machine (even though in QA and production the database is on a separate machine) just to make a point:

Type               Connects  Queries per connect  Length of data transmitted  Time
One-off            1000      1                    619 bytes                   12.232s
Single Connection  1         1000                 604 kilobytes               0.268s
Batch              1         1                    517 kilobytes               0.135s

So 1000 INSERTs using 1 connection is over 45 times faster than 1000 INSERTs using 1000 connections.
Using 1 batch INSERT statement is over 1.75 times faster than using 1 connection.
Using 1 batch INSERT statement is over 90 times faster than 1000 INSERTs using 1000 connections.

Note that while it’s faster to send a batch, if your setup doesn’t support sending 517 kilobytes to the database in a single statement, you’ll want to break the batch into smaller ones. That’s a small coding price to pay for 90x the database performance!!!
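
As a side note, the server-side setting that usually caps how big a single statement can be is max_allowed_packet. A minimal sketch of checking it and of splitting a batch (hypothetical table and values):

-- How large a single statement/packet will the server accept?
SHOW VARIABLES LIKE 'max_allowed_packet';

-- If the full batch would exceed that, send several smaller multi-row INSERTs
-- over the same connection instead of one giant statement:
INSERT INTO foo (col1, col2) VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO foo (col1, col2) VALUES (4, 'd'), (5, 'e'), (6, 'f');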

For reference, the formats used:
One-off:
INSERT INTO foo (col1, col2…) VALUES (val1, val2…);

Single Connection:
INSERT INTO foo (col1, col2…) VALUES (val1, val2…);
INSERT INTO foo (col1, col2…) VALUES (val1a, val2a…);

Batch: INSERT INTO foo (col1, col2…) VALUES (val1, val2…), (val1a, val2a…);

OurSQL Episode 22: Things To Avoid With MySQL Queries

Feature:

Things to Avoid in Queries
Subqueries and Correlated subqueries
http://dev.mysql.com/doc/refman/4.1/en/correlated-subqueries.html

Jan Kneschke’s post on Groupwise Maximum:
http://jan.kneschke.de/projects/mysql/groupwise-max

Calculated comparisons do not use indexes (see the quick sketch just below)

INSERT IGNORE
REPLACE
INSERT…ON DUPLICATE KEY
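
A quick sketch of the “calculated comparisons” point above (hypothetical table and column; rewriting so the indexed column stands alone on one side is the standard fix):

-- An index on order_date cannot be used here: every row must be run through YEAR().
SELECT * FROM orders WHERE YEAR(order_date) = 2007;

-- The equivalent range comparison on the bare column can use the index.
SELECT * FROM orders
WHERE order_date >= '2007-01-01' AND order_date < '2008-01-01';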

Feedback:

Email podcast@technocation.org

call the comment line at +1 617-674-2369

use Odeo to leave a voice mail through your computer:
http://odeo.com/sendmeamessage/Sheeri

or use the Technocation forums:
http://technocation.org/forum

Direct play this episode at:
http://technocation.org/content/oursql-episode-22%3A-things-avoid-mysql-queries

OurSQL Episode 19: MySQL Proxy

Direct play the episode at:
http://technocation.org/content/oursql-episode-19%3A-mysql-proxy-0

Feedback:
http://odeo.com/sendmeamessage/Sheeri

Call the comment line at +1 617-674-2369.
E-mail podcast@technocation.org

News:
MySQL Focuses on Japan
http://www.mysql.com/news-and-events/news/article_1368.html

MySQL Associate Certification Now Available
http://www.mysql.com/certification/

Learning resource:
Pythian Group’s Carnival of the Vanities for the DBA community, published weekly on Fridays.
http://www.pythian.com/blogs/category/log-buffer/

Feature:
MySQL Proxy
Giuseppe Maxia’s Blog:
http://datacharmer.blogspot.com

Getting Started with MySQL Proxy article plus tutorials:
http://www.oreillynet.com/pub/a/databases/2007/07/12/getting-started-with-mysql-proxy.html

public Subversion tree:

http://forge.mysql.com/wiki/mysql_proxy

Tutorials:
Intercept and dump queries (part 1): http://forge.mysql.com/snippets/view.php?id=75

Make macros to map “cd” to “use” and “ls” to “show tables” (part 2): http://forge.mysql.com/snippets/view.php?id=76

Injection Queries (part 3): http://forge.mysql.com/snippets/view.php?id=77

Lua interpreted language:
http://www.lua.org/

OurSQL Episode 18: De-myth-tifying Indexes

Direct play this episode at:
http://technocation.org/content/oursql-episode-18%3A-de-myth-tifying-indexes-0

Feedback:

Email podcast@technocation.org

call the comment line at +1 617-674-2369

use Odeo to leave a voice mail through your computer:
http://odeo.com/sendmeamessage/Sheeri

Or use the Technocation forums:
http://technocation.org/forum

Episode 18: De-Myth-tifying Indexes

Where I have been:
Wedding video (short) and photos:
http://www.sheeri.com/wedding

Honeymoon (and wedding) photos:
http://www.sheeri.com/photos

News:
FREE training for Oracle DBAs who want to learn MySQL
http://www.planetmysql.org/robin/?p=51
http://www.mysql.com/oracle/

mysqlnd (PHP native driver) needs testers and benchmarkers:
http://www.planetmysql.org/kaj/?p=117

Learning Resource:
The MySQL category at howtoforge.com:
http://www.howtoforge.com/taxonomy_menu/1/7

Podcast Promo:
http://www.themysqlguy.com

Feature:

Big O notation:
http://www.eecs.harvard.edu/~ellard/Q-97/HTML/root/node8.html

BTREE Index podcast:
http://technocation.org/content/oursql-episode-13%3A-nitty-gritty-indexes-0

HASH Index podcast:
http://technocation.org/content/oursql-episode-17%3A-hashing-it-out-0

Quote:
http://www.helpothers.org/story.php?sid=6784

Music:
The main theme used is Angry Red Dwarf’s “I Dream About You”
http://tinyurl.com/26hbg6

Smallfish’s “The Thank you song”
http://www.smallfishadventures.com/Home.html

OurSQL Episode 10: How About Some Cache?

This week I talk about the MySQL Query Cache.

Direct play the podcast here:
http://technocation.org/content/oursql-episode-10%3A-how-about-some-cache%3F-0

Subscribe to the podcast by clicking:
http://phobos.apple.com/WebObjects/MZStore.woa/wa/viewPodcast?id=206806301

You can directly download all the OurSQL podcasts at:
http://technocation.org/podcasts/oursql/

Show notes:

Listener Feedback:
http://dev.mysql.com/doc/refman/4.1/en/mysql-commands.html

News:
Daylight Savings Time and how to check your system:
http://sheeri.net/archives/188

There’s not much more time left to register for the MySQL Users Conference & Expo before the $200 early bird discount disappears!
http://www.mysqlconf.com

Learning Resource:

Check out the 2006 MySQL conference presentation slides by the speakers! http://mysqlconf.com/pub/w/45/presentations.html

Feature: How about some cache?
The MySQL Manual has a short, very readable chapter on the Query Cache, which starts here:

http://dev.mysql.com/doc/refman/5.0/en/query-cache.html
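
As a quick companion to the episode, a couple of standard commands (not specific to the show; just a hedged reminder) for seeing whether the query cache is configured and how it is doing:

-- Is the query cache enabled, and how big is it?
SHOW VARIABLES LIKE 'query_cache%';

-- Hit/insert counts, low-memory prunes, free memory:
SHOW STATUS LIKE 'Qcache%';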

MySQL Queues, part II — groups of queues

I believe this is a huge optimization for a heavily implemented Web 2.0 idea.

This article makes simple work of groups of queues. An example of this would be “the most recent 10 people to view an article,” so each article has a queue of up to 10 items in it. This method eliminates the need for multiple SQL statements or using TRIGGERS to check to see if the queue is full.

I bow down to Baron Schwartz, aka Xarpb, for his article on how to implement a queue in SQL:

http://www.xaprb.com/blog/2007/01/11/how-to-implement-a-queue-in-sql/

I am very excited because this also works for groups of objects, and we’re about to implement something at work that needs this idea. The idea of “the most recent x things” or “the top x things” is huge, especially in social networking, and probably one of the most often sought after features.

The biggest issue is that in order to display, say, the most recent posts, a query has to find the time of all the posts and get only the most recent 10. This can be made easy by the logic that the 10 most recent posts are the last 10 rows in the table. Any additional logic can be layered on top, as in “the last 10 rows in the table that are viewable by this guest/login.”

What if you want to track the last 10 people to view the post? Aha, this gets trickier. Convention would say that when a person views a post, you have an SQL transaction that adds the information (person x viewed post y at time z, plus any other info such as browser type, IP, etc.) and, if there are more than 10 entries for that post, deletes the oldest ones until only 10 remain. This transaction could be done via the application code, or via triggers in MySQL 5.0 and up.
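
For contrast, here is a sketch of that conventional approach (hypothetical post_views table); note that it takes several statements per page view:

CREATE TABLE post_views (
  post_id   INT UNSIGNED NOT NULL,
  user_id   INT UNSIGNED NOT NULL,
  viewed_at DATETIME NOT NULL
);

-- Record the view...
INSERT INTO post_views (post_id, user_id, viewed_at) VALUES (483, 200, NOW());

-- ...then count the entries for that post...
SELECT COUNT(*) FROM post_views WHERE post_id = 483;

-- ...and if the count came back greater than 10, trim the oldest row(s):
DELETE FROM post_views WHERE post_id = 483 ORDER BY viewed_at ASC LIMIT 1;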

However, both those methods use multiple SQL queries, and in the case that an article has been viewed fewer than 10 times, the queries are unnecessary. And given each article has a different popularity — some are viewed lots more than others — running multiple queries ends up being a waste of cycles for articles whose last 10 viewers change infrequently.

These commands were tested on MySQL 4.1.19-standard-log. I use REPLACE INTO because it’s shorter than INSERT…SELECT…ON DUPLICATE KEY UPDATE (and yes, neither of those is standard SQL).

Let’s say you have a New Year’s resolution to eat 5 servings of fruits and 5 servings of vegetables per day. The only thing that changes from Baron’s example is that we add a group field (called ‘kind’). The “fruit” field was renamed “food” and still contains the name of what was eaten.

As Baron does, I will use a MySQL-specific command. However, he used INSERT…SELECT…ON DUPLICATE KEY UPDATE, and I will use REPLACE, as its syntax is shorter.

use test;
CREATE TABLE q (
  id int NOT NULL,
  modulo int NOT NULL,
  kind char(1) NOT NULL,
  food varchar(10) NOT NULL,
  PRIMARY KEY (id, kind),
  UNIQUE KEY (modulo, kind)
);

The basic statement is below — I’ve added AS clauses to make the variables clearer. The modulus is, in this case, 5, but in the article case above it would be 10. The “kind” is either “f” or “v”; these are your groups of queues. In this case they stand for “fruits” and “vegetables”, but they might be numbers referring to articles. The “food” stands for the type of food eaten, but in the article scenario would represent the username or user id of the customer viewing the article.

REPLACE INTO q (id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1) AS id,
(COALESCE(MAX(id), -1) + 1) MOD 5 AS modulo,
'f' AS kind,
'apple' AS food
FROM q WHERE kind='f';

mysql> SELECT * FROM q order by kind,id;

id  modulo  kind  food
0   0       f     apple

As expected, 1 "fruit" row.

mysql> REPLACE INTO q(id, modulo, kind, food)
-> SELECT
-> (COALESCE(MAX(id), -1) + 1),
-> (COALESCE(MAX(id), -1) + 1) MOD 5,
-> 'f',
-> 'orange'
-> FROM q WHERE kind='f';
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0

mysql> SELECT * FROM q order by kind,id;

id  modulo  kind  food
0   0       f     apple
1   1       f     orange

As expected, 2 "fruit" rows.


mysql> REPLACE INTO q(id, modulo, kind, food)
-> SELECT
-> (COALESCE(MAX(id), -1) + 1),
-> (COALESCE(MAX(id), -1) + 1) MOD 5,
-> 'v',
-> 'okra'
-> FROM q WHERE kind='v';
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0

mysql> SELECT * FROM q order by kind,id;

id  modulo  kind  food
0   0       f     apple
1   1       f     orange
0   0       v     okra

As expected, 2 "fruit" rows and 1 "vegetable" row. Now, let's quickly populate the fields so the "fruit" group reaches it's maximum of 5.

REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'v',
'squash'
FROM q WHERE kind='v';

REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'peach'
FROM q WHERE kind='f';

REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'cherries'
FROM q WHERE kind='f';

REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'pear'
FROM q WHERE kind='f';

REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'v',
'celery'
FROM q WHERE kind='v';

SELECT * FROM q order by kind,id;

id  modulo  kind  food
0   0       f     apple
1   1       f     orange
2   2       f     peach
3   3       f     cherries
4   4       f     pear
0   0       v     okra
1   1       v     squash
2   2       v     celery

We have 5 values in the “fruit” group and 3 values in the “veggie” group. Now let’s see what happens when another fruit is added:

REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'banana'
FROM q WHERE kind='f';
Query OK, 2 rows affected (0.00 sec)
Records: 1 Duplicates: 1 Warnings: 0

Note that a duplicate has been found! This is because the modulo wrapped around. The id of “banana” is 5, and 5 modulo 5 = 0 – the same as 0 modulo 5, which was the modulo value previously taken by “apple”. So “apple” is pushed off the end of the queue.

SELECT * FROM q order by kind,id;

id  modulo  kind  food
1   1       f     orange
2   2       f     peach
3   3       f     cherries
4   4       f     pear
5   0       f     banana
0   0       v     okra
1   1       v     squash
2   2       v     celery

To find the current list of all fruits, with the most recent fruit first, run:

SELECT * FROM q WHERE kind='f' ORDER BY id DESC;

id  modulo  kind  food
5   0       f     banana
4   4       f     pear
3   3       f     cherries
2   2       f     peach
1   1       f     orange

Let’s get back to the example of page views, though. We probably care about when the pages were viewed, so let’s add a timestamp:

ALTER TABLE q ADD COLUMN fed TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;

I ran the queries again, with some delays, so the timestamps wouldn’t all be the same.

SELECT * FROM q order by kind,id;

id  modulo  kind  food      fed
1   1       f     orange    2007-01-15 14:48:25
2   2       f     peach     2007-01-15 14:48:28
3   3       f     cherries  2007-01-15 14:48:28
4   4       f     pear      2007-01-15 14:48:31
5   0       f     banana    2007-01-15 14:48:34
1   1       v     squash    2007-01-15 14:48:28
2   2       v     celery    2007-01-15 14:48:31
3   3       v     beet      2007-01-15 14:48:31
4   4       v     spinach   2007-01-15 14:48:34
5   0       v     cucumber  2007-01-15 14:48:34

Or, what the query would be in a real system — find all fruits eaten and sort by time, most recent first:

SELECT food, fed FROM q WHERE kind='f' ORDER BY fed DESC;

banana    2007-01-15 14:48:34
pear      2007-01-15 14:48:31
peach     2007-01-15 14:48:28
cherries  2007-01-15 14:48:28
orange    2007-01-15 14:48:25
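
To tie this back to the “last 10 viewers of an article” scenario, here is a sketch with hypothetical names: the same REPLACE pattern, just with a modulus of 10 and the article id playing the role of “kind”.

CREATE TABLE article_viewers (
  id         INT NOT NULL,
  modulo     INT NOT NULL,
  article_id INT NOT NULL,
  user_id    INT NOT NULL,
  viewed_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id, article_id),
  UNIQUE KEY (modulo, article_id)
);

-- Record that user 200 viewed article 483; once 10 slots exist, the oldest is overwritten.
REPLACE INTO article_viewers (id, modulo, article_id, user_id)
SELECT
  COALESCE(MAX(id), -1) + 1,
  (COALESCE(MAX(id), -1) + 1) MOD 10,
  483,
  200
FROM article_viewers WHERE article_id = 483;

-- The last 10 viewers of article 483, most recent first:
SELECT user_id, viewed_at FROM article_viewers WHERE article_id = 483 ORDER BY id DESC;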

edit:
One edge case to be aware of — reaching the limit of the id field.

If your application does 100 of these REPLACEs per second, an UNSIGNED INT id will last only about 1.36 years — not so great. (Use UNSIGNED, since the id is never negative; a signed INT would last just over 8 months at that rate.)

(60 seconds/min, 60 min/hr, 24 hrs/day, 365 days/yr)
4294967295/60/60/24/365/100=1.3619251950

So, for scaling/known high traffic, use a BIGINT. However, in this case, do NOT use UNSIGNED, as all MySQL arithmetic is done with signed BIGINTs or DOUBLEs. Not that it matters much here; at 100 REPLACEs per second, a signed BIGINT won’t wrap for about 2.9 billion years:

9223372036854775807/60/60/24/365/100=2,924,712,086.7753601074

Let’s say your system does 10,000 of these REPLACEs per second, forever (our main database system, where we’re about to use this, averages 6,000 qps — not all writes, but it’s a good figure to use for our own numbers) — move the decimal point a few places and you’re down to running out of numbers in 29 million years.

That’s an OK limit for us. 🙂

Lots of Boolean Values

So, the biggest issue from last month’s Boston MySQL Meetup was “What is the most efficient way to query across many values that are boolean in nature?”

Now, it was a bit more complex than that, as it always is. Values may be set or not set, and people may search across any number of boolean values, or all of them. There are over 1 million entries, and over 40 boolean values!

A few ways of doing so came up:

1) Simply put all the values into separate columns, with a BOOLEAN type (really TINYINT(1))
2) Using ENUMs
3) Using foreign keys to “join tables”
4) Using SET
5) Using bits and bit operations (sketched below)

Using real data, the original questioner and I sought to find the best answer for his case.
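
As one hedged illustration of option 5 above (hypothetical table and bit assignments): each boolean attribute gets its own bit position in a single unsigned integer column.

CREATE TABLE items (
  item_id INT UNSIGNED NOT NULL PRIMARY KEY,
  flags   BIGINT UNSIGNED NOT NULL DEFAULT 0   -- room for up to 64 boolean attributes
);

-- Suppose bit 0 means "is_public" and bit 3 means "is_featured".
-- Set both flags on item 42:
UPDATE items SET flags = flags | (1 | (1 << 3)) WHERE item_id = 42;

-- Find rows where both flags are set (note: a bitwise test like this cannot use a plain index on flags):
SELECT item_id FROM items WHERE flags & (1 | (1 << 3)) = (1 | (1 << 3));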

One TRIGGER Fact, Optimizer Rewriting Stuff

  • One thing about TRIGGERs is that you can have 6 triggers per table: a BEFORE and an AFTER trigger for each of INSERT, UPDATE, and DELETE.

    That was brought up at the User Group meeting.
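
    A minimal sketch of what those six slots look like (a hypothetical table with trivial one-statement bodies, just to show the BEFORE/AFTER and INSERT/UPDATE/DELETE combinations; MySQL 5.0 and up):

    CREATE TABLE t (id INT NOT NULL PRIMARY KEY, val INT);

    CREATE TRIGGER t_bi BEFORE INSERT ON t FOR EACH ROW SET NEW.val = COALESCE(NEW.val, 0);
    CREATE TRIGGER t_ai AFTER INSERT ON t FOR EACH ROW SET @inserts = COALESCE(@inserts, 0) + 1;
    CREATE TRIGGER t_bu BEFORE UPDATE ON t FOR EACH ROW SET NEW.val = COALESCE(NEW.val, 0);
    CREATE TRIGGER t_au AFTER UPDATE ON t FOR EACH ROW SET @updates = COALESCE(@updates, 0) + 1;
    CREATE TRIGGER t_bd BEFORE DELETE ON t FOR EACH ROW SET @deletes_pending = COALESCE(@deletes_pending, 0) + 1;
    CREATE TRIGGER t_ad AFTER DELETE ON t FOR EACH ROW SET @deletes = COALESCE(@deletes, 0) + 1;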

    In other news, I have been reading the MySQL Internals Manual. The most interesting chapter yet is Chapter 3, the optimizer. This can be read out of order, and actually contains lots of information that folks can use. For instance:

    WHERE col1=col2 AND col2='x'
    transforms to
    WHERE col1='x' AND col2='x'

    because constant matching is faster.

    Tautologies and always-false statements are usually optimized out of the query so that something like WHERE 0=0 does not waste time by actually being tested.

    A table with 0 or 1 rows is treated as a constant. Another constant table scenario is WHERE col1=[constant] for all the columns of any UNIQUE or PRIMARY KEY, assuming they are defined as NOT NULL. This is a good reason to use NOT NULL in column definitions unless you really need NULL.

    A constant table is what is used by the optimizer when you see “const” as the “join type” in the result of an EXPLAIN.
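
    A hedged example of that (hypothetical table): with an equality comparison on a NOT NULL primary key, EXPLAIN shows the table resolved with join type “const”, because the optimizer can read that single row before building the rest of the plan.

    CREATE TABLE settings (id INT NOT NULL PRIMARY KEY, name VARCHAR(20) NOT NULL);

    -- At most one row can match, so the optimizer treats settings as a constant table:
    EXPLAIN SELECT name FROM settings WHERE id = 1;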

    WHERE col1 IN (1,2,3) is the exact same expression as WHERE col1=1 OR col1=2 OR col1=3

    WHERE col1 BETWEEN 1 AND 3 is the exact same expression as WHERE col1>=1 AND col1<=3

    The latter two examples are good to know, because the first of the two statements in each case is clearer.

  • Semi-Dynamic Data

    http://www.artfulsoftware.com has a “Common Queries” page, which I find rather useful. I hadn’t realized its history, as described on the homepage:

    Our collection of common MySQL queries outgrew Chapter 9, and is still growing, so we turned it into a PHP page driven from a MySQL table.

    One day, I clicked on the page and got the dreaded “blank” PHP page.

    This gets into one of the fundamental flaws I find with “semi-dynamic data” (my terminology for it) — it is not completely dynamic data, because it gets updated by humans, and it is deterministic*, so it does not need a completely dynamic page.

    Part of the updating process could be a “generate the web page” script that runs the same queries the dynamic page currently runs, but stores the result as a static HTML page. In this way, if 1000 users want the same page, there are *no* database queries done. After all, the page only needs to change when content is uploaded, which isn’t very often at all.

    The “generation” script could easily be a part of a web form that uploads content, or it could be a separate form/script run after a batch of changes is done, so multiple changes do not require generating pages that will just be written over after the next update in a minute or so. As well, it could write to a temporary file, and the very last script action would move the temporary file to the right place. In this way, a generation script that takes a long time to finish would not be partially overwritten by another, simultaneous generation script.

    I have used this technique in content management systems — particularly with templates, as I’ve found you can separate different content items (such as menus) and “break apart” a template into pieces, and in about an hour you can add a new template to an existing system and have a user compare templates to see which they’d rather use, given their own content.

    I have also used this technique with a listing of organizations around the world. All the listings (3,000) were stored in a database. From this, I ran a weekly (but it would be easy to run it hourly or daily) script that made “browse by” pages, categorizing all of the resources by first letter of their name as well as by their area, province/state and country. The script, which took a full 10 minutes due to poor optimization, made an overall browsing page, 26 “by letter” pages, one page for each country, and a directory with one page for each state/province and area for each country. It also generated the page and compared it to the existing page, and only overwrote the page when they differed (and then put a “last updated on:” message at the end).

    Folks searching could still get truly dynamic pages, but I cut down on needless direct database calls to find out which organizations were in “England”, and more needless database calls to find out which organizations were in the “Greater London” area, when those pages change rather rarely.

    This could also be useful for large sites, such as photo sharing galleries. Sites that allow comments may or may not see a performance gain — for each new comment, generating the page again may not be the best solution. However, if there’s at least one page hit for each database write, then using this method will have better performance.

    * an example of a nondeterministic page is one that changes based on the time, such as “show me all activity that has happened today [up until this moment]” because it may change from second to second.
