Log Buffer #72 — a Carnival of the Vanities for DBAs

Welcome to the 72nd edition of Log Buffer, the weekly review of database blogs.

Oracle OpenWorld (OOW) is over, and Lucas Jellema of the AMIS Technology blog notes the OOW Content Catalog has been updated with most of the presentations available for download.

On his way home from OOW, Chris Muir of the appropriately titled One Size Doesn’t Fit All blog notes how OOW and the Australian Oracle User Group Conference compare, with 99% fewer attendees at the AUSOUG Perth conference – from 45k down to 350.

Mark Rittman of Rittman Mead Consulting summarizes OOW’s impact on business intelligence and data warehousing in Reflections on Oracle’s BI Strategy. On his way home, Mark found time for A First Look at Oracle OLAP 11g, noting the pros, cons, gotchas and suggestions for improvement for many useful new features.

Microsoft SQL Server also has a new release in the works. Ted Malone in Agile Methods for the DB Dev is excited about SQL Server 2008 “Katmai” CTP 5 New Features and describes almost 20 of them.

Ian Barwick of PostgreSQL Notes talks about Converting tsearch2 to 8.3 now that the tsearch2 full text search engine has been integrated as a core PostgreSQL feature.

Patrick Barel of the Bar Solutions Weblog explains a new feature of Oracle 11g called Virtual Columns. While virtual data may be a new topic, using databases on virtual machines is an ongoing issue. Marco Russo of SQL BI gives his opinion on when to use virtual machines in SQL Server Virtualization.

Database professionals can be real characters, and set in their ways. Bad puns make good transitions, and Corrado Pandiani sheds light on MySQL’s rules for Charsets and Collations on Multicolumn Fulltext Indexes. Adam Douglas of Binary Expressions fixed some trouble with MySQL and French Characters not rendering properly.

Greg Sabino Mullane shows reasons for his Problems with pl/perl and UTF-8. In Tending the Garden, Selena Deckelmann goes through the very easy process of Automatic Character Set Conversion in PostgreSQL. Selena has also been busy organizing the development of ptop, an interactive, command-line tool for monitoring the current status of a PostgreSQL database. If you read this in time and are in the Portland, Oregon area you can join the ptop hackathon at noon (local time) tomorrow, Saturday November 24th, or you can read the ptop meeting summary from pdxpug.

While some of us are database tools, some of us prefer to contribute database tools. Baron Schwartz honors MySQL’s trademark by announcing that MySQL Toolkit is now Ma’atkit. Ma’at, pronounced “mott”, is the ancient Egyptian patron saint of truth, harmony and order. In addition, Baron proclaims “Ma’atkit Version 1297 Released!”

Hubert Lubaczewski notes the changes to the analyze.pgsql.logs.pl script of pgsql-tools in update 3 and update 4.

Hubert also notes how to find overlapping time ranges and how to find the number of ranges a time belongs to in time ranges in postgresql – part 2. Though written for PostgreSQL, both posts can easily be applied to another DBMS. In the same vein, Yves Trudeau shares the DBMS-independent graphical images of Unix memory usage in Generating graphs from vmstat output.

Jeromy McMahon posts sample SQL code for viewing Oracle extent segments for tablespaces, temporary spaces and sort segment space. The Cheap DBA gets Oracle specific with a Slick Shell Script for Reporting on Oracle Workload. Krister Axel of codeboxer.com has A really clean dynamic insert proc for PL/SQL ETL packages, including validation checking and exception handling. zillablog’s Robert Treat treats us to a function for tracking plperl shared variables.

Jen M is Keeping IT simple by coding capacity measurements to show How Not to Outgrow Your DB Infra: A Simple Step. She follows up with more code to monitor a specific cache to resolve unexplainable slowness/resource leak in SQL Server.

This post began with a conference, and so it shall conclude. The Call For Proposals for PgCon 2008 is underway, and David Fetter lets us know that PgCon 2008 will be held May 22-23 at the University of Ottawa. This is different from Joshua Drake’s call for volunteers for Command Prompt’s Postgresql Conference East 08, on March 28-29 at the University of Maryland. Neil Conway informs us of a Jim Gray Tribute, consisting of a general session and 9 half-hour technical sessions reviewing some of the 1998 Turing Award winner’s work.

In case this edition did not give you enough to read, Beth Breidenbach of Confessions of a Database Geek created an aggregate blog feed for posts relating to information quality.

Top 10 MySQL Best Practices

So, O’Reilly’s ONLamp.com has published the “Top 10 MySQL Best Practices” at http://www.onlamp.com/pub/a/onlamp/2002/07/11/MySQLtips.html. Sadly, I find most “best practice” lists do not explain the “why” thoroughly enough for people to make their own decisions.

For instance, #3 is “Protect the MySQL installation directory from access by other users.” I was intrigued by what they would consider the “installation” directory. Reading the tip, they actually mean the data directory. They say nothing of the log directory, nor that InnoDB data files may be in different places than the standard MyISAM data directories.

They perpetuate a myth in #4, “Don’t store binary data in MySQL.” What they really mean is “don’t store large data in MySQL”, which they go into in the tip. While it’s true that there is very little benefit to having binary data in a database, they don’t go into what those benefits are. This means that people can’t make informed decisions, just “the best practice is this so I’m doing it.”

The benefit of putting binary data in MySQL is to be able to associate metadata and other data. For instance, “user 200 owns file 483”. If user 200 is gone from the system, how can you make sure file 483 is as well? There’s no referential integrity unless it’s in the database. While it’s true that in most cases people would rather sacrifice the referential integrity for things like faster database backups and easier partitioning of large data objects, I believe in giving people full disclosure so they can make their own informed decision.
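
To make this concrete, here is a minimal sketch of the referential-integrity argument, using hypothetical users and files tables (InnoDB enforces the foreign key; MyISAM would not):

CREATE TABLE users (
  uid INT NOT NULL PRIMARY KEY
) ENGINE=InnoDB;

CREATE TABLE files (
  fileId INT NOT NULL PRIMARY KEY,
  ownerUid INT NOT NULL,
  contents LONGBLOB,
  FOREIGN KEY (ownerUid) REFERENCES users (uid) ON DELETE CASCADE
) ENGINE=InnoDB;

-- removing user 200 automatically removes file 483 along with the rest of their files
DELETE FROM users WHERE uid = 200;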

#5 is my biggest pet peeve. “Stick to ANSI SQL,” with the goal being to be able to migrate to a different platform without having to rewrite the code. Does anyone tell Oracle folks not to use pl/sql features like collections? Nobody says “SQL is a declarative language, pl/sql is procedural, therefore you should never use it”. Does anyone tell SQL Server folks not to use Transact-SQL statements like WAITFOR? MATCH… AGAINST is not standard SQL, so I should never use it?
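
For the record, here is the kind of MySQL-specific construct tip #5 would have you give up, sketched against a hypothetical articles table (fulltext indexes are MyISAM-only in the 4.x/5.0 era):

CREATE TABLE articles (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(200),
  body TEXT,
  FULLTEXT (title, body)
) ENGINE=MyISAM;

-- non-standard SQL, but there is no ANSI equivalent with comparable performance
SELECT id, title FROM articles WHERE MATCH (title, body) AGAINST ('replication');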

Now, of course, if you’re selling a product to be run on different database platforms, then sure, you want to be platform agnostic. But you’d know that from the start. And if you have to migrate platforms you’re going to have to do lots of work anyway, because there are third-party additions to all the software.

And why would *anyone* choose a specific database, and then *not* use those features? I think that it’s a good tip to stick to ANSI SQL if you *know* you want to, or if you have no idea about the DBMS you’re using.

If you want to see how this cripples MySQL, check out Visibone’s SQL chart at: http://www.visibone.com/sql/chart_1200.jpg — you can buy it here: http://sheeri.com/archives/104. I may post later on about my own personal MySQL Best Practices….

Virtualization and MySQL

So, the article at:

http://mysql-dba-journey.blogspot.com/2007/11/mysql-and-vmware.html says:

Don’t get seduced to the dark side unless you understand all the issues.

And that’s wonderful and all, but… what are all the issues? What are some of the issues? Is it related more to VMware, or more to MySQL, or more to MySQL on VMware? Is it something like “VMware isn’t stable” or more like “load testing on VMware isn’t always going to work because you won’t have full resources”?

Many people talk about using virtualization for development and testing… but if you develop and test on a virtual machine and then put it on a physical machine for production, isn’t that basically having differing environments for dev/testing and production, which is usually seen as bad? If a line of code crashes a virtual machine but is fine on production, is it worth tracking the bug down? How many hours will you spend doing that?

Also, how is using a virtual machine better/worse/different from using something like mysqld_multi on a machine with many IPs, or other strategies folks use in dev/test so they don’t have to buy the exact same hardware as in production, but still have the same separation of databases, etc?

Statistics Gathering Script

I posted some code to the MySQL Forge at:

http://forge.mysql.com/snippets/view.php?id=52

Basically I gather some stats on my webservers in a very crude way. It runs a bunch of commands on the commandline (I run it every 5 minutes from a cron script). Please note that it adds to the load of the database server by running these commands, and it connects a few times to the database, so it actually increases things like “total # of connections” by the sheer fact that it runs.

This should run on most Unix machines, running most versions of MySQL. Please comment to this entry if you find something wrong or an incompatibility.

An explanation of the code follows:

This script will produce a string that looks like this:
('value','value','value',123,'value'),
and all you have to do is add a line at the top: “INSERT INTO table (field1,field2,field3,field4,field5) VALUES”

and change the last comma to a semicolon. Then you will have something suitable for entering into a database, for later processing.
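
A quick illustration, with a hypothetical table and a trimmed-down column list (my real header line appears at the end of this post); the finished file would look something like:

INSERT INTO stats (host, statsTime, threads, slowQueries, qps) VALUES
('db1','2007-11-23 12:00:00',1,32,1.884),
('db1','2007-11-23 12:05:00',1,32,1.902);

Note the comma after the first row and the semicolon after the last.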

So here is an explanation of the lines of code, in order — the first few set variables:
#!/bin/sh
DB=db_here
USER=user_here
PASS=pass_here
HOST=`/bin/hostname | /bin/cut -f1 -d\.`

/bin/echo \(\'$HOST\',\'\

“echo” prints all of this out (at the end we >> the output to a file). \ is the escape character, and the line starts with “(”. Then we put the hostname in quotes, print a comma, and then an opening quote. The final \ means “continue on the next line”; it shows up on every line hereafter, so we can build one long string and then save it to a file.

`/bin/date +%Y-%m-%d\ %T`\',\

Putting backticks (`) around a command ensures that it gets run first, and then echo uses the output. Otherwise echo would simply print out “/bin/date +%Y…”

This gets the date and time in MySQL’s date format (‘YYYY-MM-DD HH:MM:SS’), and puts in the ‘ to end the quoted field and a comma.

`/usr/bin/mysql -u $USER -p$PASS -e "\s"| /bin/awk '/Threads/ {print $2","$7","$20}'`,\

If you type \s at the database prompt, you will get some information that looks like this:


mysql> \s
--------------
mysql Ver 12.22 Distrib 4.0.20, for sun-solaris2.8 (sparc)
Connection id: 296486
Current database:
Current user: root@localhost
SSL: Not in use
Current pager: stdout
Using outfile: ''
Server version: 4.0.20-log
Protocol version: 10
Connection: Localhost via UNIX socket
Client characterset: latin1
Server characterset: latin1
UNIX socket: /tmp/mysql.sock
Uptime: 57 days 8 hours 6 min 25 sec
Threads: 1 Questions: 9331360 Slow queries: 32 Opens: 19372 Flush tables: 1 Open tables: 64 Queries per second avg: 1.884
--------------

This gets fields 2, 7 and 20 (fields are delimited by spaces) from the line marked “Threads” — field 2 is “1”, the number of threads; field 7 is “32”, the number of slow queries; and field 20 is “1.884”, the “Queries per second avg”. So this line gets 3 values. Since they are numbers, they do not need to be quoted*.

`cat /proc/loadavg | /bin/awk '{print $1}'`,\

/proc has some great information in it, but you have to use cat to view the files. This gets the current load average, the first of the three values reported by “uptime”. Again, this is a number and does not need to be quoted.*

`cat /proc/meminfo | /bin/awk '/MemTotal|MemFree|SwapTotal|SwapFree/ {print $2","}'` \

This prints the 2nd field (and a comma separation) from /proc/meminfo from the lines that deal with total memory, free memory, total swap and free swap space. Again, these are numbers and do not require quoting.*

`/usr/bin/mysql -u $USER -p$PASS $DB -e "show table status\G" | /bin/awk '/InnoDB free:/ {print $4}' | /usr/bin/head -1`,\

This is a very hacky way to find how much space is left in the InnoDB tablespace. I use one set of InnoDB data files for all the InnoDB tables I have, so I can just take the free space from the first table, and that is OK. If you are using 1 file per table, you will want to get each value.

`/usr/bin/mysql -u $USER -p$PASS $DB -e "show status" | /bin/awk '/Bytes_received\W|Bytes_sent\W|Com_delete\W|Com_insert\W|Com_replace\W|Com_select\W|Com_update\W|Connections\W|Questions\W|Uptime\W/ {printf $2 ","}'`\

This gets lots of information from the SHOW STATUS command of MySQL. The Com_* commands are the counts, so Com_insert is the count of inserts (since the database instance started, so you can do Com_insert/Uptime to get inserts/second if you want).
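
For example, a quick back-of-the-envelope rate from those counters (SHOW GLOBAL STATUS works on 5.0; on older servers plain SHOW STATUS gives the same numbers):

SHOW GLOBAL STATUS LIKE 'Com_insert'; -- e.g. 1234567
SHOW GLOBAL STATUS LIKE 'Uptime';     -- e.g. 4953985 (seconds)
-- inserts per second since startup = Com_insert / Uptime, here about 0.25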

Connections is the total number of connections (not to be confused with Threads, which is the current # of threads connected). Questions is total number of queries.

I wrote this in fits and starts, and above I parsed some output of “\s”. However, those numbers can also be obtained from the “SHOW STATUS” variables Slow_queries, Threads_connected and Questions (divided by Uptime). Because I am lazy, I have not changed my script, but if you added them to this line you could decrease the amount of data MySQL sends and lessen the impact of this script by making one less connection.

There are TONS of variables in “SHOW STATUS” that are interesting and you might want to track. Decide what you like…and you can always add them in later!

`df -k /var | /bin/awk '/dev/ {print $2 "," $4}'`,\

This just looks at the space of /var, which is where my company keeps our mysql data files.

`df -k /var/log | /bin/awk '/dev/ {print $2 "," $4}'`,\

This just looks at the space of /var/log — where my company keeps our mysql logfiles.

`/usr/bin/mysql -u $USER -p$PASS $DB -e "SELECT COUNT(DISTINCT uid) FROM Sessions\G" | /bin/awk '/COUNT/ {print $3}' `,\

This gets information from a particular table in our system — it shows how many people are online. This is useful for us to gather because now we can see how the statistics we gather are related to how many people are logged in.

`/usr/bin/iostat -dk | /bin/awk '/sda/ {print $2","$5","$6}'`\),\

This requires the “sysstat” package, and this line retrieves the I/O (disk) transactions per second, the KB read and the KB written to the disk.

>> /home/care/`echo $HOST`perf.csv

And this is the last line — finally, at last, we save it to a file.

PLEASE feel free to modify this and make it cleaner……obviously you will want to apply it to your own situation and what you want to gather.

I run this as a local script on each database machine, because it gets information using the commandline. I save everything to a file and do a weekly import. Theoretically I could have it save directly to a reporting database, but I do not want any dependence on network, just in case. You may want to have the process more automatic.

Weekly I have a cron job to “rotate” the .csv file: copy the file locally, copy it to a remote server, and then delete all but the first line (the first line being “INSERT INTO tbl VALUES(…..”).

(for those interested, this is how I delete all but the first line. Oh so hacky….)
# short hostname again, then keep only line 1 of the .csv (the header) via a temp file
HOST=`/bin/hostname | /bin/cut -f1 -d\.`
head -1 /home/care/`echo $HOST`perf.csv > /tmp/temp; mv /tmp/temp /home/care/`echo $HOST`perf.csv;

And for those who care, the following is the first line of my file:

REPLACE INTO dbstats (serverName, statsTime, procs, slowQueries, queriespersec, serverLoad,
  MemTotal, MemFree, SwapTotal, SwapFree, ibFree, bytesRec, bytesSent, deletes, inserts,
  replaces, selects, updates, connections, queries, mysqlUptime, sizeKBvar, availKBvar,
  sizeKBvarlog, availKBvarlog, Sessions, tps, kbRead, kbWrtn) VALUES


* While numbers do not need to be quoted, if there’s ever a chance that a number may end up blank instead of 0, it should be quoted: MySQL does not handle “VALUES ('foo',,'bar')” and throws an error.
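
To illustrate, with a hypothetical table t:

INSERT INTO t (a, b, c) VALUES ('foo',,'bar');   -- syntax error
INSERT INTO t (a, b, c) VALUES ('foo','','bar'); -- accepted; the blank becomes an empty
                                                 -- string (or 0 for a numeric column in
                                                 -- non-strict mode)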

Option and Variable Matrix

SHOW STATUS and SHOW VARIABLES are very useful features for analyzing server performance, server setup and general debugging. Many times, there are items applicable to a scenario in both lists. Further, sometimes it’s difficult to know whether an option or variable can be changed on the commandline or in an option file.

For example, let’s say you are debugging a lot of aborted connects. You may care about the following items in “SHOW STATUS”:

Aborted_connects
Connections
Max_used_connections
Threads_connected
Uptime

And the following items in SHOW VARIABLES:

connect_timeout
init_connect
max_connect_errors
max_connections
max_user_connections

You might compare max_connections to max_used_connections, for example. Let’s say you realize you want to increase max_connections from 100 to 200 — can you do it dynamically on the commandline? Do you have to put it in an option file and restart?
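
For this particular variable the answer happens to be yes; a quick sketch, assuming you have the SUPER privilege:

SHOW VARIABLES LIKE 'max_connections';          -- e.g. 100
SHOW GLOBAL STATUS LIKE 'Max_used_connections'; -- how close you have come to the limit
SET GLOBAL max_connections = 200; -- takes effect immediately, but is lost on restart
                                  -- unless you also put it in an option file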

Now MySQL has a matrix of variables and options that easily shows what can be changed where:

http://dev.mysql.com/doc/refman/5.0/en/mysqld-option-tables.html

As a bonus, many of the items in the list have links to the section that explains what they do.

This is a great reference — the matrix is only for 5.0 and up, but most of it likely applies to below 5.0 — just that some options may not be available. At any rate, you can use the 5.0 matrix, click through to the variable/option’s description, and then use the link on the left-hand side to go to the “3.23, 4.0, 4.1” documentation, which will give you the correct information for that version.

I know this will save me much time.

Good Ideas Implemented Poorly

Please share good ideas implemented poorly you have seen.

For example, in the past 24 hours I have come across 2 such beasts:

1) A company doing vertical scaling based on username. Unsurprisingly, they found that 70% of their users had usernames beginning with 1 of 2 letters. A better idea? Do it by the last 2 digits of the user id. That way, even though users are created and perhaps deleted all the time, the distribution is effectively random. Language is not.

With 2 significant figures, you can divide vertically many different ways — 1 large table of all the uids, 2 tables of uids ending in 00-49 and 50-99, 4 tables, 10 tables, 25 tables, 50 tables, 100 tables. This eliminates the need for a separate table for each user.

For example, the online dating site I work for uses this method for messaging. So a user’s inbox is just one part of a table. If the user has uid 12345, their “Inbox” is the result of SELECT * FROM Inbox_45 WHERE toUid=12345. Their “Sent” box is the result of SELECT * FROM Inbox_45 WHERE fromUid=12345. We have over 650,000 active users; I can only imagine what horrors we would have if we had 1 table for each person’s mailbox (much less 1 each for Trash, Sent, Saved and Inbox).
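
The table suffix is just the last two digits of the uid, zero-padded. The application computes it before building the query, but in SQL terms it is simply:

SELECT CONCAT('Inbox_', LPAD(12345 MOD 100, 2, '0')); -- 'Inbox_45'
SELECT CONCAT('Inbox_', LPAD(207 MOD 100, 2, '0'));   -- 'Inbox_07'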

This also helps when using InnoDB tables, as we are — the count(*) query to find the number of messages in your boxes uses the indexes on toUid and fromUid on the tables. 1 table per uid would not take advantage of this index, although it would probably be a bit faster in terms of retrieval. We kept this idea in mind — if we had one table per uid, we probably could use MyISAM tables, and counts would be faster. But again, horrors with 650,000 tables!

Also, you want to keep your code as flexible as possible. Do not hard code which servers have which tables — use SHOW TABLES in scripts and such. We truncate trashes nightly, and simply have a loop that says:


# a runnable version of the pseudocode: list the Trash_* tables and truncate each one
# (assumes USER, PASS and DB are set, as in the statistics script from an earlier post)
for t in `/usr/bin/mysql -u $USER -p$PASS $DB -N -e "SHOW TABLES LIKE 'Trash\_%'"`
do
    /usr/bin/mysql -u $USER -p$PASS $DB -e "TRUNCATE TABLE $t"
done

2) The lead developer on a new project had the rule of “each table has a class”, which led another developer to come to me asking to resolve a problem. You guessed it — tables in different databases that have the same name. We have a static database, with a table called “Preferences”. This is a joining table, that says “preference #1 is always blah”. And in our profiles database for a site, we also have a “Preferences” table — this is where we associate users with their preferences.

The goal is world domination, er, being able to have multiple sites use similar codebases and data schemas. Therefore, it is perfectly reasonable to have databases with identical schemas and different names. Site1 can have an identical schema and codebase to Site2, storing the data in the “Site1” and “Site2” databases. The reason to have 2 separate databases is that we want to avoid having to record which site the data refers to as part of each row.

Any other examples of good ideas implemented poorly? Feel free to make an entry, but if it’s not on PlanetMySQL.org please put a link in the comments so I’m sure to actually read it.

Followers may note I fell way short of my goal of 1 post per day in May; in fact, I had 15 posts in May, not the 31 I had hoped. I’m OK with that, and I apologize for the long radio silence. In the meantime, I had a wonderful few weeks, including biking on the island of Martha’s Vineyard, and a wonderful phone conversation with Frank Mash last night.

If anyone cares to know, my Skype name is awfief. I’m happy to talk to folks; I’m GMT -5 (Eastern Standard Time, Daylight Savings when appropriate). I also have a tendency to leave the program on while I’m in the room, so feel free to ring (instead of Skype Chat) to see if I’m around.

InnoDB information

SHOW ENGINE INNODB STATUS shows current InnoDB status, including deadlocks. I was thinking a while back it would be nice to have a log of all deadlocks. Well, I got my answer:

http://dev.mysql.com/doc/refman/5.0/en/innodb-monitor.html

To cause the standard InnoDB Monitor to write to the standard output of mysqld, use the following SQL statement:

CREATE TABLE innodb_monitor (a INT) ENGINE=INNODB;

The monitor can be stopped by issuing the following statement:

DROP TABLE innodb_monitor;
