Top 5 MySQL Community Wishes

As the 2007 Community Advocate of the Year, I’m taking the “MySQL 5 Wishes” meme and changing it a bit. I hope y’all don’t mind:

1) Everyone has a different level of familiarity. The community does well with this when writing articles, for instance cross-referencing older articles, linking to documentation, the MySQL Forge, etc. Not much background information other than “MySQL usage” is assumed.

However, where we fall down is when we aggregate some writings and call it documentation. The worst form of this is a tool that grows organically, from “look, here’s a script!” to a full-blown tool/patch/add-on. Sourceforge stinks for trying to make documentation, so most folks just link to their posts tagged “mytool” or whatever the name is.

Using some marketing skills would be wonderful — make a page for folks who have never seen one post about it. Voila, you get your code going from something that people only learn when someone else tells them, to something folks wind up getting as a result of a search.

2) Along those lines, MySQL provides us with some great tools that we rarely use. When was the last time you linked your presentation to the MySQL Forge Wiki at http://forge.mysql.com/wiki/Main_Page? It took me a long time to make Technocation’s MySQL 2007 Conference Video page at http://technocation.org/content/2007-mysql-user-conference-and-expo-presentations-and-videos — Even after all the video was edited, I had to make the page.

How much easier would it have been if the descriptions, slides, handouts, video and audio were all available in one place? Obviously we can’t hack on the O’Reilly site, but there’s nothing to say that we can’t make a wiki site with everything about a presentation in one place — including links to everyone’s notes! Make it so that 5 years from now a person learning MySQL can find what they need, when they don’t have the same time/date context that we do.

3) Use (and appreciate) what we have. We have great software, sure. But we also have a company full of folks willing to talk to us. We can complain about the fact that even simple patches from non-employees take several months or a year or so to get into the code, because of existing coding conventions, etc. We can be annoyed that we have to download 7 addons for our software, but instead of saying MySQL should offer them for download in the same package (which of course they should, all the code should integrate nicely, and we should be able to turn on features we want and turn off or not use those we don’t)…

…we can help that by making a centralized repository of MySQL addons. Run by the community, for the community. On the forge. At the very least we can make an index page of the neat tools we’ve created or found for MySQL and categorize them. Think of how plugins for software such as Firefox have repositories.

4) Volunteer unexpectedly. Got a presentation that didn’t make the cut for the 2007 MySQL Users Conference? Offer to present it at a local user group. Don’t have a local user group? Record the presentation as a lecture and post it online. Alternatively, make a local user group. Do what you’re mostly comfortable with — don’t always stay in your comfort zone, push it a little. Maybe it means volunteering to help the MySQL documentation get a bit better. Contact someone you know in MySQL (or just put the word out in a blog post) that you’d like to help _________ get better, and you’re sure to find a few takers.

5) Contribute! OK, many already do this at http://www.planetmysql.org. But consider contributing to:

How Much of a MySQL Geek Am I?

So, this is me:

Special thanx to Colin Charles for taking the picture and linking to it from his blog.

Notice that, in addition to my photogenic qualities and the bags under my eyes, I’m wearing an incredibly geeky necklace.

Yes, it’s true. I bought a white gold dolphin to wear around my neck, because I am THAT much of a MySQL geek.

OurSQL Episode 17: Hashing it out

In this episode we tackle what a hash looks like in terms of a data structure, in preparation for next episode’s discussion on the difference between hashes and btree indexes, and what kind of indexes are good for what kind of optimizations.
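
As a quick taste of what the episode covers, here is a minimal sketch (my own, not from the show) using MySQL’s MEMORY engine, which supports both index types; the table and column names are made up:

CREATE TABLE session_hash (
  session_id CHAR(32) NOT NULL,
  user_id INT UNSIGNED NOT NULL,
  PRIMARY KEY USING HASH (session_id)   -- hash: great for exact lookups (WHERE session_id = '...')
) ENGINE=MEMORY;

CREATE TABLE session_btree (
  session_id CHAR(32) NOT NULL,
  user_id INT UNSIGNED NOT NULL,
  PRIMARY KEY USING BTREE (session_id)  -- btree: also handles ranges and sorted scans
) ENGINE=MEMORY;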

Show Notes:
Direct play this episode at:
http://technocation.org/content/oursql-episode-17%3A-hashing-it-out-0

Download all podcasts at:
http://technocation.org/podcasts/oursql/

Subscribe to the podcast at:
http://feeds.feedburner.com/oursql

News:
MySQL Connector/NET 5.1.1 released:
http://tinyurl.com/23a9ax

Download the new Connector/NET version:
http://dev.mysql.com/downloads/connector/net/5.1.html

MySQL 5.0.x security vulnerability:
http://bugs.mysql.com/bug.php?id=27513
Solution: upgrade to 5.0.40. This bug is not known to affect major versions 3 or 4.

Learning Resource:

http://onlinesolutionsmysql.blogspot.com/

The dates for all the sessions:

* 27th March: Part 1 – High Availability and Scalability Architectures
* 19th April: Part 2 – Advanced Scalability Solutions
* 2nd May: Part 3 – MySQL Enterprise To Control Mission Critical Online Services
* 23rd May: Part 4 – 99.999% High Availability solutions
* 13th June: Part 5 – MySQL Enterprise performance and benchmarking
* 27th June: Part 6 – Advanced HA solutions

Find all the material and documentation for past webinars at:
http://onlinesolutionsmysql.blogspot.com/2007/03/links-to-material-and-documentation.html

Feature: Hash tables explained.

http://www.sparknotes.com/cs/searching/hashtables/section1.html

http://www.cs.sunysb.edu/~algorith/lectures-good/node7.html (search for “Hash Tables” on the page)

Feedback:

Email podcast@technocation.org

call the comment line at +1 617-674-2369

use Odeo to leave a voice mail through your computer:
http://odeo.com/sendmeamessage/Sheeri

Or use the Technocation forums:
http://technocation.org/forum

SQL performance tips for podcast

Things to avoid (config)
10. --skip-name-resolve
13. increase temp table size (tmp_table_size, default 32MB) in a data warehousing environment so temporary tables don’t spill to disk (also constrained by max_heap_table_size, default 16MB; see the sketch after this list)
4. if you can, compress text/blobs
5. compress static data
6. don’t back up static data as often
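
Here’s a minimal sketch of the temp-table settings mentioned above (the values are made up, not recommendations); note that --skip-name-resolve is a startup option in my.cnf, not something you can SET:

-- both caps apply; the smaller of tmp_table_size and max_heap_table_size wins
SET GLOBAL tmp_table_size      = 256 * 1024 * 1024;
SET GLOBAL max_heap_table_size = 256 * 1024 * 1024;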

Things to avoid: schema
12. Separate text/blobs from metadata; don’t put text/blobs in results if you don’t need them (see the sketch after this list)
18. Redundant data is redundant
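
A minimal sketch of splitting text/blobs away from the metadata you scan often (table and column names are made up):

CREATE TABLE article_meta (
  article_id INT UNSIGNED NOT NULL PRIMARY KEY,
  title      VARCHAR(200) NOT NULL,
  published  DATETIME NOT NULL
) ENGINE=MyISAM;

CREATE TABLE article_body (
  article_id INT UNSIGNED NOT NULL PRIMARY KEY,  -- 1:1 with article_meta
  body       MEDIUMTEXT NOT NULL
) ENGINE=MyISAM;

-- list queries touch only the narrow table; fetch the body only when needed
SELECT title, published FROM article_meta ORDER BY published DESC LIMIT 20;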

Top 1000 SQL Performance Tips

Interactive session from MySQL Camp I:

Specific Query Performance Tips (see also database design tips for tips on indexes):

1. Use EXPLAIN to profile the query execution plan (see the sketch after this list)
2. Use Slow Query Log (always have it on!)
3. Don’t use DISTINCT when you have or could use GROUP BY
4. Insert performance
1. Batch INSERT and REPLACE
2. Use LOAD DATA instead of INSERT
5. LIMIT m,n may not be as fast as it sounds
6. Don’t use ORDER BY RAND() if you have > ~2K records
13. Derived tables (subqueries in the FROM clause) can be useful for retrieving BLOBs without sorting them. (A self-join can speed up a query if the 1st part finds the IDs and uses them to fetch the rest)
15. Know when to split a complex query and join smaller ones
16. Delete small amounts at a time if you can
18. Have good SQL query standards
19. Don’t use deprecated features
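
A couple of hedged sketches of tips 1 and 4.1 above (the table is the hypothetical article_meta from the earlier schema sketch):

EXPLAIN SELECT title FROM article_meta WHERE published >= '2007-04-01';

-- one multi-row INSERT instead of three single-row statements
INSERT INTO article_meta (article_id, title, published) VALUES
  (1, 'First',  '2007-04-01 10:00:00'),
  (2, 'Second', '2007-04-02 10:00:00'),
  (3, 'Third',  '2007-04-03 10:00:00');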

Scaling Performance Tips:

1. Use benchmarking
2. isolate workloads; don’t let administrative work (ie, backups) interfere with customer performance
3. Debugging sucks, testing rocks!

Network Performance Tips:

1. Minimize traffic by fetching only what you need.
1. Paging/chunked data retrieval to limit
3. Be wary of lots of small quick queries if a longer query can be more efficient
2. Use multi_query if appropriate to reduce round-trips

OS Performance Tips:

1. Use proper data partitions
1. For Cluster. Start thinking about Cluster *before* you need them
2. Keep the database host as clean as possible. Do you really need a windowing system on that server?
3. Utilize the strengths of the OS
4. pare down cron scripts
5. create a test environment
6. source control schema and config files
7. for LVM innodb backups, restore to a different instance of MySQL so Innodb can roll forward
8. partition appropriately
9. partition your database when you have real data — do not assume you know your dataset until you have real data

MySQL Server Overall Tips:

1. innodb_flush_log_at_trx_commit=0 can help slave lag
2. Optimize for data types, use consistent data types. Use PROCEDURE ANALYSE() to help determine the smallest data type for your needs.
3. use optimistic locking, not pessimistic locking; try to use shared locks, not exclusive locks (LOCK IN SHARE MODE vs. SELECT … FOR UPDATE)
8. config params — http://docs.cellblue.nl/easy_mysql_performance_tweaks/ is a good reference
9. Config variables & tips:
1. use one of the supplied config files
2. key_buffer, unix cache (leave some RAM free), per-connection variables, innodb memory variables
3. be aware of global vs. per-connection variables
4. check SHOW STATUS and SHOW VARIABLES (GLOBAL|SESSION in 5.0 and up; see the sketch after this list)
5. be aware of swapping esp. with Linux, “swappiness” (bypass OS filecache for innodb data files, innodb_flush_method=O_DIRECT if possible (this is also OS specific))
6. defragment tables, rebuild indexes, do table maintenance
7. If you use innodb_flush_log_at_trx_commit=1, use a battery-backed hardware write cache controller
8. more RAM is good, and so is faster disk speed
9. use 64-bit architectures
11. increase myisam_sort_buffer_size to optimize large inserts (this is a per-connection variable)
12. (look up) memory tuning parameter for on-insert caching
14. Run with a strict SQL_MODE (STRICT_ALL_TABLES or STRICT_TRANS_TABLES) to help identify warnings
15. /tmp dir on battery-backed write cache
16. consider battery-backed RAM for innodb logfiles
17. use --safe-updates for the client
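
A minimal sketch of tip 4 above, plus raising a per-connection variable for one session only (the value is made up):

SHOW GLOBAL STATUS LIKE 'Created_tmp%';
SHOW GLOBAL VARIABLES LIKE 'key_buffer_size';
SET SESSION myisam_sort_buffer_size = 64 * 1024 * 1024;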

Storage Engine Performance Tips:
2. Utilize different storage engines on master/slave ie, if you need fulltext indexing on a table.
3. BLACKHOLE engine and replication is much faster than FEDERATED tables for things like logs.
4. Know your storage engines and what performs best for your needs, know that different ones exist.
1. ie, use MERGE tables or ARCHIVE tables for logs
2. Archive old data — don’t be a pack-rat! 2 common engines for this are ARCHIVE tables and MERGE tables (see the sketch after this list)
5. use row-level instead of table-level locking for OLTP workloads
6. try out a few schemas and storage engines in your test environment before picking one.
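
A hedged sketch of archiving old rows into an ARCHIVE table (table and column names are made up); CREATE … SELECT copies data but not indexes, which suits ARCHIVE’s limitations:

CREATE TABLE access_log_2006 ENGINE=ARCHIVE
  SELECT * FROM access_log WHERE log_date < '2007-01-01';
DELETE FROM access_log WHERE log_date < '2007-01-01';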

Database Design Performance Tips:

1. Design sane query schemas. don’t be afraid of table joins, often they are faster than denormalization
2. Don’t use boolean flags
8. Use a clever key and ORDER BY instead of MAX
9. Normalize first, and denormalize where appropriate.
10. Databases are not spreadsheets, even though Access really really looks like one. Then again, Access isn’t a real database

11. use INET_ATON() and INET_NTOA() for IP addresses, not CHAR or VARCHAR (see the sketch after this list)
12. make it a habit to REVERSE() email addresses, so you can easily search domains (this will help avoid wildcards at the start of LIKE queries if you want to find everyone whose e-mail is in a certain domain)
13. A nullable column can take more room to store than a NOT NULL column
14. Choose appropriate character sets & collations — UTF16 will store each character in 2 bytes, whether it needs it or not, latin1 is faster than UTF8.
15. Use Triggers wisely
16. use min_rows and max_rows to specify approximate data size so space can be pre-allocated and reference points can be calculated.
18. Use the PACK_KEYS=1 table option to pack int keys (MyISAM)
19. be able to change your schema without ruining functionality of your code
20. segregate tables/databases that benefit from different configuration variables
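
Hedged sketches of tips 11 and 12 above (table and column names are made up):

-- store IPs as INT UNSIGNED, not strings
INSERT INTO visits (ip) VALUES (INET_ATON('192.168.1.10'));
SELECT INET_NTOA(ip) FROM visits;

-- store the e-mail reversed so a domain search can use an index prefix
INSERT INTO users (email_rev) VALUES (REVERSE('sheeri@example.com'));
SELECT REVERSE(email_rev) AS email FROM users
WHERE email_rev LIKE CONCAT(REVERSE('@example.com'), '%');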

Other:

1. Hire a MySQL ™ Certified DBA
2. Know that there are many consulting companies out there that can help, as well as MySQL’s Professional Services.
3. Read and post to MySQL Planet at http://www.mysqlplanet.org
4. Attend the yearly MySQL Conference and Expo or other conferences with MySQL tracks (link to the conference here)
5. Support your local User Group (link to forge page w/user groups here)

Rebuild indexes and why
20. Turning OR on multiple index fields (<5.0) into UNION may speed things up (with LIMIT); after 5.0 the index_merge should pick it up (see the sketch after this list).
UNION was introduced in MySQL 4.0.
11. ORDER BY and LIMIT work best with equalities and covered indexes
1. InnoDB ALWAYS keeps the primary key as part of each index, so do not make the primary key very large
3. Use Indexes
4. Don’t Index Everything
5. Do not duplicate indexes
6. Do not use large columns in indexes if the ratio of SELECTs:INSERTs is low.
7. be careful of redundant columns in an index or across indexes
14. ALTER TABLE…ORDER BY can take data sorted chronologically and re-order it by a different field — this can make queries on that field run faster (maybe this goes in indexing?)
4. As your data grows, indexing may change (cardinality and selectivity change). Structuring may want to change. Make your schema as modular as your code. Make your code able to scale. Plan and embrace change, and get developers to do the same.
17. Use HASH indexing for indexing across columns with similar data prefixes
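
A minimal sketch of tip 20 above (table and column names are made up): pre-5.0, rewriting an OR across two differently indexed columns as a UNION lets each branch use its own index.

SELECT * FROM customers WHERE last_name = 'Cabral'
UNION
SELECT * FROM customers WHERE city = 'Boston';
-- instead of: SELECT * FROM customers WHERE last_name = 'Cabral' OR city = 'Boston';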

Used in Episode 2:
8. Avoid wildcards at the start of LIKE queries
21. Don’t use COUNT(*) on InnoDB tables for every search; do it a few times and/or use summary tables, or if you need it for the total # of rows, use SQL_CALC_FOUND_ROWS and SELECT FOUND_ROWS() (see the sketch after this list)
2. Don’t use SELECT *
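
A minimal sketch of tip 21 above (table and column names are made up):

SELECT SQL_CALC_FOUND_ROWS article_id, title
FROM article_meta
WHERE published >= '2007-01-01'
LIMIT 10;
SELECT FOUND_ROWS();  -- total matching rows, without re-running the search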

OurSQL Episode 22: Things to avoid

(in queries):
9. Avoid correlated subqueries in SELECT and WHERE clauses (try to avoid IN)
23. use groupwise maximum instead of subqueries
10. No calculated comparisons — isolate indexed columns (see the sketch after this list)
22. Use INSERT … ON DUPLICATE KEY UPDATE (or INSERT IGNORE) to avoid having to SELECT
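
A minimal sketch of tip 10 above (table and column names are made up): keep the indexed column bare so the index can actually be used.

-- slow: function applied to the indexed column
SELECT * FROM orders WHERE TO_DAYS(order_date) >= TO_DAYS(NOW()) - 30;
-- faster: the comparison is isolated on the column
SELECT * FROM orders WHERE order_date >= NOW() - INTERVAL 30 DAY;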

Part 2: Data Warehousing Tips and Tricks

Ask and you shall receive: http://face.centosprime.com/rdb-w/?p=68 linked to my previous post on the Data Warehousing Tips and Tricks session (http://sheeri.net/archives/204) with the comment, “I need to learn more about MERGE TABLES and INSERT … ON DUPLICATE KEY UPDATE“.

So here’s a bit more:

The manual pages for the MERGE storage engine:
http://dev.mysql.com/doc/refman/5.0/en/merge-storage-engine.html
and
http://dev.mysql.com/doc/refman/5.0/en/merge-table-problems.html

MySQL Forums for the MERGE table are at:
http://forums.mysql.com/list.php?93

In a nutshell, a MERGE table is really a set of pointers to similarly-schema’d MyISAM tables. So if you have the same table schema multiple times (ie, partition per day, so you have tables named 2007_04_27_Sales, 2007_04_26_Sales, etc) you’d use a MERGE table to link them all together and then you can run a query on the MERGE table and it will query all the tables that the MERGE table points to.
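
Here’s a minimal sketch of that setup (my own, using the hypothetical daily Sales tables from above):

CREATE TABLE `2007_04_26_Sales` (saleID INT UNSIGNED NOT NULL, amount DECIMAL(10,2)) ENGINE=MyISAM;
CREATE TABLE `2007_04_27_Sales` LIKE `2007_04_26_Sales`;

CREATE TABLE `All_Sales` (saleID INT UNSIGNED NOT NULL, amount DECIMAL(10,2))
  ENGINE=MERGE UNION=(`2007_04_26_Sales`, `2007_04_27_Sales`) INSERT_METHOD=LAST;

SELECT COUNT(*) FROM `All_Sales`;  -- queries every table in the UNION list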

As for INSERT . . . ON DUPLICATE KEY UPDATE

MySQL gives many ways to deal with INSERTs and unique/primary keys. If you do an INSERT and the primary key you are trying to insert is already in the table, MySQL will give an error. Ways to deal with this:

1) Try & catch errors in the application code.

2) Use INSERT IGNORE INTO . . . this will insert a new record if a record with the key does not exist. If it does exist, nothing happens. Simply add the word “IGNORE” into your INSERT query after INSERT and before INTO.

3) Use REPLACE INTO . . .this will insert a new record if a record with the key does not exist. If a record does exist, MySQL will *delete* the record and then INSERT your record. This can cause problems when you just want to update part of a row, and not insert the whole row again. And it changes timestamps and auto-increment numbers, which may not be a desired result. Simply change the word “INSERT” in your query to “REPLACE”.

4) Use INSERT . . .ON DUPLICATE KEY UPDATE. The syntax is the regular INSERT statement, and at the end add ON DUPLICATE KEY UPDATE [expression]. For instance,

INSERT INTO tbl (id,name,thing) VALUES (154,'sheeri','book') ON DUPLICATE KEY UPDATE thing='book';

and what makes it easier, if you have variables or whatever in your VALUES, you can actually set the update statement to say “just use the value I wanted to insert, OK?” as in the following:

INSERT INTO tbl (id,name,thing) VALUES (154,'sheeri','book') ON DUPLICATE KEY UPDATE thing=VALUES(thing);

Manual page:
http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html

Hope this helps!

Data Warehousing Tips and Tricks

It’s not easy to do a DW in MySQL — but it’s not impossible either. Easier to go to Teradata than to write your own.

DW characteristics:

1) Organic, evolves over time from OLTP systems — issues: locking, large queries, # of users.

2) Starts as a copy of OLTP, but changes over time — schema evolution, replication lag, duplicate data issues

3) Custom — designed from the ground up for DW — issues with getting it started, growth, aggregations, backup.

4) How do you update the data in the warehouse? — write/update/read/delete, write/read/delete, or write only — which means that roll out requires partitions or merge tables.

The secret to DW is partitioning — can be based on:
data — date, groups like department, company, etc.
functional — sales, HR, etc.
random — hash, mod on a primary key.

You can partition:
manually — unions, application logic, etc.
using MERGE tables and MyISAM
MySQL 5.1, using native partitions (see the sketch after this list)
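
A minimal sketch of the 5.1 native option (table and column names are made up):

CREATE TABLE sales_part (
  saleID    INT UNSIGNED NOT NULL,
  sale_date DATE NOT NULL,
  amount    DECIMAL(10,2)
) ENGINE=MyISAM
PARTITION BY RANGE (TO_DAYS(sale_date)) (
  PARTITION p200704 VALUES LESS THAN (TO_DAYS('2007-05-01')),
  PARTITION p200705 VALUES LESS THAN (TO_DAYS('2007-06-01')),
  PARTITION pmax    VALUES LESS THAN MAXVALUE
);

-- purging a whole partition is one cheap operation
ALTER TABLE sales_part DROP PARTITION p200704;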

You can load, back up and purge by partition, so try to keep that logic intact — if it takes too much work to load a partition because you’ve grouped it oddly, then your partitioning schema isn’t so great.

Make sure your partitioning is flexible — you need to plan for growth from day 1. So don’t just partition once and forget about it; make sure you can change the partitioning schema without too much trouble. Hash and modulo partitioning aren’t very flexible, and you have to restructure your data to change them.

Use MyISAM for data warehousing — 3-4 times faster than InnoDB, data 2-3 times smaller, MyISAM table files can be easily copied from one server to another, MERGE tables available only over MyISAM tables (scans are 10-15% faster with merge tables), and you can make read-only tables (compressed with indexes) to reduce data size further. ie, compress older data (a year ago, or a week ago if it doesn’t change!)

Issues for using MyISAM for DW — Table locking for high volumes of real-time data (concurrent inserts are allowed only when there are ONLY insertions going on, not deletions). This is where partitioning comes in! REPAIR TABLE also takes a long time — better to back up frequently, saving tables, loadsets and logs, and then instead of REPAIR TABLE do a point-in-time recovery. For a write-only DW, save your write loads and use them as part of your backup strategy.

Deletes will break concurrent inserts — delayed inserts still work, but they’re not as efficient. You also have to program that in, you can’t, say, replicate using INSERT DELAYED where the master had INSERT.

[Baron’s idea — take current data in InnoDB format, and UNION over other DW tables]

No index clustering for queries that need it — OPTIMIZE TABLE will fix this but it can take a long time to run.

When to use InnoDB — if you must have a high volume of realtime loads — InnoDB record locking is better.

If ALL of your queries can take advantage of index clustering — most or all queries access the data using the primary key (because all indexes are clustered together with the primary key, primary key lookups are much faster than regular non-primary key lookups in MyISAM). BUT this means you want to keep your primary keys small. Plus, the more indexes you have, the slower your inserts are, and more so because of the clustering.

MEMORY storage engine: Use it when you have smaller tables that aren’t updated very often; they’re faster and support hash indexes, which are better for doing single record lookups.

Store the data for the MEMORY engine twice, once in the MEMORY table and once in MyISAM or InnoDB; add queries to the MySQL init script to copy the data from the disk tables to the MEMORY tables upon restart, using --init-file=<file name>
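
A minimal sketch of what such an init file might contain (database and table names are made up):

-- my_init_file.sql: repopulate the MEMORY lookup table from its on-disk copy
INSERT INTO dw.lookup_mem SELECT * FROM dw.lookup_myisam;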

ARCHIVE storage engine — use to store older data. More compression than compressed MyISAM, fast inserts, 5.1 supports limited indexes, good performance for full table scans.

Nitro Storage Engine — very high INSERT rates w/ simultaneous queries. Ultra high performance on aggregate operations on index values. Doesn’t require 64-bit server, runs well on 32-bit machines. High performance scans on temporal data, can use partial indexes in ways other engines can’t. http://www.nitrosecurity.com

InfoBright Storage Engine — best compression of all storage engines — 10:1 compression, peak can be as high as 30:1 — includes equivalent of indexes for complex analysis queries. High batch load rates — up to 65GB per hour! Right now it’s Windows only, Linux and other to come. Very good performance for analysis type queries, even working with >5TB data. http://www.infobright.com

Backup — For small tables, just back up. Best option for large tables is copying the data files. If you have a write-only/roll out DB you only need to copy the newly added tables. So you don’t need to keep backing up the same data, just backup the new stuff. Or, just save the load sets. Just backup what changes, and partition smartly.

Tips:
Use INSERT . . . ON DUPLICATE KEY UPDATE to build aggregate tables, when the tables are very large and sorts go to disk, or when you need it real time.
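
A minimal sketch of building a daily aggregate that way (table and column names are made up; daily_sales has PRIMARY KEY (sale_date)):

INSERT INTO daily_sales (sale_date, total)
SELECT DATE(sold_at), SUM(amount)
FROM sales_staging
GROUP BY DATE(sold_at)
ON DUPLICATE KEY UPDATE total = total + VALUES(total);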

Emulating Star Schema Optimization & Hash Joins — MySQL doesn’t do these, except that MEMORY tables can use hash indexes. So use a MEMORY INDEX table and optimizer hints to manually do a star-schema-optimized hash join. Steps:

1) Create a query to filter the fact table
to select all sales from week 1-5 and display by region & store type:

SELECT D.week, S.totalsales, S.locationID, S.storeID
FROM sales S INNER JOIN date D USING (dateID)
WHERE D.week BETWEEN 1 AND 5;

Access only the tables you need for filtering the data, but select the foreign key ID’s.

2) Join the result from step 1 with other facts/tables needed for the report

SELECT R.week, R.totalsales, L.region, S.store_type
FROM (SELECT D.week, S.totalsales, S.locationID, S.storeID
      FROM sales S INNER JOIN date D USING (dateID)
      WHERE D.week BETWEEN 1 AND 5) AS R
INNER JOIN location AS L ON (L.locationID=R.locationID)
INNER JOIN store AS S ON (S.storeID=R.storeID);

3) Aggregate the results

SELECT R.week, L.region, S.store_type, SUM(R.totalsales) AS totalsales
FROM (SELECT D.week, S.totalsales, S.locationID, S.storeID
      FROM sales S INNER JOIN date D USING (dateID)
      WHERE D.week BETWEEN 1 AND 5) AS R
INNER JOIN location AS L ON (L.locationID=R.locationID)
INNER JOIN store AS S ON (S.storeID=R.storeID)
GROUP BY R.week, L.region, S.store_type;

Critical configuration options for DW — sort_buffer_size — used to do SELECT DISTINCT, GROUP BY, ORDER BY, UNION DISTINCT (or just UNION)

Watch the value of sort_merge_passes (more than 1 per second or 4-5 per minute) to see if you need to increase sort_buffer_size. sort_buffer_size is a PER-CONNECTION parameter, so don’t be too greedy… but it can also be increased dynamically before running a large query, and reduced afterwards.
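
A minimal sketch of bumping it for one big query only (the value is made up):

SET SESSION sort_buffer_size = 64 * 1024 * 1024;
SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id ORDER BY SUM(amount) DESC;
SET SESSION sort_buffer_size = DEFAULT;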

key_buffer_size – use multiple key buffer caches. Use different caches for hot, warm & cold indexes. Preload your key caches at server startup. Try to use 1/4 of memory (up to 4G per key buffer) for your total key buffer space. Monitor the cache hit rate by watching:

Read hit rate = key_reads/key_read_requests
Write hit rate = key_writes/key_write_requests
Key_reads & key_writes per second are also important.

# two named key caches (hot_cache, fred) plus the default key cache
hot_cache.key_buffer_size = 1G
fred.key_buffer_size = 1G
fred.key_cache_division_limit = 80
key_buffer_size = 2G
key_cache_division_limit = 60
init-file = my_init_file.sql

in the init file:

CACHE INDEX T1, T2, T3 INDEX (I1, I2) IN hot_cache;
CACHE INDEX T4, T5, T3 INDEX (I3, I4) IN fred;
LOAD INDEX INTO CACHE T1, T3 IGNORE LEAVES; -- use when the cache isn't big enough to hold the whole index
LOAD INDEX INTO CACHE T10, T11, T2, T4, T5;

http://dev.mysql.com/doc/refman/5.0/en/myisam-key-cache.html

This was implemented in MySQL 4.1.1

Temporary table sizes — monitor Created_tmp_disk_tables — more than a few per minute is bad; one a minute could be bad depending on the query. tmp tables start in memory and then go to disk… increase tmp_table_size and max_heap_table_size — this can be done per session, for queries that need >64MB or so of space.

ALWAYS turn on the slow query log! Keep the last few logs, and use mysqldumpslow to analyze queries daily. Best to have an automated script run mysqldumpslow and e-mail a report with the 10-25 worst queries.

Enable log_queries_not_using_indexes unless your DW is designed to use explicit full-table scans.

Learn what the explain plan output means & how the optimizer works:
http://forge.mysql.com/wiki/MySQL_Internals_Optimizer

Other key status variables to watch
select_scan — full scan of first table
select_full_join — # of joins doing full table scan ’cause not using indexes
sort_scan — # of sorts done by scanning the table
table_locks_waited
uptime

mysqladmin extended:
mysqladmin -u user -ppasswd extended-status -i60 -r | tee stats.log | grep -v '0'

(runs every 60 seconds, displays only status variables that have changed, and logs the full status to stats.log).

Japanese Character Set

There are too many Japanese characters to be able to use one byte to handle all of them.

Hiragana — over 50 characters

Katakana — over 50 characters

Kanji — over 6,000 characters

So the Japanese character set has to be multi-byte. JIS = Japanese Industrial Standards; this specifies it.

JIS X 0208 in 1990, updated in 1997 — covers widely used characters, not all characters
JIS X 0213 in 2000, updated in 2004

There are also vendor defined Japanese charsets — NEC Kanji and IBM Kanji — these supplement JIS X 0208.

Cellphone specific symbols have been introduced, so the # of characters is actually increasing!

For JIS X 0208, there are multiple encodings — Shift_JIS (all characters are 2 bytes), EUC-JP (most are 2 bytes, some are 3 bytes), and Unicode (all characters are 3 bytes, which makes people not want to use UTF-8 for Japanese). Shift_JIS is most widely used, but they are moving to Unicode gradually (Vista is using UTF-8 as the standard now). Each code mapping is different, with different hex values for the same character in different encodings.

Similarly, there are different encodings for the other charsets.

MySQL supports only some of these. (get the graph from the slides)

char_length() returns the length by # of characters, length() returns the length by # of bytes.
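
A quick illustration (my own, assuming a utf8 connection): each of these three Japanese characters takes 3 bytes in utf8.

SELECT CHAR_LENGTH(_utf8 '日本語') AS chars,  -- 3
       LENGTH(_utf8 '日本語')      AS bytes;  -- 9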

The connection charset and the server charset have to match otherwise…mojibake!

On Windows, Shift_JIS is the standard encoding; on Linux, EUC-JP is standard. So conversion may be needed.

MySQL Code Conversion algorithm — UCS-2 facilitates conversion between encodings. MySQL converts mappings to and from UCS-2. If client and server encoding are the same, there’s no conversion. If the conversion fails (ie, trying to convert to latin1), the character is converted to ? and you get mojibake.

You can set a my.cnf parameter, “skip-character-set-client-handshake”; this forces the use of the server-side charset (for the column(s) in question).
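
A minimal sketch of checking and setting the connection charset explicitly (my own, not from the talk):

SHOW VARIABLES LIKE 'character_set%';
SET NAMES sjis;   -- client sends and expects Shift_JIS for this connection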

Issues:

Unicode is supposed to support worldwide characters.

UCS-2 is 2-byte fixed length, takes 2^16 = 65,536 characters. This is one Basic Multilingual Plane (BMP). Some Japanese (and Chinese) characters are not covered by UCS-2. Windows Vista supports JIS X 0213:2004 as a standard character set in Japan (available for Windows XP with the right )

UCS-4 is 4-byte fixed length, can encode 2^31 characters (~2 billion) This covers many BMP’s (128?)

UTF-16 is 2 or 4 byte length, all UCS-2 are mapped to 2 bytes, not all UCS-4 characters are supported — 1 million are. Supported UCS-4 characters are mapped to 4 bytes

UTF-8 from 1-6 bytes is fully compliant with UCS-4. This is out of date. 1-4 byte UTF-8 is fully compliant with UTF-16. From 1-3 bytes, UTF-8 is compliant with UCS-2.

MySQL internally handles all characters as UCS-2; UCS-4 is not supported. This is not enough. Plus, UCS-2 is not supported for client encoding. UTF-8 support is up to 3 bytes — this is not just a MySQL problem though.

CREATE TABLE t1 (c1 VARCHAR(30)) CHARSET=utf8;
INSERT INTO T1 VALUES (0x6162F0A0808B63646566); -- this inserts ‘ab’, a 4-byte UTF-8 character, then ‘cdef’

SELECT c1,HEX(c1) from t1;
if you get ab,6162 back it means that the invalid character was truncated. MySQL does throw up a warning for this.

Possible workarounds — using VARBINARY/BLOB types. Can store any binary data, but this is always case-sensitive (and yes, Japanese characters do have case). FULLTEXT index is not supported, and application code may need to be modified to handle UTF-8 — ie, String.getBytes may need “UTF-8” parameter in it.

Alternatively, use UCS-2 for column encoding:

CREATE TABLE t1 (c1 VARCHAR(30)) CHARSET=ucs2;

INSERT INTO t1 VALUES (_utf8 0x6162F0A0808B63646566);

SELECT … now gives you ?? instead of truncating.

Another alternative: use Shift_JIS or EUC-JP. Code conversion of JIS X 0213:2004 characters is not currently supported.

Shift_JIS is the most widely used encoding, a 1- or 2-byte encoding. All ASCII and half-width katakana characters are 1 byte; the rest are 2 bytes. If the first byte value is between 0x00 and 0x7F it’s 1-byte ASCII; 0xA0 – 0xDF is 1-byte, half-width katakana; all the rest are 2-byte characters.

The 2nd byte might be in the ASCII graphic code area 0x40 for example.

0x5C is the escape character (backslash in the US). Some Shift_JIS characters contain 0x5C in the 2nd byte. If the charset is specified incorrectly, you’ll end up getting different values — for instance, hex value 0x5C6E will convert to hex value 0x0A. The backslash at the end of the string, hex value 0x5C, will be removed (truncated) if the charset is specified incorrectly.

Native MySQL does not support FULLTEXT search in Japanese, Korean and Chinese (CJK issue).

Japanese words do not delimit by space, so it can’t work. 2 ways to do this: dictionary based indexing, dividing words using a pre-installed dictionary. Also N-gram indexing — divide text by N letters (n could be 1, 2, 3 etc). MySQL + Senna implements this, supported by Sumisho Computer Systems.

There are too many Japanese characters to be able to use one byte to handle all of them.

Hiragana — over 50 characters

Katakana — over 50 characters

Kanji — over 6,000 characters

So the Japanese Character set has to be multi-byte. JIS=Japan Industrial Standard, this specifies it.

JIS X 0208 in 1990, updated in 1997 — covers widely used characters, not all characters
JIS X 0213 in 2000, updated in 2004

There are also vendor defined Japanese charsets — NEC Kanji and IBM Kanji — these supplement JIS X 0208.

Cellphone specific symbols have been introduced, so the # of characters is actually increasing!

For JIS X 0208, there are multiple encodings — Shift_JIS (all characters are 2 bytes), EUC-JP (most are 2 bytes, some are 3 bytes), and Unicode (all characters are 3 bytes, this makes people not want to use UTF-8 for ). Shift_JIS is most widely used, but they are moving to Unicode gradually (Vista is using UTF-8 as the standard now). Each code mapping is different, with different hex values for the same character in different encodings.

Similarly, there are different encodings for the other charsets.

MySQL supports only some of these. (get the graph from the slides)

char_length() returns the length by # of characters, length() returns the length by # of bytes.

The connection charset and the server charset have to match otherwise…mojibake!

Windows — Shift_JIS is standard encoding, linux EUC-JP is standard. So conversion may be needed.

MySQL Code Conversion algorithm — UCS-2 facilitates conversion between encodings. MySQL converts mappings to and from UCS-2. If client and server encoding are the same, there’s no conversion. If the conversion fails (ie, trying to convert to latin1), the character is converted to ? and you get mojibake.

You can set a my.cnf paramater for “skip-character-set-client-handshake”, this forces the use of the server side charset (for the column(s) in question).

Issues:

Unicode is supposed to support worldwide characters.

UCS-2 is 2-byte fixed length, so it holds 2^16 = 65,536 characters; this is one Basic Multilingual Plane (BMP). Some Japanese (and Chinese) characters are not covered by UCS-2. Windows Vista supports JIS X 0213:2004 as a standard character set in Japan (also available for Windows XP with the right add-on installed).

UCS-4 is 4-byte fixed length and can encode 2^31 characters (~2 billion); this covers many planes beyond the BMP.

UTF-16 is 2 or 4 bytes per character: all UCS-2 characters map to 2 bytes; not all UCS-4 characters are supported (about 1 million are), and the supported UCS-4 characters map to 4 bytes.

The original UTF-8 definition, 1–6 bytes per character, was fully compliant with UCS-4, but that definition is out of date. 1–4 byte UTF-8 is fully compliant with UTF-16, and 1–3 byte UTF-8 is compliant with UCS-2.

MySQL internally handles all characters as UCS-2; UCS-4 is not supported. This is not enough. Plus, UCS-2 is not supported as a client (connection) encoding. UTF-8 support goes up to 3 bytes per character — though this is not just a MySQL problem.

CREATE TABLE t1 (c1 VARCHAR(30)) CHARSET=utf8;
INSERT INTO t1 VALUES (0x6162F0A0808B63646566);  -- 'ab', then a 4-byte UTF-8 character, then 'cdef'

SELECT c1, HEX(c1) FROM t1;
If you get ab, 6162 back, the string was truncated at the 4-byte character. MySQL does throw a warning for this.

Possible workarounds — use VARBINARY/BLOB types. These can store any binary data, but comparison is always case-sensitive (and yes, Japanese characters do have case), FULLTEXT indexes are not supported, and application code may need to be modified to handle UTF-8 — e.g., String.getBytes may need a “UTF-8” parameter.
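A minimal sketch of that workaround (my example, not from the talk; t2 is just an illustrative table name), reusing the byte string from the example above:

CREATE TABLE t2 (c1 VARBINARY(30));
INSERT INTO t2 VALUES (0x6162F0A0808B63646566);  -- the 4-byte character is stored intact
SELECT HEX(c1) FROM t2;                          -- returns the full byte string, nothing truncated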

Alternatively, use UCS-2 for column encoding:

CREATE TABLE t1 (c1 VARCHAR(30)) CHARSET=ucs2;

INSERT INTO t1 VALUES (_utf8 0x6162F0A0808B63646566);

SELECT … now gives you ?? instead of truncating.

Another alternative: use Shift_JIS or EUC-JP. Code conversion of JIS X 0213:2004 characters is not currently supported.

Shift_JIS is the most widely used encoding; it is a 1- or 2-byte encoding. All ASCII and half-width katakana characters are 1 byte; the rest are 2 bytes. If the first byte value is between 0x00 and 0x7F, it is a 1-byte ASCII character; 0xA0–0xDF is 1-byte half-width katakana; all other lead bytes start 2-byte characters.

The 2nd byte of a 2-byte character may fall in the ASCII graphic code area (0x40, for example).

0x5C is the escape character (backslash in the US). Some Shift_JIS characters contain 0x5C in the 2nd byte. If the charset is specified incorrectly, you’ll end up getting different values — for instance, hex value 0x5C6E will convert to hex value 0x0A. A backslash at the end of the string (hex value 0x5C) will be removed (truncated) if the charset is specified incorrectly.
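A minimal sketch of that escape processing (my example, not from the talk):

SELECT HEX('\n');   -- the two bytes 0x5C 0x6E are parsed as the escape sequence \n and become 0A
SELECT HEX('\\n');  -- escaping the backslash preserves the two bytes: 5C6E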

Native MySQL does not support FULLTEXT search in Japanese, Korean and Chinese (CJK issue).

Japanese words are not delimited by spaces, so the standard FULLTEXT parser can’t work. There are 2 ways to handle this: dictionary-based indexing (dividing words using a pre-installed dictionary) and N-gram indexing (dividing text into N-letter chunks, where N could be 1, 2, 3, etc.). MySQL + Senna implements this, supported by Sumisho Computer Systems.

MySQL Security Talk slides

For those wanting the slides for “Testing the Security of Your Site”, they’re at:

http://www.sheeri.com/presentations/MySQLSecurity2007_04_24.pdf — 108 K PDF file

http://www.sheeri.com/presentations/MySQLSecurity2007_04_24.swf — 56 K Flash file

and some code:

For the UserAuth table I use in the example to test SQL injection (see slides):

CREATE TABLE UserAuth (
  userId INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
  uname VARCHAR(20) NOT NULL DEFAULT '' UNIQUE KEY,
  pass VARCHAR(32) NOT NULL DEFAULT ''
) ENGINE=INNODB DEFAULT CHARSET=UTF8;

Populate the table:

INSERT INTO UserAuth (uname) VALUES ('alef'),('bet'),('gimel'),('daled'),('hay'),('vav'),('zayin'),('chet'),('tet'),('yud'),('kaf'),('lamed'),('mem'),('nun'),('samech'),('ayin'),('pe'),('tsadik'),('kuf'),('resh'),('shin'),('tav');
UPDATE UserAuth SET pass=MD5(uname) WHERE 1=1;
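To see why the payloads below work, here is what the login query looks like once the first payload is spliced into the password (a sketch of a typical string-building query, my illustration, not from the slides):

-- the application builds: ... WHERE uname='<user input>' AND pass='<password input>'
-- with the password  anything' OR 'x'='x  that becomes:
SELECT userId FROM UserAuth WHERE uname='anything' AND pass='anything' OR 'x'='x';
-- AND binds tighter than OR, so the WHERE clause is true for every row and the check passes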

Test some SQL injection yourself:
go to Acunetix’s test site: http://testasp.acunetix.com/login.asp

Type any of the following as your password, with any user name:
anything' OR 'x'='x
anything' OR '1'='1
anything' OR 1=1
anything' OR 1/'0
anything' UNION SELECT 'a
anything'; SELECT * FROM Users; select '
1234' AND 1=0 UNION ALL SELECT 'admin', '81dc9bdb52d04dc20036dbd8313ed055

And perhaps some of the following:
ASCII/Unicode equivalents (CHAR(39) is a single quote)
Hex equivalents (0x27; e.g., SELECT 0x27726F6F7427 is 'root' in quotes)
-- (a double dash) for SQL comments

SQL Antipatterns — Bill Karwin

Well, I came late, so I missed the first one….so we start with #2

#2. Ambiguous GROUP BY —

query BUGS and include details on the corresponding PRODUCT rows —

SELECT b.bug_id, p.product_name FROM bugs b NATURAL JOIN bugs_products NATURAL JOIN products p GROUP BY b.bug_id;

We use the GROUP BY to get one row per bug, but then we lose information.

Standard SQL requires every column in the SELECT clause of a GROUP BY query to be either in the GROUP BY clause or inside an aggregate function. MySQL does not enforce this: if a column is in the SELECT clause but not in the GROUP BY clause, MySQL displays an arbitrary value from the group.

[My note, not said in the presentation: this fools people who want the groupwise maximum; they think that selecting multiple columns and grouping means they get some particular row.]

Solution 1: Restrict yourself to standard SQL — do not put columns in the SELECT clause that are not in the GROUP BY clause (or inside an aggregate function).
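MySQL can enforce this for you via the ONLY_FULL_GROUP_BY SQL mode (my note, not from the talk):

SET sql_mode = 'ONLY_FULL_GROUP_BY';
-- the ambiguous query above now returns an error instead of an arbitrary value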

Solution 2: Use GROUP_CONCAT() to get a comma-separated list of the values in each group.

SELECT b.bug_id, GROUP_CONCAT(p.product_name) AS product_names FROM bugs b NATURAL JOIN bugs_products NATURAL JOIN products p GROUP BY b.bug_id;

Performance: no worse than doing a regular group function because the concat happens after the grouping is done.

3. EAV Tables — Entity-Attribute-Value Tables.

Example: a product catalog with attributes, too many to use one column per attribute, and not every product has every attribute; e.g., DVDs don’t have pages and books don’t have a region encoding.

Most people make an “EAV” table that has the entity name, the attribute name, and the value. It associates the entity (say, “spiderman DVD”) with an attribute (“region encoding”) and a value (“region 1”).

Why is this bad? It’s harder to apply constraints, because the value column may hold many different types of values (e.g., # of pages should be a number, but region encoding is a string). This may be a sign of a bad data model.

So why is this bad?

EAV cannot require an attribute — with one column per attribute, you could specify NOT NULL (e.g., price). Well, you could try to enforce it with triggers, but MySQL does not raise errors or abort an operation that spawned a trigger — in other words, you can’t stop the row from being inserted; you can only have something happen when a row is inserted.

EAV cannot have referential integrity to multiple lookup tables, or only for some values.

It’s also expensive and complex to find all the attributes for one entity: to get one row that looks like normalized data, you need one join per attribute, and you may not even know how many attributes there are.
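A rough sketch of the shape being described (my example; the table and column names are made up, not from the talk):

CREATE TABLE eav (
  entity VARCHAR(50),       -- e.g. 'spiderman DVD'
  attribute VARCHAR(50),    -- e.g. 'region encoding'
  attr_value VARCHAR(255)   -- e.g. 'region 1'
);

-- Reassembling one "row" takes one self-join per attribute:
SELECT e1.entity,
       e1.attr_value AS price,
       e2.attr_value AS region_encoding
FROM eav e1
JOIN eav e2 ON e2.entity = e1.entity AND e2.attribute = 'region encoding'
WHERE e1.attribute = 'price';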

Solution: Try not to use EAV tables, defining your attributes in your data model (ie, one table per attribute type). If you do, application logic should enforce constraints. Don’t try to fetch attributes in a single row (that looks like normalized data); fetch multiple rows and use the application code to reconstruct the entity.

4. Letting users crash your server
Example: people request the ability to query the database flexibly, so the antipattern is to give them access to run their own SQL.

Solution: give an interface which allows parameters to queries. But watch out for SQL injection!

Filter input by escaping strings, or use parameterized queries.
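A minimal sketch of a parameterized query using MySQL server-side prepared statements (my example, not from the talk), against the bugs table from above:

PREPARE stmt FROM 'SELECT bug_id FROM bugs WHERE bug_id = ?';
SET @id = 1234;           -- user input is bound as a parameter, never spliced into the SQL text
EXECUTE stmt USING @id;
DEALLOCATE PREPARE stmt;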

6. Forcing primary keys to be contiguous

Example: managers don’t like gaps in invoice #’s. The antipattern is to try to reuse primary key values to fill in the gaps; another antipattern is to change values to close the gaps.

Solution — deal with it. Do not reuse primary keys. Also, do not use auto_increment surrogate keys for everything if you do not need to.

OurSQL Episode 13: The Nitty Gritty of Indexes

In this episode, we go through how a B-tree works. The next episode will use what we learn in this episode to explain why MySQL indexes work the way they do.

Direct play this episode at:
http://technocation.org/content/oursql-episode-13%3A-nitty-gritty-indexes-0

Download all podcasts at:
http://technocation.org/podcasts/oursql/

Subscribe to the podcast at:
http://feeds.feedburner.com/oursql

Register for the MySQL Conference now!:
http://www.mysqlconf.com

Quiz to receive a free certification voucher from Proven Scaling:
http://www.provenscaling.com/freecert

MySQL Full Reference Cards:
http://www.visibone.com/sql

About B-Trees:
http://www.semaphorecorp.com/btp/algo.html

http://perl.plover.com/BTree/article.txt

Feedback:

Email podcast@technocation.org

Call the comment line at +1 617-674-2369

Use Odeo to leave a voice mail through your computer:
http://odeo.com/sendmeamessage/Sheeri

Or use the Technocation forums:
http://technocation.org/forum
