Data Warehousing Best Practices: Comparing Oracle to MySQL, part 1 (introduction and power)

At Kscope this year, I attended a half day in-depth session entitled Data Warehousing Performance Best Practices, given by Maria Colgan of Oracle. My impression, which was confirmed by folks in the Oracle world, is that she knows her way around the Oracle optimizer.

These are my notes from the session, which include comparisons of how Oracle works (which Maria gave) and how MySQL works (which I researched to figure out the difference, which is why this blog post took a month after the conference to write). Note that I am not an expert on data warehousing in either Oracle or MySQL, so these are more concepts to think about than hard-and-fast advice. In some places, I still have questions, and I am happy to have folks comment and contribute what they know.

One interesting point brought up:
Maria quoted someone (she said the name but I did not grab it) from Forrester saying, “3NF is typically a selfless model used by Enterprise data warehouse, which is used by the whole company. A star schema is a selfish model, used by a department, because it’s already got aggregation in it.”

I thought that was an interesting way of putting it — most people do not understand why 3NF is not good enough for data warehousing, and I have had a hard time explaining why a star or snowflake schema should be used. Another schema-related topic I had a hard time putting into words before this workshop was the difference between a star and a snowflake schema: in a star schema, each dimension is a single denormalized table hanging off the fact table, while in a snowflake schema the dimension tables are normalized further into multiple related tables (including dimensions that may not be used very often).
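
To make the star/snowflake distinction concrete, here is a toy sketch (my own illustration, not from the session) of a minimal star schema. The category column stays denormalized inside dim_product; a snowflake design would move it out into its own table:

CREATE TABLE dim_date (
  date_id    int NOT NULL PRIMARY KEY,
  full_date  date NOT NULL,
  month_name varchar(10) NOT NULL,
  year       smallint NOT NULL
) ENGINE=InnoDB;

CREATE TABLE dim_product (
  product_id int NOT NULL PRIMARY KEY,
  name       varchar(100) NOT NULL,
  category   varchar(50) NOT NULL -- denormalized here (star); a snowflake would split this out
) ENGINE=InnoDB;

CREATE TABLE fact_sales (
  date_id    int NOT NULL,          -- points to dim_date
  product_id int NOT NULL,          -- points to dim_product
  units_sold int NOT NULL,
  revenue    decimal(12,2) NOT NULL,
  PRIMARY KEY (date_id, product_id)
) ENGINE=InnoDB;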

From Maria and the slides:
“Oracle says model what will suit your business best. Don’t get lost in academia. Most schemas are not 100% according to the theoretical models. Some examples: 3NF schema with denormalized attributes to avoid costly joins, Star schema with multiple hierarchies in same fact table.”

Data warehousing has a 3-step approach —

1) data sources -> staging layer (temp loading layer)
2) staging layer (temp loading layer)-> foundation (logical, data store) layer
3) foundation (logical, data store) layer -> access and performance layer

The foundation layer is usually 3NF; the access layer is usually a star or snowflake schema. As for the data sources, they can be varied. You would hope that they are in 3NF (and if they are, you can skip the first two steps), but they are not always that way.

The 3 P’s of best practice for data warehousing (on Oracle) are power, partitioning and parallelism. The goal of the data warehousing environment is to minimize the amount of data accessed and to use the most efficient joins, so it is not as index-focused as OLTP. This may be based on Oracle’s way of doing joins; I am not sure whether it applies to MySQL as well.

Power

The weakest link in the chain (the 3 steps above) will define the throughput, so make sure your hardware configuration is balanced. Maria mentioned that as DBAs, “most of the time we don’t have control over this, but we’re still bound to the SLAs.”

This includes hardware that immediately comes to mind, such as the number of CPUs/cores, CPU speed, amount of RAM and disk speed, as well as what we may not think of immediately: the speed of network switches, the speed of disk controllers, and the number and speed of host bus adapters (HBAs). Notes on HBAs: know how many HBA ports you have. A 4 Gbit HBA does about 400 MB/sec; a 2 Gbit HBA does about 200 MB/sec. Make sure there is enough HBA capacity to sustain the CPU throughput (i.e., make sure the HBA is not the bottleneck). The same goes for everything the data passes through: if you have a 4 Gbit machine but a 2 Gbit switch, you end up with 2 Gbit throughput. Upgrade the network at the same time you upgrade machines.

Because we are talking about data warehousing, it is often not possible to eliminate disk I/O, so the goal is to have the fastest I/O throughput possible. Data warehouses need to be sized on I/O throughput, not on the number of I/Os.

I made a post earlier about how to determine I/O throughput for a system, which used information from this session. Justin Swanhart already pointed out that this emphasis comes from the fact that Oracle can do hash joins while MySQL can only do nested-loop joins. I wonder, though, whether there really is no case in MySQL where I/O throughput is a more useful metric than IOPS.

Expensive disk arrays are usually sized for IOPS, not throughput, and because they are expensive, the array is typically shared throughout the company. A DBA needs to ask: how many connections into the storage array do I have? How many disk controllers do I have? Where are my physical disks, and which controllers are they hanging off of?

A typical 15k rpm disk can do about 25-35 MB/sec (per disk) of random I/O. Disk manufacturers will throw out numbers like 200-300 MB/sec, but that is sequential I/O on the outer (leading) edge of the drive. Make sure all your LUNs are not coming off the same set of disks, so that you are not contending on disk seeks.

Continue to part 2, partitioning.

Determining I/O throughput for a system

At Kscope this year, I attended a half day in-depth session entitled Data Warehousing Performance Best Practices, given by Maria Colgan of Oracle. In that session, there was a section on how to determine I/O throughput for a system, because in data warehousing I/O per second (iops) is less important than I/O throughput (how much actual data goes through, not just how many reads/writes).

The section contained an Oracle-specific in-database tool, and a standalone tool that can be used on many operating systems, regardless of whether or not a database exists:

If Oracle is installed, run DBMS_RESOURCE_MANAGER.CALIBRATE_IO:

SET SERVEROUTPUT ON
DECLARE
lat INTEGER;
iops INTEGER;
mbps INTEGER;
BEGIN
-- DBMS_RESOURCE_MANAGER.CALIBRATE_IO(<DISKS>, <MAX_LATENCY>,iops,mbps,lat);
DBMS_RESOURCE_MANAGER.CALIBRATE_IO (2, 10, iops, mbps, lat);
dbms_output.put_line('max_mbps = ' || mbps);
end;
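
The slides did not cover where these results end up; if I remember correctly, past calibration runs are also recorded in the DBA_RSRC_IO_CALIBRATE view, so a query like the following (my addition, not from the session, and assuming 11g) should show them:

SELECT max_iops, max_mbps, latency
FROM DBA_RSRC_IO_CALIBRATE;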

For us MySQL folks, or even the Drizzle or NoSQL folks, Oracle offers a free standalone tool called Orion. The example given in the slides was:

./orion -run advanced -testname mytest -num_small 0 -size_large 1024 -type rand -simulate concat -write 0 -duration 60 -matrix column

-num_small is 0 because you don’t usually do small transactions in a dw.
-type rand for random I/O’s because data warehouse queries usually don’t do sequential reads
-write 0 – no writes, because you do not write often to the dw, that is what the ETL is for.
-duration is in seconds
-matrix column shows you how much you can sustain

I would be interested to see how other folks measure I/O throughput, and maybe even do a side-by-side comparison of different tools. Orion is available for:

Linux (x86, x86-64, Itanium, Power)
Solaris (SPARC64)
AIX (PPC64)
zLinux
HPUX (PA RISC, Itanium)
Windows

I am working on a larger write-up of the session itself, which had many concise descriptions of data warehousing issues, but I thought that this merited its own post.

Online Verification That Master and Slaves are in Sync

In October 2008, Baron posted How to Check MySQL Replication Integrity Continually. Here at Pythian we have developed a method based on that post, and added “verifying that masters and slaves are in sync” to our standard battery of tests.

We call it “Continual replication sync checking”. This article will explain how it works, how to test and make the procedure non-blocking, benchmarks from the real world, issues we encountered along the way, and finally Pythian’s procedure of setting up continual replication sync in a new environment.

At the 2010 MySQL User Conference & Expo, my co-worker Danil Zburivsky did a presentation about this, and you can get the slides in ODP (Open Office) format or watch the 46-minute video on YouTube.

How it works

On the master, mk-table-checksum is run. In order to make the checksum operation online, we use the modulo and offset features of mk-table-checksum to checksum only part of the data at a time. The checksum is run (from cron) on the master and replicates to the slave. The results are captured in a result table, and a separate process checks the result table and notifies us of any discrepancies.

Testing resource usage for non-blocking replication sync checking

Each environment has a different data size, different resources, and a different level of what “intrusive” means. We start with a baseline of the size of the database, which we get from:

SELECT SUM(INDEX_LENGTH+DATA_LENGTH)/1024/1024/1024 as sizeGb
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE='BASE TABLE';

I included INDEX_LENGTH because the queries that generate the checksum information may or may not depend on indexes. Note that for InnoDB tables, the metadata is approximate, so this will not necessarily be the exact size of the database, but it will be in the right range. In general, running this query in the INFORMATION_SCHEMA did not consume too many resources, though in many cases the query took a few minutes to complete. On databases with several thousand tables, this query can take hours to complete.

Of course there is the caveat that it may take a VERY long time to run this in your environment, and some folks have reported crashing (this is VERY rare; I personally have not seen it even with the high-demand clients we have had, but others have reported it). So if you prefer to be on the safe side, you could look at the size of your ibdata files (whether or not you are using innodb_file_per_table) plus the sizes of all the other tables on disk (MyISAM, CSV, etc). That will give you a sense of how big the data is, although the size of the files on disk also includes any fragmentation. In practice, I have seen defragmenting reduce the size of a large database by 40%, so be aware that neither of these methods is perfect.

Once we get the data size in gigabytes, we can come up with a modulo to test. In general, we have found that mk-table-checksum is quick enough to be non-blocking when we use a modulo value that corresponds to approximately 1 Gb of data, and a chunk-size of 100. We actually started out using a chunk-size of 1000, but locking 1000 rows to compute a checksum proved to take longer than was acceptable for clients with lots of transactions — the problem was not the locking on the master, the problem was that it created slave lag on the slaves. We tried using the --sleep option to mk-table-checksum, which worked, but making the chunk-size smaller caused less slave lag than adding a sleep parameter.
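
As a starting point, here is a rough sketch (my own, not part of our standard procedure) that turns the size query from above into a modulo that keeps each run at approximately 1 Gb:

SELECT CEIL(SUM(INDEX_LENGTH+DATA_LENGTH)/1024/1024/1024) AS suggested_modulo
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE='BASE TABLE';

We then round that number to something that divides evenly into the schedule we want, as described in the next section.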

Another issue when using chunk-size is that it requires a numeric (but non-decimal) index. We ran across clients using VARCHAR(32) as a PRIMARY key (yes, on InnoDB, and yes, performance was an issue there), or tables that did not have any indexes at all (such as a logging table). If mk-table-checksum does not find a suitable index it will just do a checksum on the whole table.

To find tables that would be problematic, here is an INFORMATION_SCHEMA query that can be run (again, the usual caveats about INFORMATION_SCHEMA apply):

SELECT CONCAT("SHOW CREATE TABLE ", table_schema,".",table_name,"\\G")
FROM TABLES
LEFT JOIN (
     SELECT distinct table_schema,table_name
     FROM STATISTICS INNER JOIN COLUMNS USING
       (table_schema,table_name,column_name)
     WHERE (data_type like '%int' or data_type like 'date%'
     or data_type like 'time%' or data_type='float') and seq_in_index=1
     ) AS good_for_checksum
USING (table_schema,table_name)
WHERE good_for_checksum.table_name IS NULL
and table_schema not in ('information_schema');

Frequency of checksum runs


The frequency at which we run the checksum is also very flexible, so we take the size and translate it into a modulo that divides evenly into a time-based schedule. For example, on a server that reported 113 Gb in size from the INFORMATION_SCHEMA query above, we set the modulo to 120. The checksum took 10 minutes and 6 seconds from the time the master started to the time the slave finished. There was no excessive slave lag, and other resource checks, including application response time for non-administrative queries, showed that this test was acceptable.

Given a modulo of 120 that takes about 10 minutes to run and the environment, we decided to run the checksum 9 times per day (hourly for 9 hours during off-peak time). This resulted in the entire data set being checked during a period of just under 2 weeks (120 parts / 9 times per day = 13.333 days).

This means that if there is a data discrepancy, it is discovered within 2 weeks in this particular environment. Though that is not ideal, it is much better than not discovering data discrepancies at all, which is how replication currently works.

Benchmarks From the Real World

The first row in the table below is the example we just went through. The subsequent rows of the table are results from some of the other production environments we run the checksum in. As you can see, we try to keep the data checked at one time to about 1 Gb.



Total Data Size | Modulo | Test time        | Checksum Rate                                  | Frequency | Period
113 Gb          | 120    | 10 min 6 seconds | 1.59 Mb / sec                                  | 9x / day  | 2 weeks
9 Gb            | 21     | 51 seconds       | 8.4 Mb / sec                                   | 3x / day  | 1 week
29 Gb           | 21     | 9 min 16 seconds | 2.6 Mb / sec                                   | 3x / day  | 1 week
70 Gb           | 21     | 28 seconds       | 2650 Mb / sec (data was freshly defragmented!) | 3x / day  | 1 week
5.1 Gb          | 21     | 4 min 22 sec     | 0.958 Mb / sec                                 | 3x / day  | 1 week
314.5 Gb        | 336    | 36 min 3 seconds | 0.44 Mb / sec                                  | 16x / day | 3 weeks

In all of these environments, slave lag was 10 seconds or less at any given point on the slaves.

Issues Encountered


Some of the issues we encountered have workarounds, so I wanted to discuss and explain them here before giving our procedure, which contains the workarounds.

  1. Bug 304 – mk-table-checksum deletes all prior checksum results in the result table on every run. There is no fix for this yet, but if you are using mk-table-checksum only for the procedure described in this article (and in particular are not using the --resume-replication option), you can comment out the following code from mk-table-checksum:

    # Clean out the replication table entry for this table.
    if ( (my $replicate_table = $final_o->get('replicate'))
         && !$final_o->get('explain') ) {
       use_repl_db(%args); # USE the proper replicate db
       my $del_sql = "DELETE FROM $replicate_table WHERE db=? AND tbl=?";
       MKDEBUG && _d($dbh, $del_sql, $db, $table->{table});
       $dbh->do($del_sql, {}, $db, $table->{table});
    }

    It is in different places in different versions, but last I checked, searching for "DELETE FROM" in mk-table-checksum only matched three lines of code, and it was pretty clear (due to the inline comment) which block of code to delete. The block shown above is from lines 5154 – 5161 in mk-table-checksum changeset 6647.

  2. Running the checksum may cause “statement not safe for replication” errors, especially in 5.1. This is usually OK to ignore: mk-table-checksum works precisely because you can run the same command on the master and slave and get different results. In MySQL 5.1, CSV tables for the general and slow logs exist by default, even if they are not being written to, and “Statement is not safe to log in statement format” errors show up.

    You will need to redirect stderr and expect to see those statements in the MySQL error log. Note that mk-table-checksum works regardless of whether you are using statement-based, row-based or mixed replication.

  3. mk-table-checksum is not perfect, and sometimes shows false positives and false negatives. This is a hard problem to deal with, and we encourage making bug reports when they are found. However, I will note that if mk-table-checksum finds even one undetected data integrity issue, then it is useful, because right now there is no other way of detecting issues in an automated fashion. As more people use mk-table-checksum and help the developers figure out how to fix the false positives/false negatives, I am sure it will only get better.

Pythian’s procedure to set up continual replication sync

  1. Check to make sure all tables have appropriate indexes, as above. If they do not, the nibbling algorithm can be used, though as a caveat I have not tested nibbling with mk-table-checksum.

  2. Figure out the modulo value based on data size, as above.

  3. Decide which database the tables that mk-table-checksum uses should go in. We either use our monitoring database or a database called “maatkit”. Note that it is important to use a database that actually gets replicated!
  4. Get mk-table-checksum and comment out the lines that always delete from the replicated table, as above.
  5. Create and populate the table mk-table-checksum will need for the modulo value:
    CREATE TABLE IF NOT EXISTS `checksum_modulo` (
    `modulo_offset` smallint(5) unsigned NOT NULL default '0' primary key
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8;

    INSERT IGNORE INTO checksum_modulo (modulo_offset) VALUES (0);

  6. Do a test run of mk-table-checksum:

    perl mk-table-checksum -u avail --ask-pass --algorithm BIT_XOR \
    --replicate maatkit.checksum --create-replicate-table \
    --modulo 120 --chunk-size 100 \
    --offset 'modulo_offset FROM maatkit.checksum_modulo' localhost

    And update the modulo table (the modulus should match the --modulo value, 120 in this example):
    update maatkit.checksum_modulo set modulo_offset = ((modulo_offset+1)%120);

    And check the results on the slave:
    SELECT * FROM maatkit.checksum
    WHERE this_crc!=master_crc AND db not in ('maatkit','mysql');

  7. If the test run came out OK, set up a script to run via cron such as:

    # run checksum table 9 times per day, during off-peak times:
    30 0,1,2,3,4,5,21,22,23 * * * /home/pythian/bin/checksumtbls.sh >> /home/pythian/logs/checksum.log 2>&1

    And the checksumtbls.sh script looks like:

    #!/bin/sh

    DB=maatkit
    USER=maatkit_user
    PASS=password
    REPLTBL="$DB.checksum"
    MODULO=120
    CHUNKSIZE=100
    OFFSETTBL="checksum_modulo"
    ALGORITHM=BIT_XOR
    LOG=/home/pythian/logs/checksum.log

    /usr/bin/perl /home/pythian/bin/mk-table-checksum -u $USER -p $PASS \
    --modulo $MODULO --algorithm $ALGORITHM --chunk-size $CHUNKSIZE \
    --offset "modulo_offset FROM $DB.$OFFSETTBL" \
    --replicate $REPLTBL --create-replicate-table localhost >> $LOG

    /usr/bin/mysql -u $USER -p$PASS -e "update $DB.$OFFSETTBL set modulo_offset=((modulo_offset+1)%$MODULO)" >> $LOG

    And of course, do not forget to periodically check on the slave to see where the issues are:

    SELECT * FROM maatkit.checksum
    WHERE this_crc!=master_crc AND db not in ('maatkit','mysql');

    I hope this helps; it is extremely important to make sure that your masters and slaves really are in sync.

Webinar: What you need to know for a MySQL 5.0 -> 5.1 upgrade

IOUG has a free series of three webinars on upgrading MySQL. Each webinar is an hour long, and it starts with a webinar by me tomorrow at noon Central time (GMT-5) on “Why and How to Upgrade to MySQL 5.1”. The webinar assumes you are upgrading from MySQL 5.0 to MySQL 5.1, and talks a little bit about the new features, server variables, and what you need to know when upgrading to MySQL 5.1.

The software used is GoToWebinar (formerly GoToMeeting), so you will need to install that software. To register, use the links on the IOUG MySQL Upgrade Webinar Series page.

The complete list of webinars in the MySQL Upgrade Series is:
* MySQL 5.1: Why and How to Upgrade
Sheeri Cabral, The Pythian Group
Tuesday, July 27, 12:00 p.m. – 1:00 p.m. CT (GMT-5)

* MySQL Upgrades With No Downtime
Sean Hull, Heavyweight Internet Group
Wednesday, July 28, 12:00 p.m. – 1:00 p.m. CT (GMT-5)

* MySQL Upgrade Best Practices
Matt Yonkovit, Percona
Thursday, July 29, 12:00 p.m. – 1:00 p.m. CT (GMT-5)

(note, I am not sure if it is free for everyone or just free for IOUG members; my apologies if it is the latter)

Using MySQL Partitioning Instead of MERGE Tables

One common question I get is how to use partitioning instead of MERGE tables. The process I use involves using stored procedures to create and drop partitions. This article will go over the stored procedures I use; special thanks to Roland Bouman for taking a look and giving great feedback to optimize this process.

First, a simple table, not partitioned (yet):

use test;
DROP TABLE IF EXISTS my_part;
CREATE TABLE IF NOT EXISTS my_part (
  id int NOT NULL,
  creationDate datetime NOT NULL,
  PRIMARY KEY (id,creationDate)
) ENGINE=InnoDB;

In real life, there is more to the table than just id and creationDate. The most important part is that the partitioned field(s) need to be part of the primary key (in fact, of every unique key on the table).
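
For example (my own illustration, not from the original article), if the partitioning column is left out of the primary key, MySQL refuses the partitioning ALTER with an error along the lines of “ERROR 1503: A PRIMARY KEY must include all columns in the table’s partitioning function”:

CREATE TABLE my_part_bad (
  id int NOT NULL,
  creationDate datetime NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- this should fail with error 1503, because creationDate is not in the PRIMARY KEY
ALTER TABLE my_part_bad PARTITION BY RANGE (TO_DAYS(creationDate)) (
partition p0 values less than (to_days('2010-07-02'))
);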

Now, add the partition definition. This can be done in the CREATE statement, but I have found that it is easier for me to think about what fields and indexes I want first, and then worry about partitioning, so I naturally gravitate towards defining the fields and indexes first and then altering the table to add partitioning. This also helps me think about how to modify an existing table to add partitioning.

ALTER TABLE my_part PARTITION BY RANGE (TO_DAYS(creationDate)) (
partition 2010_07_01 values less than (to_days('2010-07-02')),
partition 2010_07_02 values less than (to_days('2010-07-03')),
partition 2010_07_03 values less than (to_days('2010-07-04')),
partition 2010_07_04 values less than (to_days('2010-07-05'))
);

This makes it pretty clear what is happening — the idea is to give the partition names actual dates that they hold, so that it is easy to see what partitions need to be added and deleted.
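
Since the naming convention does the bookkeeping for us, a quick INFORMATION_SCHEMA query (my addition) shows which date-named partitions exist and what range each one covers:

SELECT PARTITION_NAME, PARTITION_DESCRIPTION
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_SCHEMA='test' AND TABLE_NAME='my_part'
ORDER BY PARTITION_ORDINAL_POSITION;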

Deleting partitions

I find that stored procedures make things easy, so I will define a procedure called partition_drop to drop partitions. The partition_drop stored procedure takes in a table name and schema name to drop partitions from, and a date; all partitions up to and including that date are dropped. Here’s the procedure:

DELIMITER ||
DROP PROCEDURE IF EXISTS partition_drop ||

CREATE PROCEDURE partition_drop (IN through_date date, IN tbl varchar(64), IN db varchar(64))
BEGIN
DECLARE delete_me varchar(64);
DECLARE notfound BOOL DEFAULT FALSE;
DECLARE pname CURSOR FOR SELECT PARTITION_NAME
FROM INFORMATION_SCHEMA.PARTITIONS WHERE TABLE_NAME=tbl AND TABLE_SCHEMA=db
AND DATE(PARTITION_NAME)!= 0
AND DATE(PARTITION_NAME) IS NOT NULL
AND DATE(PARTITION_NAME)<=through_date;

DECLARE CONTINUE HANDLER FOR NOT FOUND SET notfound:=TRUE;
OPEN pname;

cursor_loop: LOOP
FETCH pname INTO delete_me;
IF notfound THEN LEAVE cursor_loop; END IF;
SET @alter_stmt:=CONCAT("ALTER TABLE ",db,".",tbl," DROP PARTITION ",delete_me);

# sanity check commented out for production use
# SELECT @alter_stmt;
PREPARE stmt_alter FROM @alter_stmt; EXECUTE stmt_alter; DEALLOCATE PREPARE stmt_alter;

END LOOP;
CLOSE pname;
END ||
DELIMITER ;

Go ahead and run CALL partition_drop('2010-07-02','my_part','test'); to verify that SHOW CREATE TABLE my_part; shows that the desired partitions have been dropped.

Adding partitions

Adding partitions is similar to deleting partitions — using a stored procedure that takes in the date to add partitions up to that date. It will not try to add so many partitions that the table will have more than 1024, and it won’t add any partitions that already exist.

DELIMITER ||
DROP PROCEDURE IF EXISTS partition_add ||

CREATE PROCEDURE partition_add (IN through_date date, IN tbl varchar(64), IN db varchar(64))
BEGIN
DECLARE add_me char(10);
DECLARE max_new_parts,add_cnt smallint unsigned default 0;
SELECT 1024-COUNT(*) AS max_new_parts,
SUM(CASE WHEN
 DATE(PARTITION_NAME)>=through_date then 1 else 0
 END)
INTO max_new_parts, add_cnt
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE  TABLE_SCHEMA = db
 AND TABLE_NAME = tbl;

IF add_cnt=0 THEN
BEGIN
SELECT MAX(DATE(PARTITION_NAME)) INTO add_me
FROM INFORMATION_SCHEMA.PARTITIONS WHERE TABLE_NAME=tbl and TABLE_SCHEMA=db
AND DATE(PARTITION_NAME)<through_date;

# to do: declare handler for exceptions here
IF DATEDIFF(through_date,add_me)+1 < max_new_parts THEN
BEGIN
WHILE add_me<through_date do BEGIN
SET add_me:=DATE_FORMAT(add_me + INTERVAL 1 DAY,"%Y_%m_%d");
SET @alter_stmt:=CONCAT("ALTER TABLE ",db,".",tbl," ADD PARTITION (PARTITION ",add_me," VALUES LESS THAN (TO_DAYS('",add_me+INTERVAL 1 DAY, "')))" );

PREPARE stmt_alter FROM @alter_stmt; EXECUTE stmt_alter; DEALLOCATE PREPARE stmt_alter;

END;
END WHILE;
END;
END IF;
END;
END IF;
END ||
DELIMITER ;

Here’s how to call that stored procedure:
CALL partition_add ('2010_07_10', 'my_part','test');

Caveat: This stored procedure will only add in partitions from the maximum partition name/date until the date you add it in, it will not fill in any gaps. However, it is possible to add in another input parameter to be the “start date” and change the stored procedure to add the partition if it does not exist, from the start date through the end date.

Note: It was pointed out by Roland Bouman that it would be better to change both the add and drop stored procedures to do one ALTER TABLE instead of sequential ones. He wrote:

It’s probably better to generate a single statement to drop / add all partitions.

So the general pattern would be:

– generate and concatenate lines for each partition
– one single sequence of prepare, execute, deallocate to execute one DDL statement.

For the drop partition procedure, this approach would allow you to do away with the cursor. You can simply use GROUP_CONCAT in a single SELECT…INTO statement to generate the entire DDL statement.
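
As a rough sketch of that suggestion (mine, and untested against the full procedure), the cursor in partition_drop could be replaced by a single SELECT ... INTO that builds one ALTER statement for all matching partitions:

-- @alter_stmt will be NULL if no partitions match, so check it before preparing
SELECT CONCAT('ALTER TABLE test.my_part DROP PARTITION ',
              GROUP_CONCAT(PARTITION_NAME))
INTO @alter_stmt
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_SCHEMA='test' AND TABLE_NAME='my_part'
AND DATE(PARTITION_NAME) IS NOT NULL
AND DATE(PARTITION_NAME)<='2010-07-02';

PREPARE stmt_alter FROM @alter_stmt;
EXECUTE stmt_alter;
DEALLOCATE PREPARE stmt_alter;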

Note: Roland also mentioned that these stored procedures could be generalized to use any interval, for example if each table partition held 7 days. The only limitation is that to get optimal partition performance with dates, you should use either TO_DAYS(date_field) or YEAR(date_field) as the partitioning function.

Putting it all together

The procedure: daily I would run the following in MySQL, to keep only the previous “x” days:

CALL partition_drop(CURRENT_DATE() - INTERVAL x DAY, 'tbl_name','schema_name');

To add new partitions, I would do:
CALL partition_add(CURRENT_DATE() + INTERVAL 32 DAY, 'tbl_name','schema_name');

It is run daily just in case; the side benefit is that it will catch any gaps if you do not modify the stored procedure to include a start date for when to add. Adding a partition does not take a long time, because there’s no data to reorganize.

I would make a daily MySQL event, and then everything is stored nicely in the database, and backed up with the database. Others may choose to run the stored procedures in a cron script or scheduled task (Windows), but that requires a password to be stored somewhere — either in the script, somewhere the script can read, or in a config file such as .my.cnf.
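
For completeness, here is a minimal sketch (my own, with a made-up event name and a 30-day retention window) of what such a daily event could look like; it assumes the event scheduler is enabled (event_scheduler=ON):

DELIMITER ||
CREATE EVENT IF NOT EXISTS maintain_my_part_partitions
ON SCHEDULE EVERY 1 DAY
DO BEGIN
  -- keep roughly the last 30 days, and pre-create partitions about a month ahead
  CALL test.partition_drop(CURRENT_DATE() - INTERVAL 30 DAY, 'my_part', 'test');
  CALL test.partition_add(CURRENT_DATE() + INTERVAL 32 DAY, 'my_part', 'test');
END ||
DELIMITER ;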

Three editions of MySQL are available

Yes, you read the title correctly — there are three editions of MySQL available, according to http://www.mysql.com/products/enterprise/server.html. Well, that page names two, and then of course there is the community edition….

From the manual page:

MySQL Enterprise Server is available in the following editions:

* MySQL Enterprise Server – Pro is the world’s most popular open source database that enables you to rapidly deliver high performance and scalable Online Transaction Processing (OLTP) applications.
* MySQL Enterprise Server – Advanced is the most comprehensive edition of MySQL. It provides all the benefits of MySQL Enterprise Server Pro and adds horizontal table and index partitioning for improving the performance and management of VLDBs (Very Large Databases).

How is “horizontal table and index partitioning” different from the regular partitioning available in MySQL 5.1?

Those of us that have been around for the past 3 or so years know that there was a point in time where there were two different editions of MySQL available, back when MySQL Enterprise and MySQL Community were actually different. But that experiment was a complete failure, and the code is now the same. MySQL Enterprise does package the software in a way that is not available to the community, specifically the quarterly service pack (QSP) releases. But the actual code….the same.

The pricing page at http://globalspecials.sun.com/store/mysql/ContentTheme/pbPage.categoryEnterprise shows that the Advanced server can be acquired for $3k (Gold) or $5k (Platinum) per year. The fee is worth it for the support MySQL will give you, but why is MySQL muddying the waters by having more “editions”, which very likely are not even different code?

(Special thanks to Aaron Macks for pointing out the existence of mysql-advanced, which was the impetus for this blog post.)

INFORMATION_SCHEMA tables are case sensitive

I wanted to get examples of some of the extra information that the Percona server provides in its INFORMATION_SCHEMA metadata, and in doing so, I stumbled across an interesting MySQL bug/feature/quirk — INFORMATION_SCHEMA table names (the tables are actually system views) are case sensitive when used in comparisons such as SHOW TABLES LIKE:

mysql> use information_schema;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql>  select @@version;
+--------------------+
| @@version          |
+--------------------+
| 5.1.36-xtradb6-log |
+--------------------+
1 row in set (0.00 sec)

mysql> use information_schema;
Database changed
mysql> show tables like 'innodb%';
Empty set (0.00 sec)

mysql>  show tables like 'INNODB%';
+----------------------------------------+
| Tables_in_information_schema (INNODB%) |
+----------------------------------------+
| INNODB_BUFFER_POOL_PAGES_INDEX         |
| INNODB_RSEG                            |
| INNODB_LOCKS                           |
| INNODB_BUFFER_POOL_PAGES               |
| INNODB_TRX                             |
| INNODB_INDEX_STATS                     |
| INNODB_LOCK_WAITS                      |
| INNODB_CMP_RESET                       |
| INNODB_CMP                             |
| INNODB_CMPMEM_RESET                    |
| INNODB_BUFFER_POOL_PAGES_BLOB          |
| INNODB_CMPMEM                          |
| INNODB_TABLE_STATS                     |
+----------------------------------------+
13 rows in set (0.00 sec)

It is not just for the new tables Percona has added:

mysql> show tables like 'table%';
Empty set (0.00 sec)

mysql> show tables like 'TABLE%';
+---------------------------------------+
| Tables_in_information_schema (TABLE%) |
+---------------------------------------+
| TABLES                                |
| TABLE_CONSTRAINTS                     |
| TABLE_PRIVILEGES                      |
+---------------------------------------+
3 rows in set (0.00 sec)

And it is not due to the collation:

mysql> show global variables like '%collat%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database   | latin1_swedish_ci |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)

mysql> show session variables like '%collat%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database   | utf8_general_ci   |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)
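
If you just want the matching tables regardless of case, one workaround (a sketch, not the only option) is to skip SHOW TABLES LIKE and filter the TABLES view with LOWER(), which forces a case-insensitive match no matter how the names are stored:

mysql> SELECT table_name
    ->   FROM information_schema.tables
    ->  WHERE table_schema = 'information_schema'
    ->    AND LOWER(table_name) LIKE 'innodb%';

I have not dug into exactly where the case sensitivity comes from, but wrapping the column in LOWER() sidesteps it.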

The XLDB4 Conference for Very Large Databases

Ronald saved me a post by giving his feedback on a few Oracle conferences that now have MySQL content.

My opinion is pretty much a summary of Ronald’s post, so I won’t repeat it here. Instead, I’ll post about a conference he did not cover: the 4th Extremely Large Databases Conference. I am particularly interested in any MySQL folks planning to attend (I would expect Tokutek to be represented, and maybe even the folks from Calpont, http://www.calpont.com/).

Most of this is directly from an e-mail I received from Jacek Becla, who had a keynote at the 2008 MySQL User Conference and Expo. If you also received this e-mail, please feel free to skip ahead to my viewpoints on the various Oracle conferences (or just skip altogether).

4th Extremely Large Databases (XLDB4) Conference
October 6-7, 2010
SLAC National Accelerator Laboratory
Menlo Park, California
http://www.xldb.org/4

XLDB, now including a conference open to all interested parties, is the gathering place for the community managing and analyzing data at extreme scale.

Leading practitioners from science, industry, and academia will discuss lessons learned and real-world solutions for handling terabytes and petabytes of data. Experts will present emerging trends that will significantly impact how extremely large databases are built and used.

More information, including the program, and online registration are available at http://www.xldb.org/4. Space is limited, so if you would like to attend, make sure to register soon. Early registration ($100) ends on July 31; late registration is $150.

If you are uncertain whether to attend, I would recommend reading the article recently written by Curt Monash:
http://www.dbms2.com/2010/07/01/why-you-should-go-to-xldb4/

For questions, please contact us by email using
xldb-admin at slac.stanford.edu or contact me [Jacek] directly at
becla at slac.stanford.edu.


ODTUG Kscope Wrap-up and Slides

Ronald Bradford and I produced a successful MySQL track at Kaleidoscope (hereinafter referred to as Kscope). With a speaker list of Philip Antoniades, Josh Sled and Craig Sylvester of Oracle, Laine Campbell of PalominoDB, Patrick Galbraith of NorthScale, Sarah Novotny of Blue Gecko, Padraig O’Sullivan of Akiban, Dossy Shiobara of Panoptic.com and Matt Yonkovit of Percona, we knew the technical content was going to be great.

As someone who’s helped organize all the OpenSQLCamps, a few MySQL Camps, and the Boston MySQL User Group, I know that participation at an event like this can be small. Despite planning the MySQL track at the last minute, we had top-notch speakers with content appropriate for the audience, which was mostly Oracle crossovers. We had several registrants who came solely for the MySQL content, and all but 2 of the 27 sessions had 10-25 audience members. According to a few different folks, that is about the same attendance as the SOA and BPM track gets, and that track was not planned at the last minute. The ODTUG conference committee and board were happy with the turnout as well. I can’t wait to see the results of the evaluations!


As someone who’s sat on not-for-profit boards in the past as well as organized events while being on the board, I know how crazy it can be to plan conferences, and I also know that there are some organizations and personalities that are difficult to work with. I am happy to report that ODTUG has been very welcoming, accommodating, and hands-off, letting us do what we need. When I spoke with Edward Roske, the conference chair for Kscope 2011, he said, “You know the needs for MySQL much better than I do, so just run the MySQL track as if you’re an executive of a corporation.” I know we are all worried about content being controlled by marketing folks, and I am very happy to report that ODTUG’s Kaleidoscope conference is of, by and for the people. There is one slot for a vendor presentation, which is clearly marked, and there was an expo hall with over 20 booths, so there are opportunities for marketing; but all in all this is a technical conference.

Slides from presentations:
From Ronald:
Increasing MySQL Productivity from Design to Implementation (3-hour presentation)

MySQL idiosyncrasies that bite

From Matt Yonkovit:
The Five Minute DBA

From me:
Importing and Exporting Data with MySQL
What do you mean, SQL syntax error? – a 90-minute look into how MySQL’s SQL extends and deviates from the ANSI/ISO SQL:2003 standard.

Stored Procedures & Functions, and Triggers and Views.

I also gave Jay Pipes’ Join-fu: the Art of SQL part one and Join-fu: the Art of SQL part two as Jay was unable to attend at the last minute.

My huge thanks to all the speakers and all the attendees; the MySQL track at Kaleidoscope was a success! I am already imagining what we can do next year in Long Beach, CA at the end of June.

MySQL’s SQL Deviations and Extensions

Today at Kaleidoscope I will be doing a 90-minute session comparing MySQL’s SQL syntax to the ANSI/ISO SQL:2003 standard, entitled What Do You Mean, “SQL Syntax Error”?

You can download the PDF slides now.
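
To give a flavor of the kind of deviation the session covers (this example is my own, not necessarily from the slides): in standard SQL, || is the string concatenation operator, but by default MySQL treats it as a logical OR unless you enable the PIPES_AS_CONCAT sql_mode:

mysql> SELECT 'a' || 'b';                        -- default: logical OR, yields 0
mysql> SET SESSION sql_mode = 'PIPES_AS_CONCAT';
mysql> SELECT 'a' || 'b';                        -- standard behavior: yields 'ab'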

For those who may be following along with the presentation later today (4 pm Eastern time), here are some links I may throw out during the session:
