Data Warehousing Tips and Tricks

It’s not easy to do a DW in MySQL — but it’s not impossible either. Easier to go to Teradata than to write your own.

DW characteristics:

1) Organic: evolves over time from OLTP systems. Issues: locking, large queries, # of users.

2) Starts as a copy of OLTP, but changes over time — schema evolution, replication lag, duplicate data issues

3) Custom — designed from the ground up for DW — issues with getting it started, growth, aggregations, backup.

4) How do you update the data in the warehouse? Write/update/read/delete, write/read/delete, or write-only, which means that rolling out old data requires partitions or merge tables.

The secret to DW is partitioning — can be based on:
data — date, groups like department, company, etc.
functional — sales, HR, etc.
random — hash, mod on a primary key.

You can partition:
manually — unions, application logic, etc.
using MERGE tables and MyISAM
MySQL 5.1 using partitions
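For example, a minimal sketch of 5.1 range partitioning on a date column (the table and column names here are made up for illustration); dropping a partition then purges a whole slice of data cheaply:

CREATE TABLE sales_fact (
  saleID BIGINT NOT NULL,
  saleDate DATE NOT NULL,
  totalsales DECIMAL(12,2)
) ENGINE=MyISAM
PARTITION BY RANGE (YEAR(saleDate)) (
  PARTITION p2006 VALUES LESS THAN (2007),
  PARTITION p2007 VALUES LESS THAN (2008),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);

ALTER TABLE sales_fact DROP PARTITION p2006;  -- roll out a whole year without touching the other rows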

You can load, back up, and purge by partition, so keep that logic intact: if it takes too much work to load a partition because you've grouped it oddly, then your partitioning scheme isn't so great.

Make sure your partitioning is flexible; you need to plan for growth from day 1. Don't just partition once and forget about it: make sure you can change the partitioning scheme without too much trouble. Hash and modulo partitioning aren't very flexible, because changing them means restructuring your data.

Use MyISAM for data warehousing: it's 3-4 times faster than InnoDB, the data is 2-3 times smaller, MyISAM table files can be easily copied from one server to another, MERGE tables are available only over MyISAM tables (scans are 10-15% faster with merge tables), and you can make read-only tables (compressed, with indexes) to reduce data size further, i.e., compress older data (from a year ago, or even a week ago if it doesn't change!).
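A sketch of the manual/MERGE approach, assuming hypothetical identical monthly MyISAM tables; older months can then be compressed with myisampack and treated as read-only:

CREATE TABLE sales_2008_01 (
  saleID BIGINT NOT NULL,
  saleDate DATE NOT NULL,
  totalsales DECIMAL(12,2),
  KEY (saleDate)
) ENGINE=MyISAM;
CREATE TABLE sales_2008_02 LIKE sales_2008_01;

CREATE TABLE sales_all (
  saleID BIGINT NOT NULL,
  saleDate DATE NOT NULL,
  totalsales DECIMAL(12,2),
  KEY (saleDate)
) ENGINE=MERGE UNION=(sales_2008_01, sales_2008_02) INSERT_METHOD=LAST;

ALTER TABLE sales_all UNION=(sales_2008_02);  -- rolling out a month is just a change to the UNION list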

Issues with using MyISAM for DW: table locking for high volumes of real-time data (concurrent inserts are allowed when there are ONLY insertions going on, not deletions). This is where partitioning comes in! REPAIR TABLE also takes a long time; it's better to back up frequently, saving tables, load sets and logs, and then do a point-in-time recovery instead of REPAIR TABLE. For a write-only DW, save your write loads and use them as part of your backup strategy.

Deletes will break concurrent inserts — delayed inserts still work, but they’re not as efficient. You also have to program that in, you can’t, say, replicate using INSERT DELAYED where the master had INSERT.

[Baron’s idea — take current data in InnoDB format, and UNION over other DW tables]

No index clustering for queries that need it — OPTIMIZE TABLE will fix this but it can take a long time to run.

When to use InnoDB — if you must have a high volume of realtime loads — InnoDB record locking is better.

InnoDB also makes sense if ALL of your queries can take advantage of index clustering, i.e. most or all queries access the data using the primary key (because all indexes are clustered together with the primary key, non-primary-key lookups are much faster than regular non-primary-key lookups in MyISAM). BUT this means you want to keep your primary keys small. Plus, the more indexes you have, the slower your inserts are, and more so because of the clustering.

MEMORY storage engine: Use it when you have smaller tables that aren’t updated very often; they’re faster and support hash indexes, which are better for doing single record lookups.

Store the data for the MEMORY engine twice, once in the MEMORY table and once in MyISAM or InnoDB, and add queries to the MySQL init script to copy the data from the disk tables to the MEMORY tables upon restart, using --init-file=<file_name>.
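A sketch of what that init file might contain (hypothetical lookup table names); note the explicit hash index on the MEMORY copy:

-- in the file named by --init-file, run at every server start
CREATE TABLE IF NOT EXISTS region_lookup_mem (
  regionID INT NOT NULL,
  region VARCHAR(50),
  PRIMARY KEY (regionID) USING HASH
) ENGINE=MEMORY;
DELETE FROM region_lookup_mem;
INSERT INTO region_lookup_mem SELECT regionID, region FROM region_lookup;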

ARCHIVE storage engine — use to store older data. More compression than compressed MyISAM, fast inserts, 5.1 supports limited indexes, good performance for full table scans.
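A sketch of rolling old data into ARCHIVE (table names hypothetical); remember that ARCHIVE only supports INSERT and SELECT, so it suits data that never changes:

CREATE TABLE sales_2005_archive (
  saleID BIGINT NOT NULL,
  saleDate DATE NOT NULL,
  totalsales DECIMAL(12,2)
) ENGINE=ARCHIVE;

INSERT INTO sales_2005_archive
SELECT saleID, saleDate, totalsales FROM sales_2005;
-- verify the row counts match, then drop or compress the original table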

Nitro Storage Engine — very high INSERT rates w/ simultaneous queries. Ultra high performance on aggregate operations on index values. Doesn’t require 64-bit server, runs well on 32-bit machines. High performance scans on temporal data, can use partial indexes in ways other engines can’t. http://www.nitrosecurity.com

InfoBright Storage Engine: the best compression of all storage engines (10:1 compression, peaks can be as high as 30:1), and it includes the equivalent of indexes for complex analysis queries. High batch load rates, up to 65GB per hour! Right now it's Windows only, with Linux and others to come. Very good performance for analysis-type queries, even working with >5TB of data. http://www.infobright.com

Backup — For small tables, just back up. Best option for large tables is copying the data files. If you have a write-only/roll out DB you only need to copy the newly added tables. So you don’t need to keep backing up the same data, just backup the new stuff. Or, just save the load sets. Just backup what changes, and partition smartly.
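One common way to get a consistent file copy of MyISAM tables is a brief global read lock around the copy; this is only a sketch of the idea, not a full backup script:

FLUSH TABLES WITH READ LOCK;
-- copy the .frm, .MYD and .MYI files of the newly added tables at the OS level
UNLOCK TABLES;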

Tips:
Use INSERT ... ON DUPLICATE KEY UPDATE to build aggregate tables when the base tables are very large and sorts go to disk, or when you need the aggregates in real time.
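A sketch, assuming a hypothetical daily aggregate table and a staging table holding the day's load set:

CREATE TABLE daily_sales (
  saleDay DATE NOT NULL,
  storeID INT NOT NULL,
  total DECIMAL(14,2) NOT NULL DEFAULT 0,
  PRIMARY KEY (saleDay, storeID)
) ENGINE=MyISAM;

-- fold today's load set into the aggregate without re-sorting the whole fact table
INSERT INTO daily_sales (saleDay, storeID, total)
SELECT DATE(saleTime), storeID, SUM(amount)
FROM sales_load
GROUP BY DATE(saleTime), storeID
ON DUPLICATE KEY UPDATE total = total + VALUES(total);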

Emulating Star Schema Optimization & Hash Joins: MySQL doesn't do these, except that MEMORY tables can use hash indexes. So use MEMORY tables with hash indexes and optimizer hints to manually do a star-schema-optimized hash join. Steps:

1) Create a query to filter the fact table
to select all sales from week 1-5 and display by region & store type:

SELECT D.week, S.totalsales, S.locationID, S.storeID
FROM sales S INNER JOIN date D USING (dateID)
WHERE D.week BETWEEN 1 AND 5;

Access only the tables you need for filtering the data, but select the foreign key IDs.

2) Join the result from step 1 with other facts/tables needed for the report

SELECT R.week, R.totalsales, L.region, ST.store_type
FROM (SELECT D.week, S.totalsales, S.locationID, S.storeID
      FROM sales S INNER JOIN date D USING (dateID)
      WHERE D.week BETWEEN 1 AND 5) AS R
INNER JOIN location AS L ON (L.locationID = R.locationID)
INNER JOIN store AS ST ON (ST.storeID = R.storeID);

3) Aggregate the results

SELECT R.week, L.region, ST.store_type, SUM(R.totalsales) AS totalsales
FROM (SELECT D.week, S.totalsales, S.locationID, S.storeID
      FROM sales S INNER JOIN date D USING (dateID)
      WHERE D.week BETWEEN 1 AND 5) AS R
INNER JOIN location AS L ON (L.locationID = R.locationID)
INNER JOIN store AS ST ON (ST.storeID = R.storeID)
GROUP BY R.week, L.region, ST.store_type;
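The part that emulates the hash join is the setup: copy the small dimension tables into MEMORY with hash indexes and join against those copies (adding STRAIGHT_JOIN if the optimizer picks a bad order). A sketch, using the hypothetical schema from the example above:

CREATE TABLE location_mem (
  locationID INT NOT NULL,
  region VARCHAR(50),
  PRIMARY KEY (locationID) USING HASH
) ENGINE=MEMORY;
INSERT INTO location_mem SELECT locationID, region FROM location;

CREATE TABLE store_mem (
  storeID INT NOT NULL,
  store_type VARCHAR(50),
  PRIMARY KEY (storeID) USING HASH
) ENGINE=MEMORY;
INSERT INTO store_mem SELECT storeID, store_type FROM store;

In steps 2 and 3, join R to location_mem and store_mem instead of the disk-based dimension tables; each lookup then hits an in-memory hash index.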

Critical configuration options for DW — sort_buffer_size — used to do SELECT DISTINCT, GROUP BY, ORDER BY, UNION DISTINCT (or just UNION)

Watch the value of sort_merge_passes (more than 1 per second, or even 4-5 per minute, means the buffer is too small) to see if you need to increase sort_buffer_size. sort_buffer_size is a PER-CONNECTION parameter, so don't be too greedy… but it can also be increased dynamically before running a large query, and reduced afterwards.
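A sketch of the per-session bump (the size and the query are arbitrary examples):

SET SESSION sort_buffer_size = 256 * 1024 * 1024;
SELECT locationID, SUM(totalsales) FROM sales GROUP BY locationID ORDER BY SUM(totalsales) DESC;
SET SESSION sort_buffer_size = DEFAULT;
SHOW GLOBAL STATUS LIKE 'Sort_merge_passes';  -- watch whether this keeps climbing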

key_buffer_size: use multiple key buffer caches. Use different caches for hot, warm & cold indexes. Preload your key caches at server startup. Try to use 1/4 of memory (up to 4G per key buffer) for your total key buffer space. Monitor the cache hit rate by watching:

Read miss rate = Key_reads / Key_read_requests (the read hit rate is 1 minus this)
Write miss rate = Key_writes / Key_write_requests (likewise for writes)
Key_reads & Key_writes per second are also important.
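The raw counters come from the status variables; a quick way to check them (the ratios above are then computed by hand or by a monitoring script):

SHOW GLOBAL STATUS LIKE 'Key_read%';
SHOW GLOBAL STATUS LIKE 'Key_write%';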

hot_cache.key_buffer_size = 1G
fred.key_buffer_size = 1G
fred.key_cache_division_limit = 80
key_buffer_size = 2G
key_cache_division_limit = 60
init-file = my_init_file.sql

in the init file:

CACHE INDEX T1, T2, T3 INDEX (I1, I2) IN hot_cache;
CACHE INDEX T4, T5, T3 INDEX (I3, I4) IN fred;
LOAD INDEX INTO CACHE T1 IGNORE LEAVES, T3 IGNORE LEAVES; -- use IGNORE LEAVES when the cache isn't big enough to hold the whole index
LOAD INDEX INTO CACHE T10, T11, T2, T4, T5;

http://dev.mysql.com/doc/refman/5.0/en/myisam-key-cache.html

This was implemented in MySQL 4.1.1

Temporary table sizes: monitor Created_tmp_disk_tables. More than a few per minute is bad; even one a minute could be bad, depending on the query. Temporary tables start in memory and then go to disk… increase tmp_table_size and max_heap_table_size. This can be done per session, for queries that need more than 64MB or so of space.
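A per-session sketch (sizes and the query are arbitrary examples); raise both variables, since the smaller of the two is the effective in-memory limit:

SET SESSION tmp_table_size = 256 * 1024 * 1024;
SET SESSION max_heap_table_size = 256 * 1024 * 1024;
SELECT locationID, COUNT(DISTINCT storeID) FROM sales GROUP BY locationID;
SHOW SESSION STATUS LIKE 'Created_tmp_disk_tables';  -- did it still spill to disk?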

ALWAYS turn on the slow query log! Keep a few rotations of the logs, and use mysqldumpslow to analyze the queries daily. Best to have an automated script run mysqldumpslow and e-mail a report with the 10-25 worst queries.

Turn on log_queries_not_using_indexes, unless your DW is designed to use explicit full table scans.

Learn what the explain plan output means & how the optimizer works:
http://forge.mysql.com/wiki/MySQL_Internals_Optimizer
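For example, prefixing the step 1 query with EXPLAIN shows whether the week filter uses an index on the date table and whether the fact table is joined via dateID; "Using temporary" or "Using filesort" in the Extra column are the things to watch for:

EXPLAIN
SELECT D.week, S.totalsales, S.locationID, S.storeID
FROM sales S INNER JOIN date D USING (dateID)
WHERE D.week BETWEEN 1 AND 5;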

Other key status variables to watch:
Select_scan: full scans of the first table in a join
Select_full_join: # of joins doing a full table scan because they don't use indexes
Sort_scan: # of sorts done by scanning the table
Table_locks_waited
Uptime
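These can also be pulled from SQL in one shot (the mysqladmin command below does the same thing continuously):

SHOW GLOBAL STATUS WHERE Variable_name IN
  ('Select_scan', 'Select_full_join', 'Sort_scan', 'Table_locks_waited', 'Uptime');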

mysqladmin extended:
mysqladmin -u user -ppasswd ext -i60 -r | tee stats.log | grep -v ' 0 '

(runs every 60 seconds, displays only the status variables that changed in the last interval, and logs the full relative status to stats.log each time).
