In short:
Register / see who’s coming
Schedule (will be filled in with presentations before Saturday noon)
Session ideas (45-minute sessions)
Friday, October 15th – 6-10 pm, WorkBar Boston, 711 Atlantic Ave, Boston, in the basement. Socializing, swag, raffles, dinner, beer and soft drinks. Take public transit (South Station on the Red Line subway or Silver Line bus if coming from the airport) or a cab; parking can be quite expensive in that area.
Saturday, October 16th – 8:30 am – 5 pm, MIT Stata Center 1st floor, 32 Vassar Street, Cambridge. Breakfast, lunch, tech presentations. A short walk from the Kendall Square subway stop on the Red Line, or drive and park in any MIT lot — even if a lot says permit parking only, that does not apply on the weekends.
Sunday, October 17th – same as Saturday
—–
The longer form:
As many of you know, OpenSQLCamp Boston kicks off tomorrow night with a social event at WorkBar Boston from 6-10 pm, and will include a buffet dinner from the Pulse Cafe. Even though WorkBar Boston is more “work” than “bar” – it is a coworking space – there will be beer as well (special thanks to IOUG for sponsoring this event in particular). Whether or not you are drinking, I strongly recommend taking public transit or a taxi — the location is across the street from South Station, a major bus and train hub. South Station is on the Red Line of the subway, and there is also a Silver Line bus directly from the airport terminals. The subway and Silver Line fare is $2.00.
Make sure to get sleep, because Saturday starts at 8:30 am at the MIT Stata Center, 32 Vassar Street. We start with breakfast, and then after a few introductory remarks we start making the schedule at 9:30 am. Then there are three 45-minute sessions, with lunch at 1 pm, a panel on indexing from 2-3 pm, and 3-5 pm is open time to ask questions, work on projects that were discussed during the day, and otherwise hack during the hackathon.
OpenSQLCamp does not provide dinner, but usually at the end of the day people figure out where they want to go next, and we all go over to a bar or restaurant (or go to a few different ones depending on people’s preferences and tastes).
Sunday is the same schedule as Saturday, except there is an extra session slot because we do not need opening remarks and the planning session.
Here’s the detail of food, for those who are wondering:
Friday night catered by Pulse Cafe
Vegan and vegetarian appetizers, wraps (incl. vegan), salad, vegetarian lasagna. Beer, soda, water, iced tea.
Saturday and Sunday breakfast catered by Panera Bread
Fruit, bagels, pastries, hot egg & cheese and ham, egg & cheese sandwiches, coffee, tea.
Saturday lunch catered by Greek Corner
Hummus, grape leaves, pastitsio, falafel, gyros, Greek salad (feta on the side)
Sunday lunch catered by Pita Pit
Assorted pitas including meat, vegetarian and vegan options.
I am very excited, and can’t wait to see you there!
I try to eliminate warnings and errors in the error log as much as possible – not just fixing actual issues, but also to eliminate the noise in the error log. As I have been doing 5.1 upgrades lately, I have been seeing a lot of the following errors in the error log when upgrading:
[Warning] '--log_slow_queries' is deprecated and will be removed in a future release. Please use '--slow_query_log'/'--slow_query_log_file' instead.
And every time that happens I have to remember the proper syntax for slow query logging. Finally I just wrote it in my ongoing cheat sheet, a short text document that I keep on my desktop; items rotate in and out of it frequently.
Without further ado, here is my cheat sheet for upgrading your slow query log from MySQL 5.0 to MySQL 5.1:
change
log-slow-queries=/var/log/mysqld/log-slow-queries.log
to
log_output=FILE # valid values are FILE, TABLE, FILE,TABLE or NONE
slow-query-log=1
slow_query_log_file=/var/log/mysqld/log-slow-queries.log
Note that setting log_output will also set the general_log output. It does not change anything about error logging.
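The cheat sheet above covers my.cnf; the same change can also be made at runtime without a restart. A minimal sketch, assuming MySQL 5.1 and a user with the SUPER privilege:

```sql
-- Switch slow query logging on dynamically; these are global
-- variables, so the change takes effect without a restart.
SET GLOBAL log_output = 'FILE';
SET GLOBAL slow_query_log_file = '/var/log/mysqld/log-slow-queries.log';
SET GLOBAL slow_query_log = 1;

-- Sanity check that the settings took effect:
SHOW GLOBAL VARIABLES LIKE 'slow_query_log%';
```

Remember that a runtime change does not survive a server restart, so the my.cnf change is still needed.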
Sarah Sproehnle is an excellent speaker, so I was excited to see her speak about Hadoop, which I know very little about…so here’s the liveblogging:
Hadoop is an “open source framework for storing and processing data on a cluster of computers.” Processing is key — analysis of data.
Today marks my last day at Pythian. I have been at Pythian for almost three years. In those three years, Pythian’s already thriving MySQL practice has grown even more. I have worked with big and small clients alike, across many industries, managed a team of up to 4 DBAs, and learned a lot not just about MySQL, but what my goals are in general.
Though I am leaving, everything I said in the blog post I made when I announced I was coming to Pythian still holds true. Pythian is a challenging environment, and I would recommend that anyone who finds their current DBA environment boring come to Pythian and experience what it is like to work here. I had lunch with Paul Vallee yesterday and we even discussed possible future collaborations (hence the title, a joke that I am “forking” off of Pythian).
So if it is so great, why am I leaving? It’s simple, really — Pythian is growing by leaps and bounds. I started when Pythian was about half the size it currently is. There is a lot of change happening within Pythian, and I believe it is very good change. However, I enjoyed the environment Pythian was when I started almost three years ago, and personally I am not ready to go with Pythian on the journey it is taking.
So where am I going next? For starters, I will take the month of August off paid work. I have an idea of where I might go for paid work in September, but you will have to watch Planet MySQL for the announcement. During August I will be doing some conference planning and organizing, for OpenSQLCamp in Boston in October first, and then for conferences in 2011. I will also be moving apartments, which is a big task. And I will be focusing on some personal goals, such as spending more time with my husband and becoming more active.
I am excited about having a month off, even though I have a lot to work on in that month.
At Kscope this year, I attended a half day in-depth session entitled Data Warehousing Performance Best Practices, given by Maria Colgan of Oracle. My impression, which was confirmed by folks in the Oracle world, is that she knows her way around the Oracle optimizer.
See part 1 for the introduction and talking about power and hardware. This part will go over the 2nd “P”, partitioning. Learning about Oracle’s partitioning has gotten me more interested in how MySQL’s partitioning works, and I do hope that MySQL partitioning will develop to the level that Oracle partitioning does, because Oracle’s partitioning looks very nice (then again, that’s why it costs so much I guess).
Partition – Larger tables or fact tables can benefit from partitioning because it makes data loading easier, can increase join performance, and enables data elimination. Parallel execution can be combined with partitioning thanks to partition pruning. The degree of parallelism should be a power of 2, because of the hash-based algorithm used in hash partitioning. To translate this to the MySQL world: if you are using LINEAR HASH partitioning, then you should use a number of partitions that is a power of 2 (I checked, and indeed LINEAR HASH uses a powers-of-two algorithm). Otherwise, use a degree of parallelism that makes sense given the number of partitions you have.
One important note: during Pythian’s testing of MySQL partitioning, we found that all partitions are locked when an INSERT occurs, for the duration of the INSERT. Bulk loading with MySQL partitioning is not as fast as it would be if MySQL allowed partition pruning for INSERTs.
So, what should be partitioned? For the first level of partitioning, the goal is to enable partition pruning and simplify data management. The most typical partitioning is range or interval partitioning on a date column. With interval partitioning, you specify the interval (day, month) and each partition is created automatically. MySQL does not have interval partitioning, and I have seen typical first-level partitioning be range or list based on a date or timestamp column. Note that if you use a timestamp field, the partitioning expression is optimized if you use TO_DAYS(timestamp_field) or YEAR(timestamp_field). In my experience, using anything else (such as DATE(timestamp_field)) actually makes partitioning slower than not using partitioning at all. Note that this is based on tests I did a few months ago, and your mileage may vary.
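To make this first level of partitioning concrete, here is a sketch in MySQL 5.1 syntax; the table and column names are made up for illustration, and I use a DATETIME column with TO_DAYS() to keep the expression prune-friendly:

```sql
-- Hypothetical fact table, range-partitioned by day so that
-- date-range queries only touch the relevant partitions.
CREATE TABLE fact_sales (
    sale_time DATETIME NOT NULL,
    store_id  INT NOT NULL,
    amount    DECIMAL(10,2) NOT NULL
)
PARTITION BY RANGE (TO_DAYS(sale_time)) (
    PARTITION p20101015 VALUES LESS THAN (TO_DAYS('2010-10-16')),
    PARTITION p20101016 VALUES LESS THAN (TO_DAYS('2010-10-17')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
);

-- EXPLAIN PARTITIONS shows whether pruning actually happens:
EXPLAIN PARTITIONS
SELECT SUM(amount) FROM fact_sales
 WHERE sale_time >= '2010-10-16' AND sale_time < '2010-10-17';
```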
So — how do you decide partitioning strategy? Ask yourself:
- What range of data do the queries touch – a quarter, a year?
- What is the data loading frequency?
- Is an incremental load required?
- How much data is involved, a day, a week, a month?
The answers to the above questions will tell you how big your interval needs to be. The best scenario is that all the answers are the same: “we load every day, and people query by day.” If the answers are different, weight query access as a higher priority than loading, because most people care more about query performance than the performance of ETL.
This is true even if your intervals end up different sizes (for example, sales per day are much bigger in December), and that’s OK. However, Maria recommends that the subpartitions be divided as evenly as possible.
It is easier to look at more partitions than at one partition that is too big, but you don’t want too many partitions either: the maximum Oracle allows is 1 million partitions (prior to 11g it was 64,000), and the advice was to “stick closer to 64,000 than 1 million”. MySQL’s limit is 1024 partitions per table.
For the second level of partitioning, also called subpartitioning, the goal is to allow for multi-level pruning and improve join performance. In Oracle, the most typical subpartition is hash or list – in MySQL, you can only subpartition by hash or key.
How do you decide subpartitioning strategy?
- Select the dimension queried most frequently on the fact table OR
- Pick the common join column
For example, if you want to look at sales per day, per store, you would choose “per day” as the partition and “per store” as the subpartition.
If you do not have a good partition on logical elements (like grouping), then you can subpartition using hash partitioning on common join columns — perhaps surrogate keys, or the join key of the largest table involved in the join.
For example, if the sales table is partitioned and another big table is product, you can hash subpartition product_id.
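In MySQL terms, the sales-per-day, per-store example might look like the following sketch (names are illustrative; note that MySQL only supports HASH or KEY subpartitioning, and only under RANGE or LIST partitions):

```sql
-- Day-level range partitions, each hash-subpartitioned by store,
-- so a query for one day and one store touches one subpartition.
CREATE TABLE sales (
    sale_date DATE NOT NULL,
    store_id  INT NOT NULL,
    amount    DECIMAL(10,2) NOT NULL
)
PARTITION BY RANGE (TO_DAYS(sale_date))
SUBPARTITION BY HASH (store_id)
SUBPARTITIONS 4 (
    PARTITION p201009 VALUES LESS THAN (TO_DAYS('2010-10-01')),
    PARTITION p201010 VALUES LESS THAN (TO_DAYS('2010-11-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);
```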
Because there is overhead in partitions (loading and reading metadata), make sure the size of each partition and subpartition is over 20 MB; it is better to have a 30 MB subpartition than a 15 MB one. [I have no idea if this is true in MySQL or not. I think the general concept is true, because there is some overhead, but I do not know why the 20 MB figure holds for Oracle, nor what the equivalent is in MySQL.]
One easy calculation for the number of hash buckets: double the number of CPUs and round up to the nearest power of 2. If you are executing in parallel, Oracle will use 2x the CPUs. (All this advice, by the way, follows the 80/20 rule; it is probably good for about 80% of the environments out there.) Of course, MySQL does not do parallel execution very well, so this probably does not apply.
Oracle knows it can get partition elimination while it does a join.
If 2 tables have the same degree of parallelism (same # of buckets) and are partitioned in the same way on the join column (say, customer_id in a subpartition of sales and a partition of customer), Oracle will match the partitions when joining:
sales table joined with customer table can change into 4 small joins:
sales sub part 1 joins with customer part 1
sales sub part 2 joins with customer part 2
sales sub part 3 joins with customer part 3
sales sub part 4 joins with customer part 4
And with parallelism, the total time is now reduced to the time it takes to do one of those smaller joins.
This is also why you want to have a power of 2 for buckets – because cores/processors come in powers of 2. Partition-wise joins like this can also be done with range or list, assuming both tables in the join have the same buckets.
I have no idea if MySQL partitioning works this way, but it’s certainly a functionality that makes sense to me.