Dinner tonight at OSCON

So, dinner tonight will start with appetizers at the Cantina restaurant at the DoubleTree Hotel. MySQL was nice enough to have an entire reception just for me… just kidding! However, that’s where I’ll be — since I told folks to watch this space, I wanted to make sure everyone knew where I’d be.

Also, I’d like to point out that there’s an obvious error in an algorithm on planetmysql.com, because it lists me as the #1 top poster.

PlanetMySQL dinner at OSCON, Tuesday July 25th

I will be free for dinner on Tuesday evening, July 25th, and I’d love it if you’d join me! Watch this space for details, although the update will be Tuesday, likely late afternoon.

(and of course let everyone know I’ll be doing my “So you’ve inherited a MySQL instance on unix” workshop — how a beginner can find out what’s on an instance, plus a security checklist for everyone — Wednesday at 4:30 pm.)

Partial Outer Join

I was wracking my brain for a bit, and could not figure out how to do a partial outer join where I was limiting rows from one table. So, I went to post to the MySQL user list — in writing my post, I found the answer. This often happens. It’s not that I don’t have questions to ask the MySQL user list (or forums), it’s just that most of the time, while I’m writing up the question, I find my answer.

First, I simplified my question. Many people do this, but they leave out important information. The trick is to come up with a simplification that you test to make sure it breaks in the same way the more complex, “real” query breaks.

The problem is basically this:
I want an outer join of a static table with a data table, such that I get a report of how many rows in the data table match the static table — including “0” if there are none, but for a subset of the data in the data table. Consider a billing report of how many invoices were generated per date, for a specified time period.

This should be easy, right? It was not very intuitive for me.

In this system, I’m using MySQL 4.1.19-standard-log, so stored procedures aren’t an option. Also, I’d like to avoid a loop — the standard way of doing this (the “developer” way) is to get all the values in the static table, do a count of rows in the dynamic table for each row in the static table, and then display. However, I wanted to do it in very few queries, and in such a way that the number of queries did not depend on the amount of static data. I call this doing it the “DBA way”.

First, the setup:

use test;

CREATE TABLE `dynamic` (
`id` int(10) unsigned NOT NULL auto_increment,
`joinme` char(1) default NULL,
PRIMARY KEY (`id`));

CREATE TABLE `static`
( `joinme` char(1) default NULL,
`name` char(5) default NULL);

insert into static (joinme) VALUES('a'),('b'),('c'),('d'),
('e'),('f'),('g'),('h'),('i'),
('j'),('k'),('l'),('m'),('n'),
('o'),('p'),('q'),('r'),('s'),
('t'),('u'),('v'),('w'),('x'),
('y'),('z');

update static set name=repeat(joinme,5);

insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;

First, a test to make sure everything’s OK:
# The following will give you the random distribution you have just made:
select count(id), joinme
from dynamic
group by joinme order by joinme;

# the result is <= 26 rows, depending on the random distribution.

# the following will show the distribution of all 26 names (aaaaa through zzzzz) represented in the dynamic table (well, all 26 joinme values, really, but that's irrelevant to the point)
SELECT s.name,COUNT(d.joinme)
FROM static s LEFT JOIN dynamic d USING (joinme)
GROUP BY name;

# and indeed, it gets the expected 26 rows, with 0 filled in for the count when there's no match in the data table.

So far, so good. Remember, I don’t want to join all of the entries in the dynamic table, because there could be millions of entries, and maybe I want to limit it to ids below a certain number. So what if I want the distribution of names given the first 15 ids?

Using:
SELECT s.name,COUNT(d.joinme)
FROM static s LEFT JOIN dynamic d USING (joinme)
WHERE d.id<15 group by name;

gives me the distribution, but only for the values that appear among those ids. So in one test, I get 13 rows, all with a count>0, instead of 26 rows, with some of them having a count of 0 and others having a count>0.

This breaks in the same way as my more complex query does.

How can I limit the values from one table and outer join them to another, retaining everything? This seems like a very simple outer join, but because I want to put limits, it gets tricky, and the intuitive stuff just does not work. I do not know if “partial outer join” is the right term, but it’s what I’ll use.

# UNION removes duplicate rows, but only if all of the column values match:
SELECT name,'' FROM static union SELECT s.name,COUNT(d.joinme) FROM static s LEFT JOIN dynamic d USING (joinme) where d.id<15 group by name order by name;
gives:
+-------+---+
| name | |
+-------+---+
| aaaaa | |
| aaaaa | 1 |
| bbbbb | |
| ccccc | |
| ccccc | 1 |
....

# similarly, using 0....:
SELECT name,0 FROM static union SELECT s.name,COUNT(d.joinme) FROM static s LEFT JOIN dynamic d USING (joinme) where d.id<15 group by name order by name;
+-------+---+
| name | 0 |
+-------+---+
| aaaaa | 0 |
| aaaaa | 1 |
| bbbbb | 0 |
| ccccc | 1 |
| ccccc | 0 |
.....

# Maybe a HAVING clause?
SELECT s.name,COUNT(d.joinme),d.id FROM static s LEFT JOIN dynamic d USING (joinme) group by name having d.id<15;
# does not work (and gives me an extra, nonsensical field, because you can't reference something in a HAVING clause unless it's selected).

#putting a JOIN in the FROM clause (not a good idea, but I'd do it if it worked) -- the outer WHERE d.id<15 still throws away the rows where d.id is NULL:
SELECT s.name,COUNT(d.joinme) FROM static s LEFT JOIN (SELECT joinme,id from dynamic where id<15) d USING (joinme) where d.id<15 GROUP BY name;


So what’s the answer? Well, the answer actually lies in the JOIN, which is a bit tricky. Basically, I was thinking of ON and USING as “what you join the tables on.” While that’s true, particularly for USING, it’s a bit different for ON.

The manual page at:

http://dev.mysql.com/doc/refman/4.1/en/join.html

does not particularly make things clear.

However, the first comment makes it clear:
The join_condition is used for winnowing the rows of the “right half” that you want to use for the cross, whereas the WHERE clause is used for winnowing the composite.

In other words, putting a condition in the WHERE clause means it’s going to limit the results; if you put a condition in the join condition, it limits each part before the actual join. This was not intuitive to me at all. However, in playing around with different possibilities to send to the list a message of “I’m frustrated, here’s everything I’ve tried!”, I ended up finding the answer.
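Concretely, that means moving the restriction into the join condition itself. A query along these lines (a reconstruction based on the explanation above) keeps all 26 names, with a count of 0 where nothing among the first 15 ids matches:

SELECT s.name,COUNT(d.joinme)
FROM static s LEFT JOIN dynamic d ON s.joinme=d.joinme AND d.id<15
GROUP BY name;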

I hope this is helpful to others as well…..

Those Who Can, Do

I finally got around to reading The Cathedral and the Bazaar — I’d known about the collection of essays since it was first published in 1999, and kept meaning to read it. Heck, that link goes to the entirety of the essays, free, online. Still, I never did read it until I recently received a copy of the book, which I have been reading on my commute to work.

It’s very engaging, and still accurate, even now. Of course I think about MySQL’s growth and business model as I’m reading it.

One of the tenets that keeps appearing is that “Brooks’ law that adding more people adds more complexity and actually slows a project down does not apply to open source.” He gives a few reasons, mostly communication, but there’s one crucial point he misses.

Brooks’ law does not apply to open source because there are few or no startup or coming-up-to-speed costs. The first contact with a developer is usually “Here’s the patch I’m submitting.” Now, perhaps that’s included in the management overhead the essays refer to; if so, it is not made clear. I do not think this changes any of the arguments, nor any of the conclusions. But I do think it is important to note that candidates for modifying open source software are not only filtered on self-selection, they are filtered on the ability to actually do the work.

In a traditional job interview, there is very little “testing” done. Sure, there are often questions, some technical, others not; but interviewing is a completely different environment than actually doing the job. Communication may be poorer due to nerves or self-doubt in an interview, when it might not otherwise occur during the actual job. Very occasionally, the traditional closed source world will give a job candidate a task to do on their own, and the completed task and resultant communication are the merits on which hiring is based.

If closed source hiring worked the same way open source hiring (whether for a volunteer position or a paid one) works, perhaps Brooks’ law would not apply there either. Why is this not done, then? Mostly because it’s assumed that a new employee or candidate would need their hands held, or that they cannot be allowed to see the code, because it’s closed source, and the (or a) revenue stream.

At any rate, those who work and succeed on open source software, whether they garner a paycheck or not, are by definition good at what they do, because they’ve already succeeded. In a traditional job, the costs of finding someone new and the liability of letting someone go because they “aren’t producing” are high. Therefore, it’s much easier to coast in these positions, and there are some folks who do. I have seen some rather impressive resumes for some rather unimpressive job candidates/co-workers.

I have seen some pretty unimpressive open source software (heck, I’ve written some myself); but for the most part, it all works as described. It may lack features, be very inefficient, or be limited on which systems it runs, but it almost always works without hacking. It’s difficult to say the same about closed source; one only needs to look at PeopleSoft or Remedy to see that. I’m sure there are plenty of examples; those are two examples I’m familiar with.

At any rate, the point is, those who can, do. That’s pretty much the motto of open source. I have some co-workers who rib me for donating to the EFF in order to get my name in the MySQL Source Code, because “it’s open source. You can get in there much more easily by submitting a patch.” This is true, in theory. In reality, I haven’t written any C code nor looked at code more complicated than a shell script in 5 years. One of the projects in the pipeline is to delve into the source code, but I’m not there yet.

So instead of submitting patches, I run the Boston MySQL User Group. I write articles in this blog. I help others optimize queries and MySQL configurations. We all do what we can, and that’s what makes open source great.

So the point of this? Do not be intimidated or discouraged because others are successful. Use them as a model, work hard, do what you can do, and you will be successful, too.

Certification By the Numbers

So,

http://www.mysql.com/training/certification/candidates.php?exam=core

has 233 names for the core exam, which has been out for a few years now. I passed it in 2004, and sure enough, my name is one of the 75 in the US. Are there really so few people out there, or are they just not taking/passing the exam?

I did notice at the MySQL Users Conference that only 1/4 to 1/3 of the exams earned a passing grade, and given that you have to pass 2 exams to receive certification, the chances of passing are not in your favor. Granted, studying helps (as does experience), which is how I passed the Core exam in 2004 and then both 5.0 DBA exams in April (2006).

Like much of the rest of life, it’s not difficult, you just have to work at it a bit. Read the certification guide, and study, and use MySQL, and you’ll pass.

I’m quite amazed at the low number of people who are actually certified. I guess it’s better for me; I’m 1 of 75 in the country and 1 of 233 in the world that are Core certified. I can’t wait to see the numbers when the 5.0 certification results are up.

Perhaps that will be Roland’s first task…. 😉

MySQL Related Nagios Plugins

After seeing Frank’s script in “Finding out how far behind are slaves”, I figured I would post some Nagios plugins I wrote a while ago, which we use at our company. They’re already up at http://www.nagiosexchange.org/, the 3rd-party repository for Nagios plugins.

So I figured I’d point you to:
Replication Lag Time
and
InnoDB Free Space

I’ll note that backups from a slave server can cause replication to fall behind, for 2 reasons — mysqldump falls under “heavy SELECTs”, as Frank has mentioned, but zipping up logs or general high load/high CPU usage on a server can also cause the database to fall behind, because less CPU time is given to MySQL.
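For the curious, the shape of such a lag check is roughly the following. This is a minimal sketch only, with hypothetical thresholds and with connection options omitted; the real plugins are at the links above:

#!/bin/sh
# Sketch: parse Seconds_Behind_Master out of SHOW SLAVE STATUS
# and map it to Nagios exit codes (0=OK, 1=WARNING, 2=CRITICAL).
lag=$(mysql -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master/ {print $2}')
if [ -z "$lag" ] || [ "$lag" = "NULL" ]; then
    echo "CRITICAL: replication is not running"; exit 2
elif [ "$lag" -gt 600 ]; then   # hypothetical critical threshold
    echo "CRITICAL: slave is ${lag}s behind"; exit 2
elif [ "$lag" -gt 60 ]; then    # hypothetical warning threshold
    echo "WARNING: slave is ${lag}s behind"; exit 1
else
    echo "OK: slave is ${lag}s behind"; exit 0
fi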

Jim Starkey Speaks, July Boston MySQL User Group Meeting

Please feel free to forward to interested parties.

Who: Jim Starkey at the Boston MySQL User Group
What: Falcon, the new MySQL storage engine
When:
Monday, July 10, 2006 at 7:00 PM
Where:
MIT Building E51, Room 372
Wadsworth and Amherst Streets
Cambridge, MA 02117
Steps from the Red line, plenty of free parking.

The July Boston MySQL User Group’s topic is Falcon, the new storage engine for MySQL. Creator Jim Starkey will speak. Jim Starkey has been writing database software for 20 years. He created BLOBs, multi-versioning concurrency for relational databases, cascading update triggers, event alerters, and more. Read more about him at http://tinyurl.com/lno4p and http://tinyurl.com/mym7d.

We will be meeting on MIT campus, close to the Kendall stop on the Red Line (subway). There is also plenty of free parking — you can park in ANY MIT lot after 3 pm, even if it says “parking by permit only”. We are in building E51, room 372.

Here is the URL for the MIT Map with the location of this building:
http://whereis.mit.edu/map-jpg?selection=E51&Buildings=go

This map shows the MBTA Kendall Stop:
http://whereis.mit.edu/map-jpg?selection=L5&Landmarks=go
(the stop is in red on that map, and you can see E51 in the bottom right)

Here are the URLs for the parking lots:
http://whereis.mit.edu/map-jpg?selection=P4&Parking=go
http://whereis.mit.edu/map-jpg?selection=P5&Parking=go

Free pizza and soda will be served, so please RSVP accurately and arrive a few minutes early.

To RSVP anonymously, please login to the Meetup site (http://mysql.meetup.com/137) with the e-mail address “admin at sheeri dot com” and the password of “guest”.

Sponsors welcome, call Sheeri at 857-205-9786 for details.

Populating Images Into a Test Database

Giuseppe Maxia recently posted on Filling Test Tables Quickly. However, he only gave examples for numbers, words and dates. This article will deal with randomly inserting images into a database — particularly useful for testing images stored in the database versus the filesystem. This article was inspired by Giuseppe’s article.

[note: there will be a post soon with my presentation on storing images in MySQL versus the filesystem, and there will be ample opportunity for discussion then, especially since everyone making a claim will have tested their system, given this code.]

Getting images into the database is not easy. All the tutorials I have read have used a programming language to get images into a database. Using cat and similar unix tools, I have tried to insert an image into a database using only the mysql client commandline tool. That failed — I would love to know if anyone has any tips on inserting an image into a database without using a programming language.
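One avenue worth testing is MySQL’s LOAD_FILE() function, which reads a file’s contents into a column from plain SQL. The sketch below assumes the file lives on the database server host, is readable by mysqld, and that the connecting user has the FILE privilege; the path and the imageNum value are hypothetical:

-- sketch: requires the FILE privilege; the path must be readable by mysqld
INSERT INTO originals (imageNum, image)
VALUES (1, LOAD_FILE('/tmp/01.jpg'));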

So, I compromised. I put 20 images into a table as a baseline, and then wrote a stored procedure to suit my needs. Basically, users can have up to 6 images, so I wanted to replicate that possibility in the test db.

This stored procedure takes in 3 integers: maximum # of images per uid, range of random images from the “originals” table, and # of uids to use when populating the table. You will also need a table called “originals” containing the original images. A sample table, with 20 images, can be downloaded from:
http://www.sheeri.net/public/originalimages.sql.gz.

These images were taken from:
http://www.takigen.co.jp/japanese/images/number without explicit permission. I have no idea what Takigen is as I do not read Japanese. I ran a Google Image Search for tiny pictures involving numbers, and found that website that way.

To import it, run at the shell prompt:

> gunzip originalimages.sql.gz
> mysql dbname < originalimages.sql
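The post doesn’t show the images table itself; a minimal definition consistent with the procedure’s INSERT would be something like this (hypothetical, inferred from the INSERT INTO images (uid,image,uploaded) below):

CREATE TABLE `images` (
`uid` int(10) unsigned NOT NULL,
`image` blob,
`uploaded` datetime default NULL);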

(explanation below the code)


delimiter $$

drop procedure if exists populate_images $$
create procedure populate_images(max_images INT,num_originals INT,finalcount INT)
not deterministic
modifies sql data
begin
declare uid int default 0;
declare num_images int default 0;
declare image_num int default 0;
declare image_counter int default 0;
set max_images:=max_images+1;
set num_originals:=num_originals+1;
while uid <= finalcount do
    set image_counter:=0;
    set num_images:=FLOOR(RAND() * max_images);
    while image_counter < num_images do
        set image_num:=FLOOR(RAND() * num_originals);
        INSERT INTO images (uid,image,uploaded)
        SELECT uid,image,NOW() from originals where imageNum=image_num;
        set image_counter:=image_counter+1;
    end while;
    set uid:=uid+1;
end while;
end $$

delimiter ;


Explanation:
declare uid int default 0;
Initialize the counter for the current uid.

declare num_images int default 0;
Initialize the variable for the number of images per user; selected at random, with the maximum given in the 1st argument to the procedure.

declare image_num int default 0;
Initialize the variable that will hold which image to retrieve from the originals table; selected at random for each image.

declare image_counter int default 0;
Initialize the counter for the image number; reset to 0 for each uid.

set max_images:=max_images+1;
set num_originals:=num_originals+1;

Variables are increased by 1 because I used < in the code instead of <=. I guess one could change the code to reflect this.

while uid <= finalcount do
loop through the appropriate # of uids…

set image_counter:=0;
set num_images:=FLOOR(RAND() * max_images);

reset the image counter and get a new # of images for this particular uid, between 0 and max_images. If you want to specify between 1 and max_images, change to 1+FLOOR(RAND() * max_images)
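Concretely, FLOOR(RAND() * n) yields an integer from 0 through n-1, which is why the variables were incremented by 1 above; for example:

SELECT FLOOR(RAND() * 5) AS pick; -- returns 0, 1, 2, 3 or 4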

while image_counter < num_images do
set image_num:=FLOOR(RAND() * num_originals);
Get a random number between 0 and num_originals (num_originals being how many images there are to choose from); this selects which image to retrieve.

INSERT INTO images (uid,image,uploaded) SELECT uid,image,NOW() from originals where imageNum=image_num;
Insert the random image and the uid and time uploaded into the images table.

set image_counter:=image_counter+1;
end while;

increment the image counter — you are finished with this image for this uid

set uid:=uid+1;
end while;
Increment the uid counter -- you are finished with this uid

end $$
Procedure end


Giuseppe’s article talked about randomly filling things quickly. This isn’t as lightning fast as other methods. To wit:

5 uids produced 20 image entries and took 1.14 seconds. (using call populate_images(6,20,5);)
50 uids produced 126 image entries and took 7.37 seconds.
500 uids produced 1416 image entries and took 1 min 26.48 seconds.
5,000 uids produced 14049 image entries and took 14 min 40.68 seconds.
(it scales linearly per image entry, not surprisingly).
So 50,000 uids would take about 2.5 hours, and 500,000 uids would take about a day to randomly fill.

Alternatively, using this procedure to randomly fill 5 uids and then repeating
insert into images (image,uid,uploaded) select image,uid,NOW() from images;

Takes:
5 uids, 1.33 seconds
50 uids, 1.60 seconds
500 uids, 2.03 seconds
5000 uids, 3.08 seconds
50000 uids, 30.45 seconds
500000 uids, 8 min, 47.10 seconds

But that’s only got 5 really random ones, copied a bunch of times. It may not be random enough. How about 1,000 random ones, and then copying them (I didn’t change uids, although I should have; I just didn’t feel like coding it):

call populate_images(6,20,1000);
insert into images (image,uid,uploaded) select image,uid,NOW() from images;

(repeating that last line)

Takes:
1000 uids, 2 min 53.66 seconds
5000 uids, 2 min 54.24 seconds
50000 uids, 3 min 31.91 seconds
500000 uids, 20 min 45.08 seconds

Comments and suggestions welcome.

Good Ideas Implemented Poorly

Please share good ideas implemented poorly you have seen.

For example, in the past 24 hours I have come across 2 such beasts:

1) A company doing vertical scaling based on username. Unsurprisingly, they found that 70% of their users had usernames beginning with 1 of 2 letters. A better idea? Do it by last 2 digits of a user id. That way, even though users are created and perhaps deleted all the time, the distribution is effectively random. Language is not.

With the last 2 digits, you can divide vertically many different ways — 1 large table of all the uids, 2 tables of uids ending in 00-49 and 50-99, 4 tables, 10 tables, 25 tables, 50 tables, 100 tables. This eliminates the need for a separate table for each user.

For example, the online dating site I work for uses this method for messaging. So a user’s inbox is just one part of a table. If the user has uid 12345, their “Inbox” is the result of SELECT * FROM Inbox_45 WHERE toUid=12345. Their “Sent” box is the result of SELECT * FROM Inbox_45 WHERE fromUid=12345. We have over 650,000 active users; I can only imagine what horrors we would have if we had 1 table for each person’s mailbox (much less 1 each for Trash, Sent, Saved and Inbox).
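The mapping from uid to table suffix is just the uid modulo 100, zero-padded; in SQL, that might look like:

SELECT CONCAT('Inbox_', LPAD(12345 % 100, 2, '0')); -- 'Inbox_45'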

This also helps when using InnoDB tables, as we are — the count(*) query to find the number of messages in your boxes uses the indexes on toUid and fromUid on the tables. 1 table per uid would not take advantage of this index, although it would probably be a bit faster in terms of retrieval. We kept this idea in mind — if we had one table per uid, we probably could use MyISAM tables, and counts would be faster. But again, horrors with 650,000 tables!

Also, you want to keep your code as flexible as possible. Do not hard code which servers have which tables — use SHOW TABLES in scripts and such. We truncate trashes nightly, and simply have a loop that says:


(pseudocode)
for t in $(mysql -N -e "SHOW TABLES LIKE 'Trash_%'" dbname)
do
mysql -e "TRUNCATE TABLE $t" dbname
done

2) The lead developer on a new project had the rule of “each table has a class”, which led another developer to come to me asking me to resolve a problem. You guessed it — tables in different databases that have the same name. We have a static database, with a table called “Preferences”. This is a joining table that says “preference #1 is always blah”. And in our profiles database for a site, we also have a “Preferences” table — this is where we associate users with their preferences.

The goal is world domination, er, to be able to have multiple sites use similar codebases and data schemas. Therefore, it is perfectly reasonable to have databases with identical schemas and different names. Site1 can have an identical schema and codebase to Site2, storing the data in the “Site1” and “Site2” databases. The reason to have 2 separate databases is that we want to avoid having to put which site the data refers to as a part of each row.
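Fully qualified table names keep the two “Preferences” tables distinct; the column names here are hypothetical, just to show the idea:

SELECT name FROM static.Preferences WHERE prefid=1; -- the joining table
SELECT prefid FROM Site1.Preferences WHERE uid=12345; -- the per-site table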

Any other examples of good ideas implemented poorly? Feel free to make an entry, but if it’s not on PlanetMySQL.org please put a link in the comments so I’m sure to actually read it.

Followers may note I fell way short of my goal of 1 post per day in May; in fact, I had 15 posts in May, not the 31 I had hoped. I’m OK with that, and I apologize for the long radio silence. In the meantime, I had a wonderful few weeks, including biking on the island of Martha’s Vineyard, and a wonderful phone conversation with Frank Mash last night.

If anyone cares to know, my Skype name is awfief. I’m happy to talk to folks; I’m GMT -5 (Eastern Standard Time, Daylight Saving Time when appropriate). I also have a tendency to leave the program on while I’m in the room, so feel free to ring (instead of Skype Chat) to see if I’m around.

Have you tried it?

So, I try to answer questions on the mysql users list. It usually frustrates me, but I also want to help. And often, I learn things.

My 2nd biggest pet peeve is that people don’t think to Google their findings. Or even “MySQL” their findings. If you have a short phrase or word, type the following into your address bar and the MySQL site does the right thing:

http://www.mysql.com/short word or phrase

try it:

http://www.mysql.com/replication
http://www.mysql.com/insert syntax
http://www.mysql.com/can’t%20connect

The bigger pet peeve I have is that people are AFRAID. They shake in their boots if there’s something they don’t understand. This is why there are such things as test servers. Most of the time you can run a query on a test server and see if it does the right thing.

We’re not creating bombs here. Sure, if you mess up a delete query you might mess up your system. But a SELECT query can always be “kill”ed. And a test server does not care if you delete data.
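For example, you can find the offending connection with SHOW PROCESSLIST and then kill it; the thread id below is hypothetical:

SHOW PROCESSLIST;
KILL 12345; -- the Id column from the processlist output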

There’s no excuse for not having a test server. MySQL is available for many platforms. It’s plenty easy to mysqldump your schema, and your data too. Have lots of data? Use SELECT and LIMIT.
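A sketch of what I mean, assuming a table with a numeric id column:

# schema only
mysqldump --no-data dbname > schema.sql
# just a sample of the data, via mysqldump's --where option
mysqldump --where="id < 1000" dbname tablename > sample.sql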

As someone who helps people, it’s quite frustrating to get the attitude of “I don’t know what to do?!!?” So yes, when someone says “Would this work?” it makes me want to scream. Wouldn’t it be easier for you to try it than to have me set up a test scenario for you?

Now of course, this is different from the “This should work and does not, why?” and that’s OK.
