Dinner tonight at OSCON

So, dinner tonight will start with appetizers at the Cantina restaurant at the DoubleTree Hotel. MySQL was nice enough to have an entire reception just for me… just kidding! However, that’s where I’ll be — since I told folks to watch this space, I wanted to make sure everyone knew where I’d be.

Also, I’d like to point out that there’s an obvious error in an algorithm on planetmysql.com, because it lists me as the #1 top poster.

PlanetMySQL dinner at OSCON, Tuesday July 25th

I will be free for dinner on Tuesday evening, July 25th, and I’d love it if you’d join me! Watch this space for details, although the update will be Tuesday, likely late afternoon.

(and of course let everyone know I’ll be doing my “So you’ve inherited a MySQL instance on unix” workshop — how a beginner can find out what’s on an instance, plus a security checklist for everyone — Wednesday at 4:30 pm.)

Partial Outer Join

I was wracking my brain for a bit, and could not figure out how to do a partial outer join where I was limiting rows from one table. So, I went to post to the MySQL user list — in writing my post, I found the answer. This often happens. It’s not that I don’t have questions to ask the MySQL user list (or forums), it’s just that most of the time, while I’m writing up the question, I find my answer.

First, I simplified my question. Many people do this, but they leave out important information. The trick is to come up with a simplification that you test to make sure it breaks in the same way the more complex, “real” query breaks.

The problem is basically this:
I want an outer join of a static table with a data table, such that I get a report of how many rows in the data table match the static table — including “0” if there are none, but for a subset of the data in the data table. Consider a billing report of how many invoices were generated per date, for a specified time period.

This should be easy, right? It was not very intuitive for me.

In this system, I’m using MySQL 4.1.19-standard-log, so stored procedures aren’t an option. Also, I’d like to avoid a loop — the standard way of doing this (the “developer” way) is to get all the values in the static table, do a count of rows in the dynamic table for each row in the static table, and then display. However, I wanted to do it in very few queries, and in such a way that the number of queries did not depend on the amount of static data. I call this doing it the “DBA way”.

First, the setup:

use test;

CREATE TABLE `dynamic` (
`id` int(10) unsigned NOT NULL auto_increment,
`joinme` char(1) default NULL,
PRIMARY KEY (`id`));

CREATE TABLE `static`
( `joinme` char(1) default NULL,
`name` char(5) default NULL);

insert into static (joinme) VALUES('a'),('b'),('c'),('d'),
('e'),('f'),('g'),('h'),('i'),
('j'),('k'),('l'),('m'),('n'),
('o'),('p'),('q'),('r'),('s'),
('t'),('u'),('v'),('w'),('x'),
('y'),('z');

update static set name=repeat(joinme,5);

insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;

First, a test to make sure everything’s OK:
# The following will give you the random distribution you have just made:
select count(id), joinme
from dynamic
group by joinme order by joinme;

# the result is <= 26 rows, depending on the random distribution.

# the following will show the distribution of all 26 names (aaaaa through zzzzz) represented in the dynamic table (well, all 26 joinme values, really, but that's irrelevant to the point)
SELECT s.name,COUNT(d.joinme)
FROM static s LEFT JOIN dynamic d USING (joinme)
GROUP BY name;

# and indeed, it gets the expected 26 rows, with 0 filled in for the count when there's no match in the data table.

So far, so good. Remember, I don’t want to join all of the entries in the dynamic table, because there could be millions of entries, and maybe I want to limit it to ids below a certain number. So what if I want the distribution of names given the first 15 ids?

Using:
SELECT s.name,COUNT(d.joinme)
FROM static s LEFT JOIN dynamic d USING (joinme)
WHERE d.id<15 group by name;

gives me the distribution, but only for the values that appear among those ids. So in one test, I get 13 rows, all with a count>0, instead of 26 rows, with some of them having a count of 0 and others having a count>0.

This breaks in the same way as my more complex query does.

How can I limit the values from one table and outer join them to another, retaining everything? This seems like a very simple outer join, but because I want to put limits, it gets tricky, and the intuitive stuff just does not work. I do not know if “partial outer join” is the right term, but it’s what I’ll use.

# UNION removes duplicate rows, but only if all of the column values match:
SELECT name,'' FROM static union SELECT s.name,COUNT(d.joinme) FROM static s LEFT JOIN dynamic d USING (joinme) where d.id<15 group by name order by name;
gives:
+-------+---+
| name | |
+-------+---+
| aaaaa | |
| aaaaa | 1 |
| bbbbb | |
| ccccc | |
| ccccc | 1 |
....

# similarly, using 0....:
SELECT name,0 FROM static union SELECT s.name,COUNT(d.joinme) FROM static s LEFT JOIN dynamic d USING (joinme) where d.id<15 group by name order by name;
+-------+---+
| name | 0 |
+-------+---+
| aaaaa | 0 |
| aaaaa | 1 |
| bbbbb | 0 |
| ccccc | 1 |
| ccccc | 0 |
.....

# Maybe a HAVING clause?
SELECT s.name,COUNT(d.joinme),d.id FROM static s LEFT JOIN dynamic d USING (joinme) group by name having d.id<15;
# does not work (and gives me an extra, nonsensical field, because you can't reference something in a HAVING clause unless it's selected).

#putting a JOIN in the FROM clause (not a good idea, but I'd do it if it worked) -- the outer WHERE d.id<15 still throws away the rows where d.id is NULL:
SELECT s.name,COUNT(d.joinme) FROM static s LEFT JOIN (SELECT joinme,id from dynamic where id<15) d USING (joinme) where d.id<15 GROUP BY name;


So what’s the answer? Well, the answer actually lies in the JOIN, which is a bit tricky. Basically, I was thinking of ON and USING as “what you join the tables on.” While that’s true, particularly for USING, it’s a bit different for ON.

The manual page at:

http://dev.mysql.com/doc/refman/4.1/en/join.html

does not particularly make things clear.

However, the first comment makes it clear:
The join_condition is used for winnowing the rows of the “right half” that you want to use for the cross, whereas the WHERE clause is used for winnowing the composite.

In other words, putting a condition in the WHERE clause means it’s going to limit the results; if you put a condition in the join condition, it limits each part before the actual join. This was not intuitive to me at all. However, in playing around with different possibilities to send to the list a message of “I’m frustrated, here’s everything I’ve tried!”, I ended up finding the answer.
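Concretely, that means moving the restriction into the join condition itself. A query along these lines (a reconstruction based on the explanation above) keeps all 26 names, with a count of 0 where nothing among the first 15 ids matches:

SELECT s.name,COUNT(d.joinme)
FROM static s LEFT JOIN dynamic d ON s.joinme=d.joinme AND d.id<15
GROUP BY name;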

I hope this is helpful to others as well…..

Those Who Can, Do

I finally got around to reading The Cathedral and the Bazaar — I’d known about the collection of essays since it was first published in 1999, and kept meaning to read it. Heck, that link goes to the entirety of the essays, free, online. Still, I never did read it until I recently received a copy of the book, which I have been reading on my commute to work.

It’s very engaging, and still accurate, even now. Of course I think about MySQL’s growth and business model as I’m reading it.

One of the tenets that keeps appearing is that “Brooks’ law that adding more people adds more complexity and actually slows a project down does not apply to open source.” He gives a few reasons, mostly communication, but there’s one crucial point he misses.

Brooks’ law does not apply to open source because there are few or no startup or coming-up-to-speed costs. The first contact with a developer is usually “Here’s the patch I’m submitting.” Now, perhaps that’s included in the management overhead the essays refer to; if so, it is not made clear. I do not think this changes any of the arguments, nor any of the conclusions. But I do think it is important to note that candidates for modifying open source software are not only filtered on self-selection, they are filtered on the ability to actually do the work.

In a traditional job interview, there is very little “testing” done. Sure, there are often questions, some technical, others not; but interviewing is a completely different environment than actually doing the job. Communication may be poorer due to nerves or self-doubt in an interview, when it might not otherwise occur during the actual job. Very occasionally, the traditional closed source world will give a job candidate a task to do on their own, and the completed task and resultant communication are the merits on which hiring is based.

If closed source hiring worked the same way open source hiring (whether for a volunteer position or a paid one) works, perhaps Brooks’ law would not apply there either. Why is this not done, then? Mostly because it’s assumed that a new employee or candidate would need their hands held, or that they cannot be allowed to see the code, because it’s closed source, and the (or a) revenue stream.

At any rate, those who work and succeed on open source software, whether they garner a paycheck or not, are by definition good at what they do, because they’ve already succeeded. In a traditional job, the costs of finding someone new and the liability of letting someone go because they “aren’t producing” are high. Therefore, it’s much easier to coast in these positions, and there are some folks who do. I have seen some rather impressive resumes for some rather unimpressive job candidates/co-workers.

I have seen some pretty unimpressive open source software (heck, I’ve written some myself); but for the most part, it all works as described. It may lack features, be very inefficient, or be limited on which systems it runs, but it almost always works without hacking. It’s difficult to say the same about closed source; one only needs to look at PeopleSoft or Remedy to see that. I’m sure there are plenty of examples; those are two examples I’m familiar with.

At any rate, the point is, those who can, do. That’s pretty much the motto of open source. I have some co-workers who rib me for donating to the EFF in order to get my name in the MySQL Source Code, because “it’s open source. You can get in there much more easily by submitting a patch.” This is true, in theory. In reality, I haven’t written any C code nor looked at code more complicated than a shell script in 5 years. One of the projects in the pipeline is to delve into the source code, but I’m not there yet.

So instead of submitting patches, I run the Boston MySQL User Group. I write articles in this blog. I help others optimize queries and MySQL configurations. We all do what we can, and that’s what makes open source great.

So the point of this? Do not be intimidated or discouraged because others are successful. Use them as a model, work hard, do what you can do, and you will be successful, too.

Certification By the Numbers

So,

http://www.mysql.com/training/certification/candidates.php?exam=core

has 233 names for the core exam, which has been out for a few years now. I passed it in 2004, and sure enough, my name is one of the 75 in the US. Are there really so few people out there, or are they just not taking/passing the exam?

I did notice at the MySQL Users Conference that only 1/4 to 1/3 of the exams earned a passing grade, and given that you have to pass 2 exams to receive certification, the chances of passing are not in your favor. Granted, studying helps (as does experience), which is how I passed the Core exam in 2004 and then both 5.0 DBA exams in April (2006).

Like much of the rest of life, it’s not difficult, you just have to work at it a bit. Read the certification guide, and study, and use MySQL, and you’ll pass.

I’m quite amazed at the low number of people who are actually certified. I guess it’s better for me; I’m 1 of 75 in the country and 1 of 233 in the world that are Core certified. I can’t wait to see the numbers when the 5.0 certification results are up.

Perhaps that will be Roland’s first task…. 😉

MySQL Related Nagios Plugins

After seeing Frank’s script in “Finding out how far behind are slaves”, I figured I would post some Nagios plugins I wrote a while ago, which we use at our company. They’re already up at http://www.nagiosexchange.org/, the 3rd-party repository for Nagios plugins.

So I figured I’d point you to:
Replication Lag Time
and
InnoDB Free Space

I’ll note that backups from a slave server can cause replication to fall behind, for 2 reasons — mysqldump falls under “heavy SELECTs”, as Frank has mentioned, but zipping up logs or general high load/high CPU usage on a server can also cause the database to fall behind, because less CPU time is given to MySQL.
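For the curious, the shape of such a lag check is roughly the following. This is a minimal sketch only, with hypothetical thresholds and with connection options omitted; the real plugins are at the links above:

#!/bin/sh
# Sketch: parse Seconds_Behind_Master out of SHOW SLAVE STATUS
# and map it to Nagios exit codes (0=OK, 1=WARNING, 2=CRITICAL).
lag=$(mysql -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master/ {print $2}')
if [ -z "$lag" ] || [ "$lag" = "NULL" ]; then
    echo "CRITICAL: replication is not running"; exit 2
elif [ "$lag" -gt 600 ]; then   # hypothetical critical threshold
    echo "CRITICAL: slave is ${lag}s behind"; exit 2
elif [ "$lag" -gt 60 ]; then    # hypothetical warning threshold
    echo "WARNING: slave is ${lag}s behind"; exit 1
else
    echo "OK: slave is ${lag}s behind"; exit 0
fi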

Jim Starkey Speaks, July Boston MySQL User Group Meeting

Please feel free to forward to interested parties.

Who: Jim Starkey at the Boston MySQL User Group
What: Falcon, the new MySQL storage engine
When:
Monday, July 10, 2006 at 7:00 PM
Where:
MIT Building E51, Room 372
Wadsworth and Amherst Streets
Cambridge, MA 02117
Steps from the Red line, plenty of free parking.

The July Boston MySQL User Group’s topic is Falcon, the new storage engine for MySQL. Creator Jim Starkey will speak. Jim Starkey has been writing database software for 20 years. He created BLOBs, multi-versioning concurrency for relational databases, cascading update triggers, event alerters, and more. Read more about him at http://tinyurl.com/lno4p and http://tinyurl.com/mym7d.

We will be meeting on MIT campus, close to the Kendall stop on the Red Line (subway). There is also plenty of free parking — you can park in ANY MIT lot after 3 pm, even if it says “parking by permit only”. We are in building E51, room 372.

Here is the URL for the MIT Map with the location of this building:
http://whereis.mit.edu/map-jpg?selection=E51&Buildings=go

This map shows the MBTA Kendall Stop:
http://whereis.mit.edu/map-jpg?selection=L5&Landmarks=go
(the stop is in red on that map, and you can see E51 in the bottom right)

Here are the URLs for the parking lots:
http://whereis.mit.edu/map-jpg?selection=P4&Parking=go
http://whereis.mit.edu/map-jpg?selection=P5&Parking=go

Free pizza and soda will be served, so please RSVP accurately and arrive a few minutes early.

To RSVP anonymously, please login to the Meetup site (http://mysql.meetup.com/137) with the e-mail address “admin at sheeri dot com” and the password of “guest”.

Sponsors welcome, call Sheeri at 857-205-9786 for details.

Populating Images Into a Test Database

Giuseppe Maxia recently posted on Filling Test Tables Quickly. However, he only gave examples for numbers, words and dates. This article will deal with randomly inserting images into a database — particularly useful for testing images stored in the database versus the filesystem. This article was inspired by Giuseppe’s article.

[note: there will be a post soon with my presentation on storing images in MySQL versus the filesystem, and there will be ample opportunity for discussion then, especially since everyone making a claim will have tested their system, given this code.]

Getting images into the database is not easy. All the tutorials I have read have used a programming language to get images into a database. Using cat and similar unix tools, I have tried to insert an image into a database using only the mysql client commandline tool. That failed — I would love to know if anyone has any tips on inserting an image into a database without using a programming language.
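One avenue worth testing is MySQL’s LOAD_FILE() function, which reads a file’s contents into a column from plain SQL. The sketch below assumes the file lives on the database server host, is readable by mysqld, and that the connecting user has the FILE privilege; the path and the imageNum value are hypothetical:

-- sketch: requires the FILE privilege; the path must be readable by mysqld
INSERT INTO originals (imageNum, image)
VALUES (1, LOAD_FILE('/tmp/01.jpg'));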

So, I compromised. I put 20 images into a table as a baseline, and then wrote a stored procedure to suit my needs. Basically, users can have up to 6 images, so I wanted to replicate that possibility in the test db.

This stored procedure takes in 3 integers: maximum # of images per uid, range of random images from the “originals” table, and # of uids to use when populating the table. You will also need a table called “originals” containing the original images. A sample table, with 20 images, can be downloaded from:
http://www.sheeri.net/public/originalimages.sql.gz.

These images were taken from:
http://www.takigen.co.jp/japanese/images/number without explicit permission. I have no idea what Takigen is as I do not read Japanese. I ran a Google Image Search for tiny pictures involving numbers, and found that website that way.

To import it, run at the shell prompt:

> gunzip originalimages.sql.gz
> mysql dbname < originalimages.sql
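The post doesn’t show the images table itself; a minimal definition consistent with the procedure’s INSERT would be something like this (hypothetical, inferred from the INSERT INTO images (uid,image,uploaded) below):

CREATE TABLE `images` (
`uid` int(10) unsigned NOT NULL,
`image` blob,
`uploaded` datetime default NULL);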

(explanation below the code)


delimiter $$

drop procedure if exists populate_images $$
create procedure populate_images(max_images INT,num_originals INT,finalcount INT)
not deterministic
modifies sql data
begin
declare uid int default 0;
declare num_images int default 0;
declare image_num int default 0;
declare image_counter int default 0;
set max_images:=max_images+1;
set num_originals:=num_originals+1;
while uid <= finalcount do
    set image_counter:=0;
    set num_images:=FLOOR(RAND() * max_images);
    while image_counter < num_images do
        set image_num:=FLOOR(RAND() * num_originals);
        INSERT INTO images (uid,image,uploaded)
        SELECT uid,image,NOW() from originals where imageNum=image_num;
        set image_counter:=image_counter+1;
    end while;
    set uid:=uid+1;
end while;
end $$

delimiter ;


Explanation:
declare uid int default 0;
Initialize the counter for the current uid.

declare num_images int default 0;
Initialize the variable for the number of images per user; selected at random, with the maximum given in the 1st argument to the procedure.

declare image_num int default 0;
Initialize the variable that will hold which image to retrieve from the originals table; selected at random for each image.

declare image_counter int default 0;
Initialize the counter for the image number; reset to 0 for each uid.

set max_images:=max_images+1;
set num_originals:=num_originals+1;

Variables are increased by 1 because I used < in the code instead of <=. I guess one could change the code to reflect this.

while uid <= finalcount do
loop through the appropriate # of uids…

set image_counter:=0;
set num_images:=FLOOR(RAND() * max_images);

reset the image counter and get a new # of images for this particular uid, between 0 and max_images. If you want to specify between 1 and max_images, change to 1+FLOOR(RAND() * max_images)
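Concretely, FLOOR(RAND() * n) yields an integer from 0 through n-1, which is why the variables were incremented by 1 above; for example:

SELECT FLOOR(RAND() * 5) AS pick; -- returns 0, 1, 2, 3 or 4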

while image_counter < num_images do
set image_num:=FLOOR(RAND() * num_originals);
Get a random number between 0 and num_originals (num_originals being how many images there are to choose from); this selects which image to retrieve.

INSERT INTO images (uid,image,uploaded) SELECT uid,image,NOW() from originals where imageNum=image_num;
Insert the random image and the uid and time uploaded into the images table.

set image_counter:=image_counter+1;
end while;

increment the image counter — you are finished with this image for this uid

set uid:=uid+1;
end while;
Increment the uid counter -- you are finished with this uid

end $$
Procedure end


Giuseppe’s article talked about randomly filling things quickly. This isn’t as lightning fast as other methods. To wit:

5 uids produced 20 image entries and took 1.14 seconds. (using call populate_images(6,20,5);)
50 uids produced 126 image entries and took 7.37 seconds.
500 uids produced 1416 image entries and took 1 min 26.48 seconds.
5,000 uids produced 14049 image entries and took 14 min 40.68 seconds.
(it scales linearly per image entry, not surprisingly).
So 50,000 uids would take about 2.5 hours, and 500,000 uids would take about a day to randomly fill.

Alternatively, using this procedure to randomly fill 5 uids and then repeating
insert into images (image,uid,uploaded) select image,uid,NOW() from images;

Takes:
5 uids, 1.33 seconds
50 uids, 1.60 seconds
500 uids, 2.03 seconds
5000 uids, 3.08 seconds
50000 uids, 30.45 seconds
500000 uids, 8 min, 47.10 seconds

But that’s only got 5 really random ones, copied a bunch of times. It may not be random enough. How about 1,000 random ones, and then copying them (I didn’t change uids, although I should have; I just didn’t feel like coding it):

call populate_images(6,20,1000);
insert into images (image,uid,uploaded) select image,uid,NOW() from images;

(repeating that last line)

Takes:
1000 uids, 2 min 53.66 seconds
5000 uids, 2 min 54.24 seconds
50000 uids, 3 min 31.91 seconds
500000 uids, 20 min 45.08 seconds

Comments and suggestions welcome.

Good Ideas Implemented Poorly

Please share good ideas implemented poorly you have seen.

For example, in the past 24 hours I have come across 2 such beasts:

1) A company doing vertical scaling based on username. Unsurprisingly, they found that 70% of their users had usernames beginning with 1 of 2 letters. A better idea? Do it by last 2 digits of a user id. That way, even though users are created and perhaps deleted all the time, the distribution is effectively random. Language is not.

With the last 2 digits, you can divide vertically many different ways — 1 large table of all the uids, 2 tables of uids ending in 00-49 and 50-99, 4 tables, 10 tables, 25 tables, 50 tables, 100 tables. This eliminates the need for a separate table for each user.

For example, the online dating site I work for uses this method for messaging. So a user’s inbox is just one part of a table. If the user has uid 12345, their “Inbox” is the result of SELECT * FROM Inbox_45 WHERE toUid=12345. Their “Sent” box is the result of SELECT * FROM Inbox_45 WHERE fromUid=12345. We have over 650,000 active users; I can only imagine what horrors we would have if we had 1 table for each person’s mailbox (much less 1 each for Trash, Sent, Saved and Inbox).
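The mapping from uid to table suffix is just the uid modulo 100, zero-padded; in SQL, that might look like:

SELECT CONCAT('Inbox_', LPAD(12345 % 100, 2, '0')); -- 'Inbox_45'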

This also helps when using InnoDB tables, as we are — the count(*) query to find the number of messages in your boxes uses the indexes on toUid and fromUid on the tables. 1 table per uid would not take advantage of this index, although it would probably be a bit faster in terms of retrieval. We kept this idea in mind — if we had one table per uid, we probably could use MyISAM tables, and counts would be faster. But again, horrors with 650,000 tables!

Also, you want to keep your code as flexible as possible. Do not hard code which servers have which tables — use SHOW TABLES in scripts and such. We truncate trashes nightly, and simply have a loop that says:


(pseudocode)
for t in $(mysql -N -e "SHOW TABLES LIKE 'Trash_%'" dbname)
do
mysql -e "TRUNCATE TABLE $t" dbname
done

2) The lead developer on a new project had the rule of “each table has a class”, which led another developer to come to me asking me to resolve a problem. You guessed it — tables in different databases that have the same name. We have a static database, with a table called “Preferences”. This is a joining table that says “preference #1 is always blah”. And in our profiles database for a site, we also have a “Preferences” table — this is where we associate users with their preferences.

The goal is world domination, er, to be able to have multiple sites use similar codebases and data schemas. Therefore, it is perfectly reasonable to have databases with identical schemas and different names. Site1 can have an identical schema and codebase to Site2, storing the data in the “Site1” and “Site2” databases. The reason to have 2 separate databases is that we want to avoid having to put which site the data refers to as a part of each row.
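Fully qualified table names keep the two “Preferences” tables distinct; the column names here are hypothetical, just to show the idea:

SELECT name FROM static.Preferences WHERE prefid=1; -- the joining table
SELECT prefid FROM Site1.Preferences WHERE uid=12345; -- the per-site table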

Any other examples of good ideas implemented poorly? Feel free to make an entry, but if it’s not on PlanetMySQL.org please put a link in the comments so I’m sure to actually read it.

Followers may note I fell way short of my goal of 1 post per day in May; in fact, I had 15 posts in May, not the 31 I had hoped. I’m OK with that, and I apologize for the long radio silence. In the meantime, I had a wonderful few weeks, including biking on the island of Martha’s Vineyard, and a wonderful phone conversation with Frank Mash last night.

If anyone cares to know, my Skype name is awfief. I’m happy to talk to folks; I’m GMT -5 (Eastern Standard Time, Daylight Saving Time when appropriate). I also have a tendency to leave the program on while I’m in the room, so feel free to ring (instead of Skype Chat) to see if I’m around.

Have you tried it?

So, I try to answer questions on the mysql users list. It usually frustrates me, but I also want to help. And often, I learn things.

My 2nd biggest pet peeve is that people don’t think to Google their findings. Or even “MySQL” their findings. If you have a short phrase or word, type the following into your address bar and the MySQL site does the right thing:

http://www.mysql.com/short word or phrase

try it:

http://www.mysql.com/replication
http://www.mysql.com/insert syntax
http://www.mysql.com/can’t%20connect

The bigger pet peeve I have is that people are AFRAID. They shake in their boots if there’s something they don’t understand. This is why there are such things as test servers. Most of the time you can run a query on a test server and see if it does the right thing.

We’re not creating bombs here. Sure, if you mess up a delete query you might mess up your system. But a SELECT query can always be “kill”ed. And a test server does not care if you delete data.
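For example, you can find the offending connection with SHOW PROCESSLIST and then kill it; the thread id below is hypothetical:

SHOW PROCESSLIST;
KILL 12345; -- the Id column from the processlist output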

There’s no excuse for not having a test server. MySQL is available for many platforms. It’s plenty easy to mysqldump your schema, and your data too. Have lots of data? Use SELECT and LIMIT.
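A sketch of what I mean, assuming a table with a numeric id column:

# schema only
mysqldump --no-data dbname > schema.sql
# just a sample of the data, via mysqldump's --where option
mysqldump --where="id < 1000" dbname tablename > sample.sql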

As someone who helps people, it’s quite frustrating to get the attitude of “I don’t know what to do?!!?” So yes, when someone says “Would this work?” it makes me want to scream. Wouldn’t it be easier for you to try it than to have me set up a test scenario for you?

Now of course, this is different from the “This should work and does not, why?” and that’s OK.
