Partial Outer Join

I was wracking my brain for a bit, and could not figure out how to do a partial outer join where I was limiting rows from one table. So, I went to post to the MySQL user list — in writing my post, I found the answer. This often happens. It’s not that I don’t have questions to ask the MySQL user list (or forums), it’s just that most of the time, while I’m writing up the question, I find my answer.

First, I simplified my question. Many people do this, but they leave out important information. The trick is to come up with a simplification that you test to make sure it breaks in the same way the more complex, “real” query breaks.

The problem is basically this:
I want an outer join of a static table with a data table, such that I get a report of how many rows in the data table match the static table — including “0” if there are none, but for a subset of the data in the data table. Consider a billing report of how many invoices were generated per date, for a specified time period.

This should be easy, right? It was not very intuitive for me.

In this system, I’m using MySQL 4.1.19-standard-log so stored procedures aren’t an option. Also, I’d like to avoid a loop — the standard way of doing this (the “developer” way) is to get all the values in the static table, do a count of rows in the dynamic table for each row in the static table, and then display. However, I wanted to do it in very few queries, and so that the number of queries did not depend on the static data. I call this doing it the “DBA way”.

First, the setup:

use test;

CREATE TABLE `dynamic` (
`id` int(10) unsigned NOT NULL auto_increment,
`joinme` char(1) default NULL,
PRIMARY KEY (`id`));

CREATE TABLE `static`
( `joinme` char(1) default NULL,
`name` char(5) default NULL);

insert into static (joinme) VALUES('a'),('b'),('c'),('d'),
('e'),('f'),('g'),('h'),('i'),
('j'),('k'),('l'),('m'),('n'),
('o'),('p'),('q'),('r'),('s'),
('t'),('u'),('v'),('w'),('x'),
('y'),('z');

update static set name=repeat(joinme,5);

insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;

First, a test to make sure everything’s OK:
# The following will give you the random distribution you have just made:
select count(id), joinme
from dynamic
group by joinme order by joinme;

# the result is < = 26 rows, depending on the random distribution.

# the following will show the distribution of all 26 names (aaaaa through zzzzz) represented in the dynamic table (well, all 26 joinme values, really, but that's irrelevant to the point)
SELECT s.name,COUNT(d.joinme)
FROM static s LEFT JOIN dynamic d USING (joinme)
GROUP BY name;

# and indeed, it gets the expected 26 rows, with 0 filled in for the count when there's no match in the data table.

So far, so good. Remember, I don’t want to join all of the entries in the dynamic table….because it could be millions of entries, and maybe I want to limit it to id’s of a certain number. So what if I want the distribution of names given the first 15 id’s?

Using:
SELECT s.name,COUNT(d.joinme)
FROM static s LEFT JOIN dynamic d USING (joinme)
WHERE d.id<15 group by name;

gives me the distribution, but only for the values that among those ids. So in one test, I get 13 rows, all with a count>0, instead of 26 rows, with some of them having a count of 0 and others having a count>0.

This breaks in the same way my more complex query.

How can I limit the values from one table and outer join them to another, retaining everything? This seems like a very simple outer join, but because I want to put limits, it gets tricky, and the intuitive stuff just does not work. I do not know if “partial outer join” is the right term, but it’s what I’ll use.

# UNION does not duplicate values, but only if all the values match:
SELECT name,'' FROM static union SELECT s.name,COUNT(d.joinme) FROM static s LEFT JOIN dynamic d USING (joinme) where d.id<15 group by name order by name;
gives:
+-------+---+
| name | |
+-------+---+
| aaaaa | |
| aaaaa | 1 |
| bbbbb | |
| ccccc | |
| ccccc | 1 |
....

# similarly, using 0....:
SELECT name,0 FROM static union SELECT s.name,COUNT(d.joinme) FROM static s LEFT JOIN dynamic d USING (joinme) where d.id<15 group by name order by name;
+-------+---+
| name | 0 |
+-------+---+
| aaaaa | 0 |
| aaaaa | 1 |
| bbbbb | 0 |
| ccccc | 1 |
| ccccc | 0 |
.....

# Maybe a HAVING clause?
SELECT s.name,COUNT(d.joinme),d.id FROM static s LEFT JOIN dynamic d USING (joinme) group by name having d.id<15;
# does not work (and gives me an extra (and nonsensical) field, because you can't put something in a HAVING clause unless it's selected).

#putting a JOIN in the FROM clause (not a good idea, but I'd do it if it worked):
SELECT s.name,COUNT(d.joinme) FROM static s LEFT JOIN (SELECT joinme,id from dynamic where id<15) d USING (joinme) where d.id<15 GROUP BY name;


So what’s the answer? Well, the answer actually lies in the JOIN, which is a bit tricky. Basically, I was thinking of ON and USING as “what you join the tables on.” While that’s true, particularly for USING, it’s a bit different for ON.

The manual page at:

http://dev.mysql.com/doc/refman/4.1/en/join.html

does not particularly make things clear.

However, the first comment makes it clear:
The join_condition is used for winnowing the rows of the “right half” that you want to use for the cross, whereas the WHERE clause is used for winnowing the composite.

In other words, putting a condition in the WHERE clause means it’s going to limit the results; if you put a condition in the join condition, it limits each part before the actual join. This was not intuitive to me at all. However, in playing around with different possibilities to send to the list a message of “I’m frustrated, here’s everything I’ve tried!”, I ended up finding the answer.

I hope this is helpful to others as well…..

I was wracking my brain for a bit, and could not figure out how to do a partial outer join where I was limiting rows from one table. So, I went to post to the MySQL user list — in writing my post, I found the answer. This often happens. It’s not that I don’t have questions to ask the MySQL user list (or forums), it’s just that most of the time, while I’m writing up the question, I find my answer.

First, I simplified my question. Many people do this, but they leave out important information. The trick is to come up with a simplification that you test to make sure it breaks in the same way the more complex, “real” query breaks.

The problem is basically this:
I want an outer join of a static table with a data table, such that I get a report of how many rows in the data table match the static table — including “0” if there are none, but for a subset of the data in the data table. Consider a billing report of how many invoices were generated per date, for a specified time period.

This should be easy, right? It was not very intuitive for me.

In this system, I’m using MySQL 4.1.19-standard-log so stored procedures aren’t an option. Also, I’d like to avoid a loop — the standard way of doing this (the “developer” way) is to get all the values in the static table, do a count of rows in the dynamic table for each row in the static table, and then display. However, I wanted to do it in very few queries, and so that the number of queries did not depend on the static data. I call this doing it the “DBA way”.

First, the setup:

use test;

CREATE TABLE `dynamic` (
`id` int(10) unsigned NOT NULL auto_increment,
`joinme` char(1) default NULL,
PRIMARY KEY (`id`));

CREATE TABLE `static`
( `joinme` char(1) default NULL,
`name` char(5) default NULL);

insert into static (joinme) VALUES('a'),('b'),('c'),('d'),
('e'),('f'),('g'),('h'),('i'),
('j'),('k'),('l'),('m'),('n'),
('o'),('p'),('q'),('r'),('s'),
('t'),('u'),('v'),('w'),('x'),
('y'),('z');

update static set name=repeat(joinme,5);

insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;

First, a test to make sure everything’s OK:
# The following will give you the random distribution you have just made:
select count(id), joinme
from dynamic
group by joinme order by joinme;

# the result is < = 26 rows, depending on the random distribution.

# the following will show the distribution of all 26 names (aaaaa through zzzzz) represented in the dynamic table (well, all 26 joinme values, really, but that's irrelevant to the point)
SELECT s.name,COUNT(d.joinme)
FROM static s LEFT JOIN dynamic d USING (joinme)
GROUP BY name;

# and indeed, it gets the expected 26 rows, with 0 filled in for the count when there's no match in the data table.

So far, so good. Remember, I don’t want to join all of the entries in the dynamic table….because it could be millions of entries, and maybe I want to limit it to id’s of a certain number. So what if I want the distribution of names given the first 15 id’s?

Using:
SELECT s.name,COUNT(d.joinme)
FROM static s LEFT JOIN dynamic d USING (joinme)
WHERE d.id<15 group by name;

gives me the distribution, but only for the values that among those ids. So in one test, I get 13 rows, all with a count>0, instead of 26 rows, with some of them having a count of 0 and others having a count>0.

This breaks in the same way my more complex query.

How can I limit the values from one table and outer join them to another, retaining everything? This seems like a very simple outer join, but because I want to put limits, it gets tricky, and the intuitive stuff just does not work. I do not know if “partial outer join” is the right term, but it’s what I’ll use.

# UNION does not duplicate values, but only if all the values match:
SELECT name,'' FROM static union SELECT s.name,COUNT(d.joinme) FROM static s LEFT JOIN dynamic d USING (joinme) where d.id<15 group by name order by name;
gives:
+-------+---+
| name | |
+-------+---+
| aaaaa | |
| aaaaa | 1 |
| bbbbb | |
| ccccc | |
| ccccc | 1 |
....

# similarly, using 0....:
SELECT name,0 FROM static union SELECT s.name,COUNT(d.joinme) FROM static s LEFT JOIN dynamic d USING (joinme) where d.id<15 group by name order by name;
+-------+---+
| name | 0 |
+-------+---+
| aaaaa | 0 |
| aaaaa | 1 |
| bbbbb | 0 |
| ccccc | 1 |
| ccccc | 0 |
.....

# Maybe a HAVING clause?
SELECT s.name,COUNT(d.joinme),d.id FROM static s LEFT JOIN dynamic d USING (joinme) group by name having d.id<15;
# does not work (and gives me an extra (and nonsensical) field, because you can't put something in a HAVING clause unless it's selected).

#putting a JOIN in the FROM clause (not a good idea, but I'd do it if it worked):
SELECT s.name,COUNT(d.joinme) FROM static s LEFT JOIN (SELECT joinme,id from dynamic where id<15) d USING (joinme) where d.id<15 GROUP BY name;


So what’s the answer? Well, the answer actually lies in the JOIN, which is a bit tricky. Basically, I was thinking of ON and USING as “what you join the tables on.” While that’s true, particularly for USING, it’s a bit different for ON.

The manual page at:

http://dev.mysql.com/doc/refman/4.1/en/join.html

does not particularly make things clear.

However, the first comment makes it clear:
The join_condition is used for winnowing the rows of the “right half” that you want to use for the cross, whereas the WHERE clause is used for winnowing the composite.

In other words, putting a condition in the WHERE clause means it’s going to limit the results; if you put a condition in the join condition, it limits each part before the actual join. This was not intuitive to me at all. However, in playing around with different possibilities to send to the list a message of “I’m frustrated, here’s everything I’ve tried!”, I ended up finding the answer.

I hope this is helpful to others as well…..