Tuesday, February 9, 2010

Finding Rows with No Match in Another Table

The preceding sections focused on finding matches between two tables. But the answers to some questions require determining which records do not have a match (or, stated another way, which records have values that are missing from the other table). For example, you might want to know which artists in the artist table you don't yet have any paintings by. The same kind of question occurs in other contexts, such as:

*

You're working in sales. You have a list of potential customers, and another list of people who have placed orders. To focus your efforts on people who are not yet actual customers, you want to find people in the first list that are not in the second.
*

You have one list of baseball players, another list of players who have hit home runs, and you want to know which players in the first list have not hit a home run. The answer is determined by finding those players in the first list who are not in the second.

For these types of questions, you need to use a LEFT JOIN.

To see why, let's determine which artists in the artist table are missing from the painting table. At present, the tables are small, so it's easy to examine them visually and determine that you have no paintings by Monet and Picasso (there are no painting records with an a_id value of 2 or 4):

mysql> SELECT * FROM artist ORDER BY a_id;
+------+----------+
| a_id | name |
+------+----------+
| 1 | Da Vinci |
| 2 | Monet |
| 3 | Van Gogh |
| 4 | Picasso |
| 5 | Renoir |
+------+----------+
mysql> SELECT * FROM painting ORDER BY a_id, p_id;
+------+------+-------------------+-------+-------+
| a_id | p_id | title | state | price |
+------+------+-------------------+-------+-------+
| 1 | 1 | The Last Supper | IN | 34 |
| 1 | 2 | The Mona Lisa | MI | 87 |
| 3 | 3 | Starry Night | KY | 48 |
| 3 | 4 | The Potato Eaters | KY | 67 |
| 3 | 5 | The Rocks | IA | 33 |
| 5 | 6 | Les Deux Soeurs | NE | 64 |
+------+------+-------------------+-------+-------+

But as you acquire more paintings and the tables get larger, it won't be so easy to eyeball them and answer the question by inspection. Can you answer the question using SQL? Sure, although first attempts at solving the problem generally look something like the following query, using a WHERE clause that looks for mismatches between the two tables:

mysql> SELECT * FROM artist, painting WHERE artist.a_id != painting.a_id;
+------+----------+------+------+-------------------+-------+-------+
| a_id | name | a_id | p_id | title | state | price |
+------+----------+------+------+-------------------+-------+-------+
| 2 | Monet | 1 | 1 | The Last Supper | IN | 34 |
| 3 | Van Gogh | 1 | 1 | The Last Supper | IN | 34 |
| 4 | Picasso | 1 | 1 | The Last Supper | IN | 34 |
| 5 | Renoir | 1 | 1 | The Last Supper | IN | 34 |
| 2 | Monet | 1 | 2 | The Mona Lisa | MI | 87 |
| 3 | Van Gogh | 1 | 2 | The Mona Lisa | MI | 87 |
| 4 | Picasso | 1 | 2 | The Mona Lisa | MI | 87 |
| 5 | Renoir | 1 | 2 | The Mona Lisa | MI | 87 |
| 1 | Da Vinci | 3 | 3 | Starry Night | KY | 48 |
| 2 | Monet | 3 | 3 | Starry Night | KY | 48 |
| 4 | Picasso | 3 | 3 | Starry Night | KY | 48 |
| 5 | Renoir | 3 | 3 | Starry Night | KY | 48 |
| 1 | Da Vinci | 3 | 4 | The Potato Eaters | KY | 67 |
| 2 | Monet | 3 | 4 | The Potato Eaters | KY | 67 |
| 4 | Picasso | 3 | 4 | The Potato Eaters | KY | 67 |
| 5 | Renoir | 3 | 4 | The Potato Eaters | KY | 67 |
| 1 | Da Vinci | 3 | 5 | The Rocks | IA | 33 |
| 2 | Monet | 3 | 5 | The Rocks | IA | 33 |
| 4 | Picasso | 3 | 5 | The Rocks | IA | 33 |
| 5 | Renoir | 3 | 5 | The Rocks | IA | 33 |
| 1 | Da Vinci | 5 | 6 | Les Deux Soeurs | NE | 64 |
| 2 | Monet | 5 | 6 | Les Deux Soeurs | NE | 64 |
| 3 | Van Gogh | 5 | 6 | Les Deux Soeurs | NE | 64 |
| 4 | Picasso | 5 | 6 | Les Deux Soeurs | NE | 64 |
+------+----------+------+------+-------------------+-------+-------+

That's obviously not the correct result! The query produces a list of all combinations of values from the two rows where the values aren't the same, but what you really want is a list of values in artist that aren't present at all in painting. The trouble here is that a regular join can only produce combinations from values that are present in the tables. It can't tell you anything about values that are missing.

When faced with the problem of finding values in one table that have no match in (or that are missing from) another table, you should get in the habit of thinking, "aha, that's a LEFT JOIN problem." A LEFT JOIN is similar to a regular join in that it attempts to match rows in the first (left) table with the rows in the second (right) table. But in addition, if a left table row has no match in the right table, a LEFT JOIN still produces a row—one in which all the columns from the right table are set to NULL. This means you can find values that are missing from the right table by looking for NULL. It's easier to observe how this happens by working in stages. First, run a regular join to find matching rows:

mysql> SELECT * FROM artist, painting
-> WHERE artist.a_id = painting.a_id;
+------+----------+------+------+-------------------+-------+-------+
| a_id | name | a_id | p_id | title | state | price |
+------+----------+------+------+-------------------+-------+-------+
| 1 | Da Vinci | 1 | 1 | The Last Supper | IN | 34 |
| 1 | Da Vinci | 1 | 2 | The Mona Lisa | MI | 87 |
| 3 | Van Gogh | 3 | 3 | Starry Night | KY | 48 |
| 3 | Van Gogh | 3 | 4 | The Potato Eaters | KY | 67 |
| 3 | Van Gogh | 3 | 5 | The Rocks | IA | 33 |
| 5 | Renoir | 5 | 6 | Les Deux Soeurs | NE | 64 |
+------+----------+------+------+-------------------+-------+-------+

In this output, the first a_id column comes from the artist table and the second one comes from the painting table.

Now compare that result with the output you get from a LEFT JOIN. A LEFT JOIN is written in somewhat similar fashion, but you separate the table names by LEFT JOIN rather than by a comma, and specify which columns to compare using an ON clause rather than a WHERE clause:

mysql> SELECT * FROM artist LEFT JOIN painting
-> ON artist.a_id = painting.a_id;
+------+----------+------+------+-------------------+-------+-------+
| a_id | name | a_id | p_id | title | state | price |
+------+----------+------+------+-------------------+-------+-------+
| 1 | Da Vinci | 1 | 1 | The Last Supper | IN | 34 |
| 1 | Da Vinci | 1 | 2 | The Mona Lisa | MI | 87 |
| 2 | Monet | NULL | NULL | NULL | NULL | NULL |
| 3 | Van Gogh | 3 | 3 | Starry Night | KY | 48 |
| 3 | Van Gogh | 3 | 4 | The Potato Eaters | KY | 67 |
| 3 | Van Gogh | 3 | 5 | The Rocks | IA | 33 |
| 4 | Picasso | NULL | NULL | NULL | NULL | NULL |
| 5 | Renoir | 5 | 6 | Les Deux Soeurs | NE | 64 |
+------+----------+------+------+-------------------+-------+-------+

The output is similar to that from the regular join, except that the LEFT JOIN also produces an output row for artist rows that have no painting table match. For those output rows, all the columns from painting are set to NULL.

Next, to restrict the output only to the non-matched artist rows, add a WHERE clause that looks for NULL values in the painting column that is named in the ON clause:

mysql> SELECT * FROM artist LEFT JOIN painting
-> ON artist.a_id = painting.a_id
-> WHERE painting.a_id IS NULL;
+------+---------+------+------+-------+-------+
| a_id | name | a_id | p_id | title | price |
+------+---------+------+------+-------+-------+
| 2 | Monet | NULL | NULL | NULL | NULL |
| 4 | Picasso | NULL | NULL | NULL | NULL |
+------+---------+------+------+-------+-------+

Finally, to show only the artist table values that are missing from the painting table, shorten the output column list to include only columns from the artist table:

mysql> SELECT artist.* FROM artist LEFT JOIN painting
-> ON artist.a_id = painting.a_id
-> WHERE painting.a_id IS NULL;
+------+---------+
| a_id | name |
+------+---------+
| 2 | Monet |
| 4 | Picasso |
+------+---------+

The preceding LEFT JOIN lists those left-table values that are not present in the right table. A similar kind of operation can be used to report each left-table value along with an indicator whether or not it's present in the right table. To do this, perform a LEFT JOIN to count the number of times each left-table value occurs in the right table. A count of zero indicates the value is not present. The following query lists each artist from the artist table, and whether or not you have any paintings by the artist:

mysql> SELECT artist.name,
-> IF(COUNT(painting.a_id)>0,'yes','no') AS 'in collection'
-> FROM artist LEFT JOIN painting ON artist.a_id = painting.a_id
-> GROUP BY artist.name;
+----------+---------------+
| name | in collection |
+----------+---------------+
| Da Vinci | yes |
| Monet | no |
| Picasso | no |
| Renoir | yes |
| Van Gogh | yes |
+----------+---------------+

As of MySQL 3.23.25, you can also use RIGHT JOIN, which is like LEFT JOIN but reverses the roles of the left and right tables. In other words, RIGHT JOIN forces the matching process to produce a row from each table in the right table, even in the absence of a corresponding row in the left table. This means you would rewrite the preceding LEFT JOIN as follows to convert it to a RIGHT JOIN that produces the same results:

mysql> SELECT artist.name,
-> IF(COUNT(painting.a_id)>0,'yes','no') AS 'in collection'
-> FROM painting RIGHT JOIN artist ON painting.a_id = artist.a_id
-> GROUP BY artist.name;
+----------+---------------+
| name | in collection |
+----------+---------------+
| Da Vinci | yes |
| Monet | no |
| Picasso | no |
| Renoir | yes |
| Van Gogh | yes |
+----------+---------------+

Elsewhere in this book, I'll generally refer in discussion only to LEFT JOIN for brevity, but the discussions apply to RIGHT JOIN as well if you reverse the roles of the tables.

No comments:

Post a Comment