Pros and Cons of Descriptive Foreign Keys?

Given the following scenario:

Addresses
name group city
Johnny Friends Cambridge
Kevin Friends Allston
Justin Coworkers Easton
Aya Coworkers North Andover
Josh Family Groton
Connie Family Easton

In a database, that could be represented as such:

CREATE TABLE GroupNum (
groupId TINYINT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
groupName VARCHAR(20)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE AddressNum (
firstName VARCHAR(20),
groupId TINYINT UNSIGNED NOT NULL,
city VARCHAR(20),
KEY (groupId),
FOREIGN KEY (groupId) REFERENCES GroupNum (groupId)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO GroupNum (groupName) VALUES ('Friends'),('Coworkers'),('Family');
INSERT INTO AddressNum (firstName,groupId,city) VALUES
('Johnny',1,'Cambridge'),
('Kevin',1,'Allston'),
('Justin',2,'Easton'),
('Aya',2,'North Andover'),
('Josh',3,'Groton'),
('Connie',3,'Easton');

SELECT firstName,groupName,city
FROM AddressNum INNER JOIN GroupNum USING (groupId);

firstName groupName city
Johnny Friends Cambridge
Kevin Friends Allston
Justin Coworkers Easton
Aya Coworkers North Andover
Josh Family Groton
Connie Family Easton

or, using descriptive foreign keys:

CREATE TABLE GroupText (
groupName VARCHAR(20) NOT NULL PRIMARY KEY
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE AddressText (
firstName VARCHAR(20),
groupName VARCHAR(20) NOT NULL,
city VARCHAR(20),
KEY (groupName),
FOREIGN KEY (groupName) REFERENCES GroupText (groupName)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO GroupText (groupName) VALUES ('Friends'),('Coworkers'),('Family');
INSERT INTO AddressText (firstName,groupName,city) VALUES
('Johnny','Friends','Cambridge'),
('Kevin','Friends','Allston'),
('Justin','Coworkers','Easton'),
('Aya','Coworkers','North Andover'),
('Josh','Family','Groton'),
('Connie','Family','Easton');

SELECT firstName,groupName,city
FROM AddressText;

firstName groupName city
Johnny Friends Cambridge
Kevin Friends Allston
Justin Coworkers Easton
Aya Coworkers North Andover
Josh Family Groton
Connie Family Easton

So what are the pros and cons of using descriptive foreign keys, as opposed to surrogate keys?

Con:
1) An auto-increment surrogate key (a number) is referenced faster than strings, so INSERTS will be slightly slower.
2) Data redundancy. Foreign keys need to be updated (and/or deleted) whenever the data is updated.

Pro:
1) Using a surrogate key means that to get user-readable data (ie, the text) a JOIN must be made. Using a descriptive foreign key, no JOIN needs to be made to get user-readable data. This includes reading with SELECTs and when INSERTing or modifying data. For example, to insert into AddressNum the groupId is needed.
2) Updates/deletes on a foreign key can be set to cascade.
3) Human-readable data in the database. This will make it easier for the DBA to debug data issues, as they can use a SHOW CREATE TABLE to find the foreign key references, and then get a sample of data from one table and understand the scenario, without having to JOIN.

My thinking right now is that I would use a descriptive foreign key where the forign key is “static” data, or the foreign key references are used in a small number of tables.

For instance, in a user-based application, I would not use “username” as a descriptive foreign key, I would stick with a surrogate key of “uid”, because the foreign key is ubiquitous, and if a user changed their username, every single table would need to cascade — or else the update would fail.

However, I would use descriptive foreign keys for “static” data like locale (language) data, country and state names, because they likely will not change, even though they might affect many changes. Athough note that in these places, I’m also using standards — “locale” for language, ISO country and state/province codes, etc.

I would also use it in the scenario above, where the data is not static, but also does not touch many tables — the ‘group name’ only affects the ‘listings’ table if it gets changed.

So, what are the pros and cons of this method? When would you and would you not use it?

Given the following scenario:

Addresses
name group city
Johnny Friends Cambridge
Kevin Friends Allston
Justin Coworkers Easton
Aya Coworkers North Andover
Josh Family Groton
Connie Family Easton

In a database, that could be represented as such:

CREATE TABLE GroupNum (
groupId TINYINT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
groupName VARCHAR(20)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE AddressNum (
firstName VARCHAR(20),
groupId TINYINT UNSIGNED NOT NULL,
city VARCHAR(20),
KEY (groupId),
FOREIGN KEY (groupId) REFERENCES GroupNum (groupId)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO GroupNum (groupName) VALUES ('Friends'),('Coworkers'),('Family');
INSERT INTO AddressNum (firstName,groupId,city) VALUES
('Johnny',1,'Cambridge'),
('Kevin',1,'Allston'),
('Justin',2,'Easton'),
('Aya',2,'North Andover'),
('Josh',3,'Groton'),
('Connie',3,'Easton');

SELECT firstName,groupName,city
FROM AddressNum INNER JOIN GroupNum USING (groupId);

firstName groupName city
Johnny Friends Cambridge
Kevin Friends Allston
Justin Coworkers Easton
Aya Coworkers North Andover
Josh Family Groton
Connie Family Easton

or, using descriptive foreign keys:

CREATE TABLE GroupText (
groupName VARCHAR(20) NOT NULL PRIMARY KEY
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE AddressText (
firstName VARCHAR(20),
groupName VARCHAR(20) NOT NULL,
city VARCHAR(20),
KEY (groupName),
FOREIGN KEY (groupName) REFERENCES GroupText (groupName)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO GroupText (groupName) VALUES ('Friends'),('Coworkers'),('Family');
INSERT INTO AddressText (firstName,groupName,city) VALUES
('Johnny','Friends','Cambridge'),
('Kevin','Friends','Allston'),
('Justin','Coworkers','Easton'),
('Aya','Coworkers','North Andover'),
('Josh','Family','Groton'),
('Connie','Family','Easton');

SELECT firstName,groupName,city
FROM AddressText;

firstName groupName city
Johnny Friends Cambridge
Kevin Friends Allston
Justin Coworkers Easton
Aya Coworkers North Andover
Josh Family Groton
Connie Family Easton

So what are the pros and cons of using descriptive foreign keys, as opposed to surrogate keys?

Con:
1) An auto-increment surrogate key (a number) is referenced faster than strings, so INSERTS will be slightly slower.
2) Data redundancy. Foreign keys need to be updated (and/or deleted) whenever the data is updated.

Pro:
1) Using a surrogate key means that to get user-readable data (ie, the text) a JOIN must be made. Using a descriptive foreign key, no JOIN needs to be made to get user-readable data. This includes reading with SELECTs and when INSERTing or modifying data. For example, to insert into AddressNum the groupId is needed.
2) Updates/deletes on a foreign key can be set to cascade.
3) Human-readable data in the database. This will make it easier for the DBA to debug data issues, as they can use a SHOW CREATE TABLE to find the foreign key references, and then get a sample of data from one table and understand the scenario, without having to JOIN.

My thinking right now is that I would use a descriptive foreign key where the forign key is “static” data, or the foreign key references are used in a small number of tables.

For instance, in a user-based application, I would not use “username” as a descriptive foreign key, I would stick with a surrogate key of “uid”, because the foreign key is ubiquitous, and if a user changed their username, every single table would need to cascade — or else the update would fail.

However, I would use descriptive foreign keys for “static” data like locale (language) data, country and state names, because they likely will not change, even though they might affect many changes. Athough note that in these places, I’m also using standards — “locale” for language, ISO country and state/province codes, etc.

I would also use it in the scenario above, where the data is not static, but also does not touch many tables — the ‘group name’ only affects the ‘listings’ table if it gets changed.

So, what are the pros and cons of this method? When would you and would you not use it?

15 thoughts on “Pros and Cons of Descriptive Foreign Keys?”

  1. One way around the e.g. Country question is to use an acronym – we have used the inernational 3-character airline code for this. Then the data in the database isn’t meaningless (the acronyms are quite easy to decipher), you don’t waste much space, and you can have a single-table update if the name-acronym does change

  2. If you have a lot of rows, the space saved by the surogate key can make a significate difference. You get more rows per page which reduces disk access, increases the percent of the table in kept in the cache, and reduces the depth of indexes. Also integer comparisions are faster than string comparisions especially with case-insensitive strings, so joins are improved. All of these can have a dramatic effect on perfomance in large tables.

  3. John,

    You’re going to have to give some more concrete data for that. I already noted that there’s more space taken up in “con #1”.

  4. Sherri,
    Maybe I’m just splitting hairs, but I think there is a difference between taking up more space (workaround: get bigger drives), and the effect that has on query performance. For large tables, query performance is largely dependant on the number of pages physically read from the disk. That in turn is largely dependent on the number of rows that can fit on a page (ignoring page splits and holes created by deletes).

    So for a ten million row table with 50 bytes per row, you can get about 160 rows per 8K page (assuming a bit of overhead used by the system) which means 62500 pages in a table scan. Figure about 150 reads/sec on a decent drive, and that takes about 420 seconds. If your rows are 45 bytes, you get about 177 rows per page, and 56480 pages or 370 seconds which is a noticeable improvement. This is similar to going from a 7-8 character “pretty” key to an int.

    For indexes it gets worse. Since only the index key is stored, going from an 8 character key to a 4 byte int key means twice as many index keys per trunk page which will ends up eliminating half the leaf nodes (again assuming no holes) which means one read saved per seek. For 30,000 indexed lookups in that table, you save about 15,000 page reads or about 100 seconds. I’m assuming the worst case of the leaf nodes not being cached (since the entire page cache is taken up by the data pages from the scan we just did in the previous example), but in my experience, the real world difference is still usually fairly dramatic.

Comments are closed.