Looking for data

I work as a QA Engineer in a “stealth mode” startup building a network storage appliance. I am looking for “real world” datasets to load into our appliance to profile the performance and scalability of the product, given different schema models populated with real-world distributions of data. I envision two significantly different datasets. One is a “flat file” schema, like historical or logging data from web server access and error logs. The other would be a relational (preferably star schema) database, like a reservation or inventory control database.

The data doesn’t need to be current. And it can be scrubbed to remove “real” data. The data won’t be used outside the QA lab. Again, this is to test “how does the product work when data that lives in the outside world is loaded.”

Ultimately, I am looking for 2 to 10 Terabytes of composite data at the end of the project.

Access Has “SET”, Recommends Not Using It

http://www.regdeveloper.co.uk/2006/07/18/multivalued_datatypes_access/

This is an interesting read — it would be awesome if MySQL allowed the “SET” or “ENUM” data types to act as placeholders for a join table that it would create automatically for you. Of course, that’s a new level of functionality — MySQL does not implicitly create permanent tables with any command. But it would be neat.
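For contrast, here is a sketch of the two approaches as they stand today: a multivalued SET column versus the join table you currently have to create and maintain by hand. The table and column names are made up:

CREATE TABLE article (
id INT NOT NULL PRIMARY KEY,
tags SET('mysql','access','replication') NOT NULL
);

CREATE TABLE tag (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(20) NOT NULL
);

CREATE TABLE article_tag (
article_id INT NOT NULL,
tag_id INT NOT NULL,
PRIMARY KEY (article_id, tag_id)
);

The wish is that declaring something like the SET column would implicitly produce the tag/article_tag pair for you.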

Top 10 Largest “Databases”

Thanx to Rich McIver for passing along this link:

http://www.businessintelligencelowdown.com/2007/02/top_10_largest_.html

I’m amused mostly because the article uses “database” interchangeably with “data storage” — many of the sites have “digital documents” included in their count, and YouTube is on the list entirely because of the amount of space its videos take up. But is all this stuff stored in databases? I do not think so. Anyone know for sure?

Women in Open Source

I stumbled across this on the fabulous http://www.everythingsysadmin.com :

from:  http://www.socallinuxexpo.org/wios07/

The Southern California Linux Expo (SCALE) will host a Women in Open Source Event as part of their upcoming 2007 conference, SCALE 5x.

The focus of this event is on the women in the open source and free software communities. The goal of this event is to encourage women to use technology, open source and free software, and to explore the obstacles that women face in breaking into the technology industry. The Women in Open Source event will be held on February 9, 2007 at the Los Angeles Airport Westin Hotel.

Having worked in both fields, I have seen the dearth of women in system administration as well as database administration. Network administration has fewer women than either.

I have noted that there are more women in database administration than system administration. The MySQL Conference I attended last year had the biggest percentage of women I’ve ever seen at a technical conference. Checking out the names at http://www.planetmysql.org, the first name there is female (mine), and then a few dozen names go by, most of them clearly identifiable as male. There are only a few names I can’t place a face/picture to, and to which I cannot assign a gender.

I’m not denying that barriers and prejudices exist. In fact, one of the reasons I opted to get a Master’s Degree in Computer Science is that I have no idea who’s looking at me and, consciously or subconsciously, thinking, “She’s a woman, she’s not as good as a man.”

Certainly, there are the usual barriers to entry in the scientific world.  Is open source any different from any other type of “boys’ club”?  My partner is a sleight-of-hand card magician, and he laments that there are few women in the field.  He also laments that he does not see women interested at all — it’s not that they’re starting and being turned away, or getting discouraged, it’s that they’re not even starting.  How would we know if we reach that point in the open source world? 

I have not met with any barriers to my career.  Then again, I grew up with 2 brothers (my sister is 7 years older, so I never really “grew up” with her), and adapted to “living in a world of boys/men” a long time ago.  Mostly it’s just grabbing the bull by the horns and just doing.

My issue with the lack of women in open source is that I wonder how far we can get into it without bringing up the stereotypes. I am not a shy, quiet woman. I am not ladylike in many ways. Many “qualities” that American society teaches its girls and women — don’t interrupt, lack self-confidence, looks are your most important asset so don’t seem too intelligent — are the barriers to entry I’ve seen, though not encountered. That, and the struggle between family and career. I have seen more and more men struggle with this as well, so it’s not as much of a “female” issue as it has been in the past.

Professionally, I make sure to wear my glasses on a job interview or important meeting, and dress conservatively when I do (I’ve fallen back to wearing glasses all the time because I’ve been too lazy to get more contact lenses).  I actually don’t want to look too pretty on a job interview, and consciously think of this.  So I’m definitely aware of the discrimination that exists, but so far I have been lucky enough to not encounter it, or at least I’m not aware of encountering it.

It’s my belief that women make less than men for the same job because they are less likely to negotiate and hold out for more money, and more likely to take one of the first offers — “good women” don’t “rock the boat”, after all.  (tongue in cheek, of course)

As someone who is rather type A (see http://en.wikipedia.org/wiki/Type_A_personality), I know there are colleagues and co-workers who find me to be too strict, which translates to “she’s a bitch.”  For men, this turns into “he’s too strict and type A.”  The double-standard is annoying, though not something I particularly worry about, since it has not gotten in the way of advancement or recognition, and has not targeted me for anything negative in particular (that I know about).

So, what’s the point of this?  I think it’s good to teach girls and women the skills they need to make it in a male-dominated field.  But I feel it should be taught and instilled at an early age, to both girls AND boys, as “this is the way American/Western society works, and these are ways folks are successful.”

I just feel as though singling out women, accurate as the disparity is, says, “we’re going to put you in the special room because you haven’t learned something you should have,” and is therefore somewhat degrading. Not a lot degrading, but somewhat.

CHAR() vs. VARCHAR()

So, a little gotcha:

The CHAR() and VARCHAR() types are different types. MySQL silently converts CHAR() columns (longer than three characters) to VARCHAR() when creating a table that contains at least one variable-length column.

http://dev.mysql.com/doc/refman/5.0/en/silent-column-changes.html

If any column in a table has a variable length, the entire row becomes variable-length as a result. Therefore, if a table contains any variable-length columns (VARCHAR, TEXT, or BLOB), all CHAR columns longer than three characters are changed to VARCHAR columns. This does not affect how you use the columns in any way; in MySQL, VARCHAR is just a different way to store characters. MySQL performs this conversion because it saves space and makes table operations faster.

However, that’s not entirely accurate, because according to the manual page at http://dev.mysql.com/doc/refman/5.0/en/char.html:

As of MySQL 5.0.3, trailing spaces are retained when values are stored and retrieved, in conformance with standard SQL. Before MySQL 5.0.3, trailing spaces are removed from values when they are stored into a VARCHAR column; this means that the spaces also are absent from retrieved values.

If you have a field such as name and require it to not be blank, you probably have some function testing it before it goes into the database. However, most languages are perfectly happy to say that “ ” isn’t blank. When it gets put into the database, however, it becomes blank if your column is a VARCHAR (on a server older than 5.0.3, where trailing spaces are stripped). This means folks may be able to get past your requirement of a non-blank field and actually store a blank value in the database (as opposed to a space or series of spaces).
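A quick sketch to check what your server actually does; the table and column names are made up, and the expected results follow from the manual passages quoted above:

CREATE TABLE t (
c CHAR(5) NOT NULL,
v VARCHAR(10) NOT NULL
);

SHOW CREATE TABLE t\G

On a server that performs the silent conversion, c shows up as varchar(5). Then:

INSERT INTO t VALUES ('x', ' ');
SELECT LENGTH(v) FROM t;

Before 5.0.3 this returns 0 (the all-space value was stripped to an empty string); on 5.0.3 and up it returns 1 (the space is retained).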

User Group Video Up, and Video Camera Review

Download the video at:

http://technocation.org/videos/original/BostonMySQLJanUserGroupBrianAkerLg.wmv
– 520 kbps, 320 x 240, 354M. Larger file, better quality; you can see the slides and hear everything clearly.

http://technocation.org/videos/original/BostonMySQLJanUserGroupBrianAkerSm.wmv – 45 kbps, 320 x 120, 29M. Small size, low quality, but you can still see the slides and hear everything.

Technocation, Inc. received a donation of a Sony Handycam DCR SR80 (http://tinyurl.com/yvyfam), an extra-long battery, and a microphone (a proprietary Sony model that goes with the camera).

In a short sentence: I am impressed. The sound quality (on the large version) is almost exactly what I heard. Granted, I have some hearing loss, but I forgot to bring the microphone, and you can still hear audience questions very well. The video quality is great, too. The hard disk is perfect, because files can be copied over or burned directly to DVD. It records in MPEG-4 format.

The 1-hour-38-minute talk took up less than 6 GB of space raw (I forget how much exactly, but the camera cuts the files into 2 GB chunks, and there were 3). This gives at least 10 hours of recording time before needing to dump to disk. This is a very exciting prospect for the MySQL Conference and Expo at the end of April: I’ll be able to video a LOT.

Special thanks go to the User Group member (who may wish to remain anonymous, but I forgot his name anyway, so if he wishes to comment he can, or just e-mail me so I remember your name!) who talked to me about codecs and which programs to use, because they worked!

I was not quite ready for the start of the User Group, and had to run out to my car to get the tripod, so the first minute or so (until 1:25) is me setting up the tripod — I apologize for the movement.

You can see the “Night Shot” functionality early on, when I focus on Brian and turn it on. It does a great job, but loses a lot of color (1:53 until 1:59).

I was disappointed that when you connect the DC power supply, the video stops (so there are a few hops in there).

Brian puts up some slides, starts talking, and questions ensue. The slides were about MySQL’s internal architecture.

Some links:
MySQL and dual-master/circular replication
There’s a great article by Giuseppe Maxia at: http://www.onlamp.com/pub/a/onlamp/2006/04/20/advanced-mysql-replication.html

And a free chapter on replication from Jeremy Zawodny’s High Performance MySQL:
http://www.oreilly.com/catalog/hpmysql/chapter/ch07.pdf

Around 27:00 there is a reference to Jim Gray’s “Black Book”, which is entitled “Transaction Processing: Concepts and Techniques” and can be found here: http://tinyurl.com/2md3tb

http://forge.mysql.com/wiki/Top10SQLPerformanceTips

Caveat: DATEs and DATETIMEs

A small bug/feature, for which I created a new bug report at http://bugs.mysql.com/25706. Basically, the CURRENT_DATE() function seems to assign a time to dates, and that time is 00:00:00. In fact, all DATE values are actually DATETIMEs with a hidden time part of 00:00:00.

This interferes with queries that use the date as an actual date and expect the date to include everything up until 23:59:59 of that day. The easiest way to reproduce this:

SELECT IF(NOW() BETWEEN '2007-01-17' AND '2007-01-18','yes','no') AS test\G
test: no
1 row in set (0.00 sec)

In fact, the following query always returns “no”, unless it’s exactly midnight:

SELECT IF(NOW() BETWEEN CURRENT_DATE() - INTERVAL 1 DAY AND CURRENT_DATE(),'yes','no') AS test\G
test: no
1 row in set (0.00 sec)

This does not make logical sense — NOW() should always be between “yesterday” and “today”. It’s one thing to be cautious about comparing DATEs and DATETIMEs to each other; however, this is a bit too much of a gotcha for me. And it’s not just the BETWEEN…AND syntax — that just converts to >= and <=. To test this:

SELECT IF(NOW() >= CURRENT_DATE() - INTERVAL 1 DAY AND NOW() <= CURRENT_DATE(),'yes','no') AS test\G
test: no
1 row in set (0.00 sec)

Again, NOW() should fall between “yesterday” and “today”.

There is a workaround, although a poor one at best:

SELECT IF(DATE(NOW()) BETWEEN CURRENT_DATE() AND CURRENT_DATE(),'yes','no') AS test\G
test: yes
1 row in set (0.00 sec)

I found this with both MySQL 4.1.19-standard-log and 5.0.19-standard-log.

Now (thanx to Roland), this is because when you cast a DATE to a DATETIME, it gets a time of 00:00:00 put in. And when you compare a DATE and a DATETIME, MySQL implicitly casts the DATE to a DATETIME. It’s a wonderful gotcha, and I understand why it is, but I do not like it.
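Given that cast, a cleaner workaround than wrapping NOW() in DATE() is to make the upper bound exclusive, so the implicit midnight lands on the far side of the range. A sketch:

SELECT IF(NOW() >= CURRENT_DATE() - INTERVAL 1 DAY AND NOW() < CURRENT_DATE() + INTERVAL 1 DAY,'yes','no') AS test\G
test: yes
1 row in set (0.00 sec)

The same shape works for report queries: dt >= '2007-01-17' AND dt < '2007-01-19' catches everything on the 17th and 18th, including 23:59:59.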

It also means that I have to go look at a whole bunch of reports…..

Oracle also works this way, and I’m guessing all db’s do, but I’d like confirmation if folks have any. What a pain.

MySQL Queues, part II — groups of queues

I believe this is a huge optimization for a heavily implemented Web 2.0 idea.

This article makes simple work of groups of queues. An example of this would be “the most recent 10 people to view an article,” so each article has a queue of up to 10 items in it. This method eliminates the need for multiple SQL statements or using TRIGGERS to check to see if the queue is full.

I bow down to Baron Schwartz, aka Xaprb, for his article on how to implement a queue in SQL:

http://www.xaprb.com/blog/2007/01/11/how-to-implement-a-queue-in-sql/

I am very excited because this also works for groups of objects, and we’re about to implement something at work that needs this idea. The idea of “the most recent x things” or “the top x things” is huge, especially in social networking, and probably one of the most often sought after features.

The biggest issue is that in order to display, say, the most recent posts, a query has to find the time of all the posts and get only the most recent 10. This can be made easy by the logic that the 10 most recent posts are the last 10 rows in the table. Any extra logic is added on top of that, as in “the last 10 rows in the table that are viewable by this guest/login.”

What if you want to track the last 10 people to view the post? Aha, this gets trickier. Convention would say that when a person views a post, you run an SQL transaction that adds the information (person x viewed post y at time z, plus any other info such as browser type, IP, etc.) and, if there are more than 10 entries for that post, deletes the oldest ones until only 10 remain. This could be done via application code or via triggers in MySQL 5.0 and up, as sketched below.
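That conventional flow looks something like this; post_views, its columns, and the literal values are all made up for illustration:

INSERT INTO post_views (post_id, viewer, viewed_at)
VALUES (42, 'alice', NOW());

SELECT COUNT(*) FROM post_views WHERE post_id = 42;

DELETE FROM post_views WHERE post_id = 42
ORDER BY viewed_at ASC LIMIT 1;

The DELETE is repeated (or given a computed LIMIT) until only 10 rows remain for that post.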

However, both those methods use multiple SQL queries, and in the case that an article has been viewed fewer than 10 times, the queries are unnecessary. And given each article has a different popularity — some are viewed lots more than others — running multiple queries ends up being a waste of cycles for articles whose last 10 viewers change infrequently.

These commands were tested on MySQL 4.1.19-standard-log. I use REPLACE INTO because it’s shorter than INSERT…ON DUPLICATE KEY UPDATE, and yes, those aren’t standard SQL.

Let’s say you have a New Year’s Resolution to eat 5 servings of fruits and 5 servings of vegetables per day. The only thing that changes from Baron’s example is that we add a group field (called ‘kind’). The “fruit” field was renamed “food” and will still contain the name of the edible item.

As Baron does, I will use a MySQL-specific command. However, he used INSERT...ON DUPLICATE KEY UPDATE and I will use REPLACE, as its syntax is shorter.

use test;
CREATE TABLE q (
id int NOT NULL,
modulo int NOT NULL,
kind char(1) NOT NULL,
food varchar(10) NOT NULL,
PRIMARY KEY(id,kind),
UNIQUE KEY(modulo,kind)
);

The basic statement is below — I’ve added AS clauses to make the variables more clear. The modulus is 5 in this case, but in the article scenario above it would be 10. The “kind” is either “f” or “v”; these are your groups of queues. Here they stand for “fruits” and “vegetables”, but they might be numbers referring to articles. The “food” is the food eaten, but in the article scenario it would be the username or user id of the customer viewing the article.

REPLACE INTO q (id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1) AS id,
(COALESCE(MAX(id), -1) + 1) MOD 5 AS modulo,
'f' AS kind,
'apple' AS food
FROM q WHERE kind='f';

mysql> SELECT * FROM q order by kind,id;

id modulo kind food
0 0 f apple

As expected, 1 "fruit" row.

mysql> REPLACE INTO q(id, modulo, kind, food)
-> SELECT
-> (COALESCE(MAX(id), -1) + 1),
-> (COALESCE(MAX(id), -1) + 1) MOD 5,
-> 'f',
-> 'orange'
-> FROM q WHERE kind='f';
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0

mysql> SELECT * FROM q order by kind,id;

id modulo kind food
0 0 f apple
1 1 f orange

As expected, 2 "fruit" rows.


mysql> REPLACE INTO q(id, modulo, kind, food)
-> SELECT
-> (COALESCE(MAX(id), -1) + 1),
-> (COALESCE(MAX(id), -1) + 1) MOD 5,
-> 'v',
-> 'okra'
-> FROM q WHERE kind='v';
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0

mysql> SELECT * FROM q order by kind,id;

id modulo kind food
0 0 f apple
1 1 f orange
0 0 v okra

As expected, 2 "fruit" rows and 1 "vegetable" row. Now, let's quickly populate the fields so the "fruit" group reaches its maximum of 5.

REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'v',
'squash'
FROM q WHERE kind='v';

REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'peach'
FROM q WHERE kind='f';

REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'cherries'
FROM q WHERE kind='f';

REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'pear'
FROM q WHERE kind='f';

REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'v',
'celery'
FROM q WHERE kind='v';

SELECT * FROM q order by kind,id;

id modulo kind food
0 0 f apple
1 1 f orange
2 2 f peach
3 3 f cherries
4 4 f pear
0 0 v okra
1 1 v squash
2 2 v celery

We have 5 values in the “fruit” group and 3 values in the “veggie” group. Now let’s see what happens when another fruit is added:

REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'banana'
FROM q WHERE kind='f';
Query OK, 2 rows affected (0.00 sec)
Records: 1 Duplicates: 1 Warnings: 0

Note that a duplicate has been found! This is because the modulo wrapped around. The id of “banana” is 5, and 5 modulo 5 = 0 – the same as 0 modulo 5, which was the modulo value previously taken by “apple”. So “apple” is pushed off the end of the queue.

SELECT * FROM q order by kind,id;

id modulo kind food
1 1 f orange
2 2 f peach
3 3 f cherries
4 4 f pear
5 0 f banana
0 0 v okra
1 1 v squash
2 2 v celery

To find the current list of all fruits, with the most recent fruit first, run:

SELECT * FROM q WHERE kind='f' ORDER BY id DESC;

id modulo kind food
5 0 f banana
4 4 f pear
3 3 f cherries
2 2 f peach
1 1 f orange

Let’s get back to the example of page views, though. We probably care about when the pages were viewed, so let’s add a timestamp:

ALTER TABLE q ADD COLUMN fed TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;

I ran the queries again, with some delays, so the timestamps wouldn’t all be the same.

SELECT * FROM q order by kind,id;

id modulo kind food fed
1 1 f orange 2007-01-15 14:48:25
2 2 f peach 2007-01-15 14:48:28
3 3 f cherries 2007-01-15 14:48:28
4 4 f pear 2007-01-15 14:48:31
5 0 f banana 2007-01-15 14:48:34
1 1 v squash 2007-01-15 14:48:28
2 2 v celery 2007-01-15 14:48:31
3 3 v beet 2007-01-15 14:48:31
4 4 v spinach 2007-01-15 14:48:34
5 0 v cucumber 2007-01-15 14:48:34

Or, what the query would be in a real system — find all fruits eaten and sort by time, most recent first:

SELECT food,fed FROM q WHERE kind='f' ORDER BY fed DESC;

banana 2007-01-15 14:48:34
pear 2007-01-15 14:48:31
peach 2007-01-15 14:48:28
cherries 2007-01-15 14:48:28
orange 2007-01-15 14:48:25

edit:
One edge case to be aware of — reaching the limit of the id field.

If your application does 100 of these per second, an UNSIGNED INT will last 1.36 years — not so great. You should use an UNSIGNED INT because you’re never going to have a negative number, and a signed int would only last just over 8 months if there were 100 REPLACEs per second.

(60 seconds/min, 60 min/hr, 24 hrs/day, 365 days/yr)
4294967295/60/60/24/365/100=1.3619251950

So, for scaling/known high traffic, use a BIGINT. However, in this case, do NOT use UNSIGNED, as all MySQL arithmetic is done with signed bigints or doubles. Not that it matters in this case; at 100 REPLACEs per second, you will wrap at 2.9 billion years:

9223372036854775807/60/60/24/365/100=2,924,712,086.7753601074

Let’s say your system does 10,000 of these REPLACEs per second, for eternity (our main database system, where we’re about to use this, averages 6,000 qps — not all writes, but it’s a good figure to use for our own numbers) — move the decimal point a few places over and you’re down to running out of numbers in 29 million years.

That’s an OK limit for us. 🙂
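Since the statement is identical for every queue, it parameterizes nicely. Here is a sketch using MySQL user variables; @queue_size, @kind and @item are names I made up, and in application code you would use bind parameters in the same spots:

SET @queue_size := 5;
SET @kind := 'f';
SET @item := 'banana';

REPLACE INTO q (id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD @queue_size,
@kind,
@item
FROM q WHERE kind=@kind;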

MySQL GUIs — Navicat, MySQL Query Browser and phpMyAdmin

In the past few weeks I have been experimenting to see if a GUI for MySQL would make my life as a DBA any easier. I looked at:

phpMyAdmin (available at http://www.phpmyadmin.net)

Navicat (available at http://www.navicat.com)

MySQL Query Browser and MySQL Administrator (available at http://dev.mysql.com/downloads/gui-tools/5.0.html)

  • The Basics
  • Connecting
  • Navicat and MySQL Query Browser both allow connecting and saving connections for future use. Both allow saving the passwords for the database connection, although you have to specifically tell MySQL Query Browser to do so. MySQL Query Browser also allows you to choose what format you’d like to save the passwords in — plaintext, obscured or OS-specific.

    MySQL Query Browser lets you specify a default schema, and Navicat allows you to test a connection before saving.

    Navicat allows for all sorts of different proxies and options: SSH tunnel, HTTP tunnel, HTTP proxy, SSL and compression. Navicat allows saving passwords, for the proxies and tunnels as well as the database user.

  • Querying
  • Navicat has a “Design Query” tab as well as an “Edit Query” tab. In the “Design Query” tab, a user can design their query in the SELECT … FROM … WHERE … GROUP BY … HAVING … ORDER BY … LIMIT syntax. Everything is point and click, with menus and checkboxes to select tables, fields, functions, etc. This is useful for a lazy DBA or someone not familiar with the tables. I’ve found that it’s easier for me to type in the query, although being able to click a check box and scroll a menu for the correct field name is useful. You can even change the indentation if you want.

    It’s smart enough to take what you type in the query editor and translate it to the design editor, so if you want to do a hybrid of typing and selecting, you can. And the query editor tab shows syntax highlighting — strings in red, numbers in green, and MySQL commands in blue.

    Navicat allows you to save and load queries. Whenever you run a query, it shows you the EXPLAIN output as well, which is a win over the command line, where you’d have to type it out first. There’s no button to run just an EXPLAIN, which would be nice, but you can do it the old-fashioned way and type “EXPLAIN” at the beginning.

    Results can be exported in numerous different ways (see screenshot), with different delimiters and fields, and the option to include the field headers — the latter is something that SELECT…INTO OUTFILE cannot do.

    Three buttons allow you to see large text, hex and images properly — “memo”, “hex”, and “image”. These toggle a frame at the bottom (see screenshot) which show the different fields in their entirety — this is very handy for images stored in the database.

    I found Navicat extremely useful for giving database access to our Quality Control team, which deals with creating and enforcing data policies. The team was always asking me for one-off reports (ie, “can you send me a list of usernames of all users who haven’t logged in for 6 months?”), and they would then take that information and make an Access database of it. So they knew some SQL before using Navicat. I set them up with a proxy, saving the passwords so I did not have to create separate logins on the proxy or database servers. Now, they come to me and ask questions like, “Can you mark all users who haven’t logged in for 6 months for deletion?” — I gave them read-only access.

    Basically, for Navicat you really have to know SQL to do anything complex. But it’s great for folks just starting out, who want to learn more SQL, especially in a heterogeneous database environment, or somewhere they want centralized access.

  • Reporting/Scripting
  • The report feature in Navicat has FileMaker-style reports — great for a company that does not need an expensive tool like Crystal Reports but also does not want to build all of its reports in-house using a programming language. Reports in Navicat can be formatted to suit periodic reporting requirements, with nice-looking output that upper management will appreciate for both the information and the look.

    Scripts in Navicat are very lacking, in that you can only run one MySQL command at a time. You can save multiple queries together, but you cannot “run a script” with Navicat (at least, not that I found).

  • Data Manipulation (DML)
  • Both Navicat and MySQL Query Browser allow you to write DML in the query editor. Navicat does not give you syntax help for UPDATE or DELETE. MySQL Query Browser has links in the bottom right corner that let you see the commands and then click through to offline manual pages. This “Inline Help” is very valuable for DBAs who know approximately what they need but just need a little help with syntax.

    Both Navicat and MySQL Query Browser allow you to double-click on a result and change it, although with MySQL Query Browser you need to put the result set into edit mode first.

  • Data Definition (DDL, Creating and Editing Schemas)
  • Navicat allows for changing table types, fields, indexes, foreign keys, character sets, collations, and table comments easily.

  • GRANTs
  • Navicat allows easy, clickable GRANT access. There’s not much to say about this: it’s full-featured point-and-click, including complex user features such as max_queries_per_hour. Again, you need to know that % is a wildcard to fill in the blanks correctly; a wizard to do that for you would be nice (see the sketch after this list). Then again, it’s not a tutorial or a replacement for knowledge, just a tool.

    MySQL Query Browser does not do user administration, it leaves that to the MySQL Administrator GUI.

  • Advanced
  • Navicat allows the creation of stored procedures and views. Views have a nice “builder” like the table builder, which is point and click, but the stored procedure editor does not walk you through anything.

    With MySQL Query Browser you can right-click on a table to create a view or stored procedure, and then edit the template, which is basically a script, and you can do the same dragging and dropping that you can while building a query. Both Navicat and MySQL Query Browser allow you to see and edit stored procedures easily.

    For both tools, it would be neat if there were easy “insert cursor” features, or some kind of wizard to help create stored procedures and/or common functions.
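On the GRANT wildcards mentioned above, here is a minimal sketch of what those point-and-click screens generate underneath; the user, host pattern, database name and password are all made up:

GRANT SELECT ON reports.* TO 'qc_user'@'10.0.%' IDENTIFIED BY 'secret';
SHOW GRANTS FOR 'qc_user'@'10.0.%';

The % in the host pattern matches anything, so this grant covers every client machine in 10.0.x.x.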

Google’s Open Source MySQL Tools

Chip Turner, been @ Google approximately 1.5 years.

http://code.google.com/p/google-mysql-tools

Right now you have to click “source” and then browse the subversion tree’s trunk.

These are tools written at Google that are frequently used, and perhaps useful for other folks.

So what’s there, how does Google use the code, and how can other folks use it?

mypgrep shows and kills processes across many database servers — from certain users or IP addresses, queries that run frequently, long-running queries, locked queries, connectionless queries — anything you’d use SHOW PROCESSLIST or SHOW FULL PROCESSLIST for.

Uses threads to talk to multiple MySQL databases in parallel to try to be fast, but it’s pretty standard Python code.

Connectionless queries are those whose client connection has been killed, but MySQL doesn’t check for the connection until it tries to send rows back — I know I’ve had this issue a lot: when I kill a query but it still runs, I’d like it to just die.

You can tell it to include or exclude replication queries.
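Under the hood this automates what you would otherwise do by hand on each server, something like:

SHOW FULL PROCESSLIST;
KILL 12345;

where 12345 is a connection id taken from the processlist output (a made-up id here).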

compact_innodb does a mysqldump and reload, with checks to make sure nothing gets corrupted. This defragments InnoDB tables (and others). It is done offline.

It dumps all the InnoDB tables, then drops them, stops mysqld, deletes the InnoDB data file and logs, and then restarts mysqld and reloads the dumps. They’ve found the defragmentation works well.

Can do only the dump or only the restore if you want.

It does your largest tables first, to maximize I/O bandwidth.

Why not ALTER TABLE? A rebuild takes about the same amount of time, while a dump and restore lets you move data to and from different machines. It also seems to be more reliable: you can kill the process without worrying about data corruption.

Why not OPTIMIZE? If you use one InnoDB file per table with autoextend, OPTIMIZE can reclaim the space lost to fragmentation. But if you don’t use one InnoDB data file per table, and just have a hard limit, it’s not going to reclaim space.
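For reference, the in-server alternatives being weighed here are one-liners; the table name is made up:

OPTIMIZE TABLE my_big_table;
ALTER TABLE my_big_table ENGINE=InnoDB;

For InnoDB, OPTIMIZE maps to a rebuild much like the “null ALTER” above. With innodb_file_per_table the rebuild shrinks the .ibd file; with a single shared tablespace the freed space is reused internally but never returned to the OS, hence the dump-and-reload approach.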

When is it time to defragment? One answer: watch disk busy time vs. queries per second — when the disks start to take a lot longer to seek at the same number of queries per second, it’s time to defragment.

compat_* are the compatibility libraries.

Interestingly, making the source open was easy — Google gave approval, so the work was mostly removing comments that were to-do lists for folks that no longer work there.

Other useful tools?
One way to speed up replication is to start some processes that do the physical I/O the replication thread is about to do, so the data to be updated is already in the cache. If your slave has enough disk drives and iostat shows the disks are only 10% busy, it probably means you’re I/O bound by the serialized nature of replication, and this method can decrease replication delay. If your disks are 100% busy AND you’re I/O bound, it won’t help. This will be on the page linked above sometime next week.

Basically, it reads the binary logs, trying not to get too far ahead, and issues the queries so the disk can cache the right data.
