Real Password Security

With recent posts by Frank Mash and Stewart Smith about password protection, I am reminded of all the privacy vs. security arguments we have going on in the United States. Basically, I see a somewhat similar situation — how much privacy folks give up for the sake of security is analogous to how much calculation, and how many hoops to jump through, we accept to ensure that data is secured properly.

On the one hand, the analogy falls apart, because encryption calculation times are much less of an “inconvenience” than an invasion of privacy, and thus the argument gets usurped. It’s just a function, or a few calculations, no big deal. We all use SSH instead of telnet, and hopefully SFTP instead of FTP, because plaintext passwords are bad.

As a retort, most folks do not use SSL-enabled MySQL, and some do not even use SSL-enabled http. Why? Because it’s slow! Well, we do not want anything slow! But the security is worth the slowness! What? You mean people will go to another web site if yours is too slow? But the competitor is not as secure!!!!! So the analogy works there.

The analogy also works when you consider how valuable the data is that you are attempting to lock up. Financial and health institutions need as high a level of encryption as possible for passwords, and any organization that stores a federal ID number should encrypt it. Nobody wants their privacy violated by having their health information exposed, their identity stolen, their bank accounts drained and their credit cards maxed out.

Many people worry about things that are basically public information. For instance, in the US, the bank account number and the routing number are both printed on every check. Electronic Fund Transfers (EFTs) require three numbers — the checking account number, the bank routing number, and the amount. Or at least, that is all I type in. Perhaps my account stores address information and it is checked against that, but I am not asked for my billing address, as I am with a credit card transaction.

Some people guard their bank account number with extreme privacy, but it is in plain sight on the bottom of every paper check written!

Similarly, does it matter if someone cracks my password in some applications? How much damage could someone do if they got my password to a newspaper site? What are they going to do, write a letter to the editor or a comment? You’re not liable if someone cracks your server and then uses it to port-scan government agencies. It’s just a pain when you have to wipe your OS and start over. But no valuable information is lost, just time and patience.

At work, sure, they could get their hands on trade secrets if they cracked my desktop, laptop, VPN, or email password.

What about a dating site? What about a community forum? Should my password on forums.mysql.com be stored as tightly as my password on mysqlcamp.org? What does it matter if either password is cracked? Sure, if they try that same password on PayPal, and I am dumb enough to use the same password for important data that I use for non-important data, my password will be stolen.

This is, to me, one of the greatest things about wikis. Sure, people have vandalized wikis, but it’s much more satisfying for folks to vandalize a site that’s not “open”. Someone is going to steal a new $3,000 bicycle that’s not locked up, but nobody is going to touch the old beater with a rusty frame, a missing front wheel, and a flat back tire.

And of course, your application probably falls somewhere in between “everyone wants it” and “nobody wants it”. One of the things I say over and over in the presentations I give is “think about it.” Think about the security you need. Do a risk analysis. If you want your data secure, write it on a piece of paper and have recipients swallow it after they’ve read it. For any other security method, think about the gaps, and think about what really matters.

As a user, think about the ramifications of your passwords as well. Many sites without “important” information will e-mail your password if you forget it. There it is, in plaintext for the world to intercept. I keep a few passwords at any given time (and change them every so often): “really secure” ones for financial institutions and such; “somewhat secure” ones for things like blogging sites; and “throwaway” ones for sites where the info is not important, and I would suffer very little if my password were “cracked”.

—————————-
As well, by highlighting the hashing functions (MD5() and SHA1()) in MySQL, both articles imply that applications should call those functions within MySQL. If an application is hashing passwords at all, it should be done as close to the user as possible. Client-side hashing, such as a JavaScript function, is much better, security-wise, than hashing inside MySQL. You want to hash the password *before* it goes over the network. If someone’s sniffing the network, then running SELECT nickname FROM myUser WHERE username='sheeri' and password=SHA1('kritzer'); is pointless — even if you salt the data. Someone can sniff the packet and find the plaintext ‘kritzer’ — either between the client’s browser and the web server, or between the web server and the database server.
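As a rough illustration of the difference (the hex digest below is just a placeholder, and this assumes the password column stores SHA-1 hex digests, as the query above implies):

# Plaintext crosses the network; anyone sniffing sees 'kritzer':
SELECT nickname FROM myUser
 WHERE username = 'sheeri' AND password = SHA1('kritzer');

# If the application hashes before sending, only the digest crosses the network:
SELECT nickname FROM myUser
 WHERE username = 'sheeri'
   AND password = '8b1c3a...';  # placeholder for the 40-character SHA-1 hex digest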

Stewart’s post did not mention that a JOIN is actually the best way to go — JOIN the words table with the passwords on the salted hash, and then you can possibly retrieve a password. As well, if I were a cracker, I wouldn’t care about using a VIEW, I’d just add a “saltedmd5” column to my table, index it, and then JOIN the tables (creating an index on the table field if need be). Because he was comparing md5 sums, not actually trying to compare passwords through an application, it means he had a backdoor to make database calls, so we could indeed assume a JOIN is possible.
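A rough sketch of the attack being described, assuming the salt is known (for instance, stored alongside the hashes) and using made-up table and column names:

# Precompute the salted hash for every dictionary word, index it, then JOIN:
ALTER TABLE words ADD COLUMN saltedmd5 CHAR(32), ADD INDEX (saltedmd5);
UPDATE words SET saltedmd5 = MD5(CONCAT('thesalt', word));

SELECT p.username, w.word
FROM passwords p JOIN words w ON p.saltedmd5 = w.saltedmd5;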

MySQLCamp, Here I Come!

On Wednesday night, I did some consulting, and it ended up taking twice as long as I thought it would. So I am rewarding myself by going to MySQLCamp!

Speaking of which, I updated the home page, explicitly noting that registration is free, and adding a section on travel information. I have no idea how housing is being organized, or if it is, and I am happy to take the lead on doing so.

The basics are that there are some good, cheap* 3-star hotels not too far away. Cheap = under $100; I even found some in the $60 price range! I would love to get a sense of what folks are doing for lodging, and if folks want, I can work on getting a group discount (some hotels will arrange one for a minimum of 10 rooms), arranging a suite for the “evening track”, etc.

Currently stating that you’re interested does not require a commitment to get a hotel room. If we have critical mass, I can see what the options are, and folks can reserve a room for themselves or, if it’s easier/cheaper, I can make the reservations for folks.

Alternatively, if someone speaks up and says, “Silly Sheeri! It’s all taken care of!” please point me in the right direction.

Public transit information would be great on the travel page, particularly from the airport to the venue. Also, knowing what time camp starts on Friday and ends on Monday would be great… sure, they’re approximate…

* the hotels are good and cheap, therefore they cannot be fast.

Awkward JDBC API and MySQL User Group Meeting

Life has been super busy, but I have plenty of posting to catch up on. Fear not, there will be more content soon (after Monday, hopefully); I am working on editing a very relevant book, and I hope to be able to share my excitement after I finish.

Also on Monday is the next Boston MySQL User Group, which will go through how to do capacity planning in MySQL with very little pain and effort. In the beginning we will have 10 minutes for user questions, so we can all benefit from each other’s knowledge. I already have a user with a great question!

We have been rebuilding our product all summer, with a deadline of releasing it in the next two months. Our lead developer had put a surrogate key in our new schema about a month ago. He said he needed the surrogate key because “the ORM requires it.” I just accepted it.

My mistake was that I made an assumption. The table already had a primary key, but it was a composite key. I assumed that the ORM required a unique key to be one field, and thus I assumed he combined the 2 values in the primary key to get the unique key.

I was wrong. In adding the final subsystems into the schema this week, I noticed that the surrogate key was an auto-increment field. I also noticed he put it in every single table he dealt with. We had hours of meetings about schema, and this was NEVER put in as a requirement. Thus, today we had over three hours of agonizing discussions, including a back-and-forth of “If it’s going into the database I need to understand why,” and the lead developer saying “No you don’t.”

I severely wanted to say “If I don’t understand it, it does not go in the database,” but resisted. I asked him to send me the ORM so I could understand why it required the field. At first he said he would, and then kept talking to me about why I don’t need to understand the field; it didn’t add more overhead, it didn’t change the way the data relate to each other, etc.

I need to understand this because there may be other, similar requirements in the future. Or perhaps I’ll find a better way to do it (maybe a VIEW would work). Perhaps I’ll find other places where other fields need to be added. He finally explained that the JDBC API he was using was awkward — it needs to retrieve basically the row number of any row it’s looking at, and if it deletes or changes a row, it uses that row number as the key to find the row.

Aha! That makes sense. However, why do the row numbers need to be in the database? Can’t it just retrieve the rows and put a row number in its own copy? Apparently not. I cannot imagine that a mature technology would require something like that. It’s not that difficult to do. I said this, and the lead developer was insanely frustrated by it.

So I said, “Are you using Connector/J?” He was confused, but asked, “Is that JDBC?”

“Yes,” I replied. He said, “Oh, then yes, we’re using it.”

“I don’t think so. If the interface is awkward, you’re not using Connector/J.”

He left my office. So I typed “Connector/J” into my MySQL Community Toolbar (I love it!) and found the following on http://www.mysql.com/products/connector/j/:

New features from the JDBC-3.0 API in the latest production version of MySQL Connector/J include getGeneratedKeys which allows users to retrieve auto-increment fields in a non-database-specific way. Auto-increment fields now work with object-relational mapping tools, as well as Enterprise Java Beans (EJB) servers with Container Managed Persistence (CMP) that support JDBC-3.0.

Hrm….retrieve auto-increment fields in a non-database-specific way? I think that solves our problem!!!
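In MySQL terms, the generated key for an AUTO_INCREMENT column is what LAST_INSERT_ID() returns; my understanding is that getGeneratedKeys() simply exposes that same value in a portable, non-database-specific way. A minimal sketch of the MySQL-specific equivalent, using a made-up table:

# Hypothetical table with an auto-increment surrogate key
CREATE TABLE widget (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(50)
);
INSERT INTO widget (name) VALUES ('sprocket');
SELECT LAST_INSERT_ID();  # the value getGeneratedKeys() would hand back to the application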

[EDIT: I am, apparently, wrong… but I cannot imagine that anyone using JDBC specifies an auto-increment field for EVERY SINGLE TABLE their application will touch. Do people actually do this?!?!?]

Building “Sane” Query Interfaces

OSCON 2006 workshop

Building “Sane” Query Interfaces, by Mark Newsome (e-mail: newsome@acm.org)

Insane query interfaces offer no help, no format hints, no ranges, and no indication of how many results will be returned.

Better — pull-downs with a descriptive default like “select state”. Find the min and max date from the db. Give a count of how many items are in the pulldown menu. The UI automatically reverts to a popup button when the pulldown gets too long.

A “Refine” button can be pressed when the form is filled, and it will get a count, so you can “preview” your results. When the “Refine” button is pressed, it refines EVERYTHING — ie, the counts and ranges for the other items. Very smart! (See the song database demo below: when you enter a lyrics search and click Refine, it repopulates the pulldown menus with only the artists and titles that have the lyrics you requested.) Also, if there’s only one choice after refining a pulldown,

Also, once the results come back, clicking a header field will sort the data by that field.

This is a great idea, even for things that aren’t at heart a db query — for instance, an interface to a “broadcast e-mail” could instruct a user to select who to send to (whole list, one location, one interest, whatever), and you can

Framework:

QueryParameterBean: make a query parameter “bean” with getters and setters for parameters. If your form has 20 parameters, just set the ones that the user set. It also includes methods for determining whether a field is “empty” and whether “empty” means 0, NULL, or '' (the empty string).
Sample classes — getCompanyNameCount(), getCompanyName(), getOrderCount(), getOrders() — they all call the “master query method”.

QueryMethod(selectClause,groupClause,sortClause,parameterBean) — builds dynamic query based on the parameters that were set — if you’ve only set the company name count, you’ll set the select query to count(*) or whatever. Empty parameters don’t participate in the query.

Results are wrapped into a ResultsWrapper, so the resultSet can be freed.

How to populate a pull-down on the query form — set parameters from form, call the getFoo() functions, then QueryMethod to build the query, and then call a ResultsFormatter to generate an HTML pulldown (or input field/popup button combo, depending on field size (settable)).
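To make the “master query method” idea concrete, here is a guess at the kind of SQL it would emit, assuming a hypothetical orders table and a form where only the company name was filled in (empty parameters simply drop out of the WHERE clause):

# Query built for getCompanyNameCount(): only the count is selected
SELECT COUNT(*) FROM orders WHERE company_name = 'Acme';

# Query built for getOrders(): same WHERE clause, full select list plus sort
SELECT * FROM orders WHERE company_name = 'Acme' ORDER BY order_date;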

But this is tedious — lots of coding, as opposed to a simple form.
It uses more DB resources for all those extra queries.
But databases are fast… also, you could create the forms once a day instead of dynamically every time the page is loaded.

Other things to do:
Validate the form with JavaScript to avoid wasteful roundtrips to the server.
Use JS popups to assist the user (ie, date selection).
It is useful to show the cardinality of each item; ie, if there are 6 company names and 50,000 records, you might just want to pick a company name and then hit Refine to see if you have a manageable set. If not, you can go to another field — ie, “browsing by queries”.

Accept * wildcards, but convert to % for the db.
Convert text fields to upper or lower case for comparison (or use a case-insensitive search, which MySQL has — just don’t use BINARY).
Use a text box when there are “too many” values for a pulldown.
Also, use a trigger on insert to update values in a lookup table (ie, # of users: when a user is added, just increment a count), as in the sketch below. Or write a function that rebuilds this table periodically by doing a count(*), rather than on every insert.
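A minimal sketch of the trigger idea, assuming MySQL 5.0 or later and hypothetical users and user_counts tables:

# Keep a per-location user count up to date as rows are inserted
CREATE TABLE user_counts (
  location CHAR(2) PRIMARY KEY,
  num_users INT NOT NULL DEFAULT 0
);

DELIMITER $$
CREATE TRIGGER users_after_insert AFTER INSERT ON users
FOR EACH ROW
BEGIN
  INSERT INTO user_counts (location, num_users) VALUES (NEW.location, 1)
    ON DUPLICATE KEY UPDATE num_users = num_users + 1;
END$$
DELIMITER ;

# The periodic-rebuild alternative:
# REPLACE INTO user_counts SELECT location, COUNT(*) FROM users GROUP BY location;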

Paged format for results — run the query, but only select the row ids (or the primary key with MySQL, since not every row has an externally visible ID). Cache the ids, and retrieve the rows for each page by id. (With MySQL you could use LIMIT, but you have to be careful about running the same query at different times — if a new row is inserted, your offset is slightly off.)

Alternatively you could use cursors for paged formats, but the cached-id approach is better.
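A sketch of both approaches against a hypothetical orders table (the cached ids and page size are made up):

# Cached-id approach: run the search once, keeping only the primary keys...
SELECT id FROM orders WHERE company_name = 'Acme' ORDER BY order_date;
# ...then fetch just the ids cached for the page being displayed:
SELECT * FROM orders WHERE id IN (101, 105, 112, 118) ORDER BY order_date;

# LIMIT approach (page 3 of 20-row pages); can drift if rows are inserted between requests:
SELECT * FROM orders WHERE company_name = 'Acme' ORDER BY order_date LIMIT 40, 20;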

Benefits:
You can drill down large datasets without leaving the form. Improves efficiency and reduces frustration.

The db does more work, but that’s its job, and it makes the overall user time lower (ie, db load is higher, but it means the db does work instead of the user).

http://www.capitolvelo.com/newsome/music.php has a technique like this for a song database, written in PHP. The code for that page.

Future work — automatically generate refinable and customizable query interfaces. Goals: SQL- and code-free for users; users can specify the query graphically (using an ER diagram or layout of the db), and the tool generates the interface.

Now here’s where MySQL could really come in handy. One of my HUGE pet peeves is I’ll create a database and then end up wanting to make a bunch of forms for it — search, edit, insert new, etc. You could write code to go through all the tables in one or more dbs (or in all dbs), do a PROCEDURE ANALYSE(), and create forms based on the info in there (ie, if ENUM is recommended, use a pull-down menu, if char or varchar is recommended, use a text box, etc).
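A quick sketch of what such a form generator might look at, using a hypothetical customers table (PROCEDURE ANALYSE() is the real MySQL extension; the Optimal_fieldtype column of its output is the interesting part):

# Ask MySQL to suggest optimal column types for every column in the table
SELECT * FROM customers PROCEDURE ANALYSE();

# If Optimal_fieldtype comes back as something like ENUM('MA','NH','RI') NOT NULL,
# the generator would emit a pull-down; for CHAR/VARCHAR suggestions, a text box.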

http://www.capitolvelo.com/newsome/index.html shows his writings.

(he’s done this in Java and PHP. He uses prepared statements in Java, which eliminates worries about quotes in data. If you’re building an SQL string, you have to worry about that).

Embedding a Database in the Browser

OSCON 2006 workshop report

Embedding a Database in the Browser

David Van Couvering
Database Technology Group, Sun Microsystems, Inc.

Why would you want to embed a database in the browser? Well, for one, the mobile user isn’t always connected. It also keeps personal data off the server (and in your control). And then there’s the fast local web cache — for more performance, store the data in the client’s browser.

What you need to make this work:

Cross-browser support and a feature-rich language (Java), embeddable, small footprint, standards-compliant, secure, robust = automatic crash recovery (people are always closing their browser).

So why use a database? Why not a cookie or other text?
A standard, portable API
ACID semantics, even if there are multiple threads
Flexible data model
Powerful query capabilities
Works with lots of existing tools, using existing skills; nothing new to learn.

Apache Derby fits into this — 100% Java Open Source relational DB — http://db.apache.org/derby — embeddable, small footprint, standards-compliant, secure, automatic crash recovery.

Not quite AJAX, more like LJAX — local JavaScript and XML; not asynchronous because you’re not syncing to the server. You could also use this technology to embed a web server in a browser.

He called these embedded things “applets”, which makes me cringe… ie, “let’s use servlets offline, using applets!” Didn’t we already try this?

Create the db, put the directory into the application’s JAR file, and then copy the JAR to the local filesystem — for which you need to sign the JAR and wrap the operation within a PrivilegedAction.

(he talked about security — if someone steals your laptop, won’t they access the info on your browser?)

It’s basically a geek hack. I’d wanted it to be “how to supercharge cookies”: using features already in most browsers to enable an embedded database, therefore making apps much faster. Sadly, that is not the case. A neat technology nonetheless.

Partial Outer Join

I was wracking my brain for a bit, and could not figure out how to do a partial outer join where I was limiting rows from one table. So, I went to post to the MySQL user list — in writing my post, I found the answer. This often happens. It’s not that I don’t have questions to ask the MySQL user list (or forums), it’s just that most of the time, while I’m writing up the question, I find my answer.

First, I simplified my question. Many people do this, but they leave out important information. The trick is to come up with a simplification that you test to make sure it breaks in the same way the more complex, “real” query breaks.

The problem is basically this:
I want an outer join of a static table with a data table, such that I get a report of how many rows in the data table match the static table — including “0” if there are none, but for a subset of the data in the data table. Consider a billing report of how many invoices were generated per date, for a specified time period.

This should be easy, right? It was not very intuitive for me.

In this system, I’m using MySQL 4.1.19-standard-log, so stored procedures aren’t an option. Also, I’d like to avoid a loop — the standard way of doing this (the “developer” way) is to get all the values in the static table, do a count of rows in the dynamic table for each row in the static table, and then display the results. However, I wanted to do it in very few queries, such that the number of queries did not depend on the size of the static data. I call this doing it the “DBA way”.

First, the setup:

use test;

CREATE TABLE `dynamic` (
`id` int(10) unsigned NOT NULL auto_increment,
`joinme` char(1) default NULL,
PRIMARY KEY (`id`));

CREATE TABLE `static`
( `joinme` char(1) default NULL,
`name` char(5) default NULL);

insert into static (joinme) VALUES('a'),('b'),('c'),('d'),
('e'),('f'),('g'),('h'),('i'),
('j'),('k'),('l'),('m'),('n'),
('o'),('p'),('q'),('r'),('s'),
('t'),('u'),('v'),('w'),('x'),
('y'),('z');

update static set name=repeat(joinme,5);

insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;
insert into dynamic (joinme) select joinme from static order by rand() limit 10;

First, a test to make sure everything’s OK:
# The following will give you the random distribution you have just made:
select count(id), joinme
from dynamic
group by joinme order by joinme;

# the result is <= 26 rows, depending on the random distribution.

# the following will show the distribution of all 26 names (aaaaa through zzzzz) represented in the dynamic table (well, all 26 joinme values, really, but that's irrelevant to the point)
SELECT s.name,COUNT(d.joinme)
FROM static s LEFT JOIN dynamic d USING (joinme)
GROUP BY name;

# and indeed, it gets the expected 26 rows, with 0 filled in for the count when there's no match in the data table.

So far, so good. Remember, I don’t want to join all of the entries in the dynamic table… because it could be millions of entries, and maybe I want to limit it to ids below a certain number. So what if I want the distribution of names given the first 15 ids?

Using:
SELECT s.name,COUNT(d.joinme)
FROM static s LEFT JOIN dynamic d USING (joinme)
WHERE d.id<15 group by name;

gives me the distribution, but only for the values that appear among those ids. So in one test, I get 13 rows, all with a count>0, instead of 26 rows, with some of them having a count of 0 and others having a count>0.

This breaks in the same way as my more complex query.

How can I limit the values from one table and outer join them to another, retaining everything? This seems like a very simple outer join, but because I want to put limits, it gets tricky, and the intuitive stuff just does not work. I do not know if “partial outer join” is the right term, but it’s what I’ll use.

# UNION removes duplicate rows, but only if all the values in the row match:
SELECT name,'' FROM static union SELECT s.name,COUNT(d.joinme) FROM static s LEFT JOIN dynamic d USING (joinme) where d.id<15 group by name order by name;
gives:
+-------+---+
| name | |
+-------+---+
| aaaaa | |
| aaaaa | 1 |
| bbbbb | |
| ccccc | |
| ccccc | 1 |
....

# similarly, using 0....:
SELECT name,0 FROM static union SELECT s.name,COUNT(d.joinme) FROM static s LEFT JOIN dynamic d USING (joinme) where d.id<15 group by name order by name;
+-------+---+
| name | 0 |
+-------+---+
| aaaaa | 0 |
| aaaaa | 1 |
| bbbbb | 0 |
| ccccc | 1 |
| ccccc | 0 |
.....

# Maybe a HAVING clause?
SELECT s.name,COUNT(d.joinme),d.id FROM static s LEFT JOIN dynamic d USING (joinme) group by name having d.id<15;
# does not work (and gives me an extra (and nonsensical) field, because you can't put something in a HAVING clause unless it's selected).

# putting a subquery (derived table) in the FROM clause (not a good idea, but I'd do it if it worked):
SELECT s.name,COUNT(d.joinme) FROM static s LEFT JOIN (SELECT joinme,id from dynamic where id<15) d USING (joinme) where d.id<15 GROUP BY name;


So what’s the answer? Well, the answer actually lies in the JOIN, which is a bit tricky. Basically, I was thinking of ON and USING as “what you join the tables on.” While that’s true, particularly for USING, it’s a bit different for ON.

The manual page at:

http://dev.mysql.com/doc/refman/4.1/en/join.html

does not particularly make things clear.

However, the first comment makes it clear:
The join_condition is used for winnowing the rows of the “right half” that you want to use for the cross, whereas the WHERE clause is used for winnowing the composite.

In other words, putting a condition in the WHERE clause limits the rows of the final result; putting it in the join condition limits which rows of the joined table are eligible to match, before the actual join, without throwing away the unmatched rows from the other table. This was not intuitive to me at all. However, in playing around with different possibilities to send to the list a message of “I’m frustrated, here’s everything I’ve tried!”, I ended up finding the answer, shown below.
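For the tables in this post, a minimal sketch of what that looks like, with the id limit moved into the join condition:

# All 26 names come back; dynamic rows with id >= 15 simply fail to match,
# so those names get a count of 0 instead of disappearing from the result.
SELECT s.name, COUNT(d.joinme)
FROM static s LEFT JOIN dynamic d ON s.joinme = d.joinme AND d.id < 15
GROUP BY s.name;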

I hope this is helpful to others as well…..

Those Who Can, Do

I finally got around to reading The Cathedral and the Bazaar — I’d known about the collection of essays since it was first published in 1999, and kept meaning to read it. Heck, that link goes to the entirety of the essays, free, online. I never did read them there, but I recently received a copy of the book, and have been reading it on my commute to work.

It’s very engaging, and still accurate, even now. Of course I think about MySQL’s growth and business model as I’m reading it.

One of the tenets that keeps appearing is that “Brooks’ law that adding more people adds more complexity and actually slows a project down does not apply to open source.” He gives a few reasons, mostly communication, but there’s one crucial point he misses.

Brooks’ law does not apply to open source because there are little or no startup or coming-up-to-speed costs. The first contact with a developer is usually “Here’s the patch I’m submitting.” Now, perhaps that’s included in the management overhead that the essays refer to; if so, it is completely unclear. I do not think this changes any parts of the arguments, nor any of the conclusions. But I do think it is important to note that candidates for modifying open source software are not only filtered on self-selection, they are filtered on their ability to actually do the work.

In a traditional job interview, there is very little “testing” done. Sure, there are often questions, some technical, others not; but interviewing is a completely different environment than actually doing the job. Communication may be poorer due to nerves or self-doubt in an interview, when it might not otherwise occur during the actual job. Very occasionally, the traditional closed source world will give a job candidate a task to do on their own, and the completed task and resultant communication are the merits on which hiring is based.

If closed source hiring worked the same way open source hiring works (whether for a volunteer position or a paid one), perhaps Brooks’ law would also not apply there. Why is this not done, then? Mostly because it’s assumed that a new employee or candidate would need their hand held, or that they cannot be allowed to see the code, because it’s closed source and the (or a) revenue stream.

At any rate, those who work and succeed on open source software, whether they garner a paycheck or not, are by definition good at what they do, because they’ve already succeeded. In a traditional job, the costs of finding someone new and the liability of letting someone go because they “aren’t producing” are high. Therefore, it’s much easier to coast in these positions, and there are some folks who do. I have seen some rather impressive resumes for some rather unimpressive job candidates/co-workers.

I have seen some pretty unimpressive open source software (heck, I’ve written some myself); but for the most part, it all works as described. It may lack features, be very inefficient, or be limited on which systems it runs, but it almost always works without hacking. It’s difficult to say the same about closed source; one only needs to look at PeopleSoft or Remedy to see that. I’m sure there are plenty of examples; those are two examples I’m familiar with.

At any rate, the point is, those who can, do. That’s pretty much the motto of open source. I have some co-workers who rib me for donating to the EFF in order to get my name in the MySQL Source Code, because “it’s open source. You can get in there much more easily by submitting a patch.” This is true, in theory. In reality, I haven’t written any C code nor looked at code more complicated than a shell script in 5 years. One of the projects in the pipeline is to delve into the source code, but I’m not there yet.

So instead of submitting patches, I run the Boston MySQL User Group. I write articles in this blog. I help others optimize queries and MySQL configurations. We all do what we can, and that’s what makes open source great.

So the point of this? Do not be intimidated or discouraged because others are successful. Use them as a model, work hard, do what you can do, and you will be successful, too.

MySQL Related Nagios Plugins

After seeing Frank’s script in “Finding out how far behind are slaves,” I figured I would post some Nagios plugins I wrote a while ago that we use at our company. They’re already up at http://www.nagiosexchange.org/, the third-party repository for Nagios plugins.

So I figured I’d point you to:
Replication Lag Time
and
InnoDB Free Space

I’ll note that backups on a slave server can cause replication to fall behind, for 2 reasons — mysqldump falls under “heavy SELECTs” as Frank has mentioned, but zipping up logs or general high load/high CPU usage on the server can also cause the database to fall behind, because less CPU time is given to MySQL.
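For reference, the heart of what these two checks look at is visible right from the mysql client. This is just a rough sketch of the queries involved, not the plugins themselves, which wrap this kind of output in Nagios warning/critical thresholds and exit codes ('some_innodb_table' below is a placeholder):

-- Replication lag: run on the slave and look at the Seconds_Behind_Master field.
SHOW SLAVE STATUS\G

-- InnoDB free space: with a shared tablespace, the free space is reported in the
-- Comment column, as something like "InnoDB free: 512000 kB".
SHOW TABLE STATUS LIKE 'some_innodb_table'\G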


Populating Images Into a Test Database

Giuseppe Maxia recently posted on Filling Test Tables Quickly. However, he only gave examples for numbers, words and dates. This article will deal with randomly inserting images into a database — particularly useful for testing images stored in the database versus the filesystem. This article was inspired by Giuseppe’s article.

[note: there will be a post soon with my presentation on storing images in MySQL versus the filesystem, and there will be ample opportunity for discussion then, especially since everyone making a claim will have tested their system, given this code.]

Getting images into the database is not easy. All the tutorials I have read have used a programming language to get images into a database. Using cat and similar Unix tools, I attempted to insert an image into a database using only the mysql command-line client. That failed — I would love to know if anyone has any tips on inserting an image into a database without using a programming language.
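(One thing that may be worth trying: MySQL’s LOAD_FILE() function reads a file that lives on the database server into a string, so something like the sketch below should work from the mysql client alone, assuming the account has the FILE privilege and the file is readable by the server. The path and imageNum value here are just examples.)

-- LOAD_FILE() reads a server-side file into a BLOB value; it needs the FILE
-- privilege, and the file must be readable by the mysqld user.
INSERT INTO originals (imageNum, image) VALUES (1, LOAD_FILE('/tmp/01.jpg'));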

So, I compromised. I put 20 images into a table as a baseline, and then wrote a stored procedure to suit my needs. Basically, users can have up to 6 images, so I wanted to replicate that possibility in the test db.

This stored procedure takes in 3 integers: the maximum # of images per uid, the number of images in the “originals” table to choose from, and the # of uids to use when populating the table. You will also need a table called “originals” containing the original images. A sample table, with 20 images, can be downloaded from:
http://www.sheeri.net/public/originalimages.sql.gz.

These images were taken from:
http://www.takigen.co.jp/japanese/images/number without explicit permission. I have no idea what Takigen is as I do not read Japanese. I ran a Google Image Search for tiny pictures involving numbers, and found that website that way.

To import it, run at the shell prompt:

> gunzip originalimages.sql.gz
> mysql dbname < originalimages.sql
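
The procedure also inserts into an images table, which the dump above does not create. Here is a sketch of a definition that matches how the procedure uses it (only the column names uid, image, and uploaded come from the procedure; the types are my guess):

-- Guessed-at definition for the images table the procedure populates.
CREATE TABLE images (
  uid      INT NOT NULL,
  image    MEDIUMBLOB NOT NULL,
  uploaded DATETIME NOT NULL,
  KEY (uid)
);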

(explanation below the code)


delimiter $$

drop procedure if exists populate_images $$
create procedure populate_images(max_images INT,num_originals INT,finalcount INT)
not deterministic
modifies sql data
begin
declare uid int default 0;
declare num_images int default 0;
declare image_num int default 0;
declare image_counter int default 0;
set max_images:=max_images+1;
set num_originals:=num_originals+1;
while uid <= finalcount do
  set image_counter:=0;
  set num_images:=FLOOR(RAND() * max_images);
  while image_counter < num_images do
    set image_num:=FLOOR(RAND() * num_originals);
    INSERT INTO images (uid,image,uploaded) SELECT uid,image,NOW() from originals where imageNum=image_num;
    set image_counter:=image_counter+1;
  end while;
  set uid:=uid+1;
end while;
end $$

delimiter ;


Explanation:
declare uid int default 0;
Initialize the counter for the current uid.

declare num_images int default 0;
Initialize the variable for the number of images per user; selected at random with a maximum as given in the 1st argument to the procedure.

declare image_num int default 0;
Initialize the variable that will hold which image to retrieve from the originals table. Selected at random for each image.

declare image_counter int default 0;
Initialize the counter for the image number; reset to 0 for each uid.

set max_images:=max_images+1;
set num_originals:=num_originals+1;

Variables are increased by 1 because FLOOR(RAND() * X) only goes up to X-1; adding 1 lets the random number reach the original maximum. I guess one could change the code to reflect this.

while uid <= finalcount do
loop through the appropriate # of uids…

set image_counter:=0;
set num_images:=FLOOR(RAND() * max_images);

reset the image counter and get a new # of images for this particular uid, between 0 and max_images. If you want to specify between 1 and max_images, change to 1+FLOOR(RAND() * max_images)

while image_counter < num_images do
set image_num:=FLOOR(RAND() * num_originals);
Get a random number between 0 and num_originals (i.e., how many images there are to choose from).

INSERT INTO images (uid,image,uploaded) SELECT uid,image,NOW() from originals where imageNum=image_num;
Insert the random image and the uid and time uploaded into the images table.

set image_counter:=image_counter+1;
end while;

increment the image counter — you are finished with this image for this uid

set uid:=uid+1;
end while;
Increment the uid counter -- you are finished with this uid

end $$
Procedure end


Giuseppe’s article talked about randomly filling things quickly. This isn’t as lightning fast as other methods. To wit:

5 uids produced 20 image entries and took 1.14 seconds. (using call populate_images(6,20,5);)
50 uids produced 126 image entries and took 7.37 seconds.
500 uids produced 1416 image entries and took 1 min 26.48 seconds.
5,000 uids produced 14049 image entries and took 14 min 40.68 seconds.
(It scales linearly per image entry, not surprisingly.)
So 50,000 uids would take about 2.5 hours, and 500,000 entries would take about a day to randomly fill.

Alternatively, using this procedure to randomly fill 5 uids and then repeating the following (each repetition doubles the number of rows):
insert into images (image,uid,uploaded) select image,uid,NOW() from images;

Takes:
5 uids, 1.33 seconds
50 uids, 1.60 seconds
500 uids, 2.03 seconds
5000 uids, 3.08 seconds
50000 uids, 30.45 seconds
500000 uids, 8 min, 47.10 seconds

But that’s only got 5 really random ones, copied a bunch of times. It may not be random enough. How about 1,000 random ones, and then copying them? (I didn’t change uids, although I should have; I just didn’t feel like coding it. See the sketch after the timings below.)

call populate_images(6,20,1000);
insert into images (image,uid,uploaded) select image,uid,NOW() from images;

(repeating that last line)

Takes:
1000 uids, 2 min 53.66 seconds
5000 uids, 2 min 54.24 seconds
50000 uids, 3 min 31.91 seconds
500000 uids, 20 min 45.08 seconds
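
If you do want the copies to land on fresh uids instead of duplicating the existing ones, one way is to stash the current maximum uid in a user variable first (MySQL will not let the INSERT ... SELECT read the target table from a subquery); a rough sketch:

-- Shift the copied rows past the current highest uid so every copy gets a new uid.
SELECT @max_uid := MAX(uid) FROM images;
INSERT INTO images (image,uid,uploaded) SELECT image, uid + @max_uid + 1, NOW() FROM images;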

Comments and suggestions welcome.


Sakila Knit

One of the things I did this weekend was knit the pattern I’d made for Sakila, the dolphin in the MySQL logo. Click on the image for a bigger picture:

The only problem is I have no idea what to do with it. I have more of the orange and blue yarn. I thought I would make it into a purse but it turned out much wider than I expected. I could make it into a big handbag but I don’t think I’d use it. Any suggestions?

The pattern itself:


dolphin: (56 sts wide by 59 rows)
When I say "k to stitch marker and remove", I mean knit to the stitch marker, and take the stitch marker off.

cast on 56 sts.
knit 2 rows

I had to tweak the pattern mid-knit, so I *think* this is the correct pattern. If the count looks OK for rows 40-56 then there's nothing to worry about. Please let me know if you try this. I used size 9 needles and Lily Sugar & Cream Yarn in orange and navy blue. As you can see in the picture above, it turns out to be over a foot wide.

1) k4, p5, k1, place stitch marker, k to end
2) p to stitch marker & remove, k7, p3
3) k2, p3, k1, p6, k2, place stitch marker, k to end
4) p to stitch marker & remove, k4, p6, k2, p2
5) k2, p3, k6, p7, k4, place stitch marker, k to end
6) p to stitch marker & remove, k10, p6, k3, p3
7) k4, p2, k12, p5, k2, place stitch marker, k to end
8) p to stitch marker & remove, k6, p12, k3, p4
9) k5, p3, k14, p4, k2, place stitch marker, k to end
10) p to stitch marker & remove, k4, p16, k2, p to end
11) k6, p3, k5, p2, k9, p4, k1, place stitch marker, k to end
12) p to stitch marker & remove, k3, p10, k3, p4, k3, p to end
13) k7, p3, k5, p2, k11, p3, k1, place stitch marker, k to end
14) p to stitch marker & remove, k3, p19, k2, p to end
15) k8, p2, k20, p4, k1, place stitch marker, k to end
16) p to stitch marker & remove, k4, p20, k3, p to end
17) k8, p3, k21, p3, k1, place stitch marker, k to end
18) p to stitch marker & remove,k4, p20, k2, p to end
19) k10, p2, k21, p3, k1, place stitch marker, k to end
20) p to stitch marker & remove,k3, p21, k2, p to end
21) k11, p3, k21, p2, k1, place stitch marker, k to end
22) p to stitch marker & remove, k3, p21, k2, p to end
23) k12, p2, k22, p2, k1, place stitch marker, k to end
24) p to stitch marker & remove, k3, p23, k1, p to end
25) k12, p1, k23, p3, k1, place stitch marker, k to end
26) p to stitch marker & remove,k3, p24, k2, p to end
27) k11, p2, k25, p2, place stitch marker, k to end
28) p to stitch marker & remove, k2, p25, k3, p to end
29) k10, p2, k26, p3, place stitch marker, k to end
30) p to stitch marker & remove, k2, p27, k2, p to end
31) k10, p2, k5, p2, k20, p3, place stitch marker, k to end
32) p to stitch marker & remove, k2, p20, k3, p4, k3, p to end
33) k11, p2, k3, p4, k20, p3, k1, place stitch marker, k to end
34) p to stitch marker & remove, k3, p20, k5, p3, k2, p to end
35) k11, p2, k3, p5, k20, p4, k3, place stitch marker, k to end
36) p to stitch marker & remove, k6, p20, k3, p1, k2, p3, k1, p to end
37) k12, (*p2, k2* 2 times), p2, k22, p5, k1, place stitch marker, k to end
38) p to stitch marker & remove, k4, p23, k2, p3, k6, p to end
39) k13, p4, k4, p2, k25, p3, k1, place stitch marker, k to end
40) p to stitch marker & remove, k3, p25, k2, p5, k3, p to end
41) k23, p1, k26, p3, k4
42) p3, k3, p26, k2, place stitch marker, p to end
43) k to stitch marker & remove, p2, k26, p4, k2
44) p2, k8, p21, k2, place stitch marker, k1, p to end
45) k to stitch marker & remove, p2, k18, p8, k5
46) p9, k4, p17, k2, place stitch marker, k1, p to end
47) k to stitch marker & remove, p2, k18, p3, k9
48) p8, k4, p17, k2, place stitch marker, k1, p to end
49) k to stitch marker & remove, p3, k17, p4, k7
50) p5, k5, p18, k2, place stitch marker, p to end
51) k to stitch marker & remove, p3, k18, p4, k5
52) p5, k3, p19, k2 place stitch marker, p to end
53) k to stitch marker & remove, p3, k19, p2, k5
54) p4, k2, p19, k3, place stitch marker, p to end
55) k to stitch marker & remove, p4, k19, p2, k3
k 2 rows
