Sheeri Dot Org – Page 54 – Musings on topics including product management, tech, art.

New Book Available: SQL Hacks

Like many other in the “Hacks” series, this book offers 100 tips and tricks for making the most of your database through querying.

See more information at:
http://www.oreilly.com/catalog/sqlhks/

It is about to be published (Nov. 2006). I believe you can order the final cut (as opposed to the “rough cut”), although last week they were still advertising it as a rough cut…now you can order it online, in a print book or to read on Safari.

As a technical editor for this book, I have read through all of it (or at least, all of it that was there in the draft I received) and I can say it is definitely worth the $30 USD. Most of the examples at the time I read it were in MySQL format (though they might have changed that to be more fair to other databases); while it has some material for DB2 and Access, it’s mostly a book for MySQL, Oracle and SQL Server (in that order).

There are hacks for all levels, and all abilities. If you’re a beginner who cannot seem to read the manual, or an expert who refers to it all the time, there’s plenty in this book for you. I learned a lot just by having to edit it, and in fact rewrote a couple of important queries using what I had learned in the book. I would put it on a ‘top 5 books to read if you are a DBA or write SQL’.

Like many other in the “Hacks” series, this book offers 100 tips and tricks for making the most of your database through querying.

See more information at:
http://www.oreilly.com/catalog/sqlhks/

Stopwords in Multiple/Other Languages

Life is slowly returning to order once again. I am attempting to slog through almost 1,000 messages in my MySQL folder, most of which are list questions that have already been answered, so it does not take long to get through them. However, occasionally I find a question that has not been answered, or a gem of a question that I want to expose to a wider audience.

This question fell under both categories. Basically, someone wanted stopwords in other languages, and wondered if there was a place to get them. (English stopwords can be found at http://dev.mysql.com/doc/refman/5.0/en/fulltext-stopwords.html.)

I did a quick web search and found a site that has a bunch of language stopwords:
http://www.ranks.nl/stopwords/

Catalan stopwords
Czech stopwords
Danish stopwords
Dutch stopwords
French stopwords
English stopwords (default)
German stopwords
Hungarian stopwords
Italian stopwords
Norwegian stopwords
Polish stopwords
Portugese stopwords
Spanish stopwords
Turkish stopwords

I hope this helps some folks…..

And now, a tricker question — if there are folks doing fulltext matching in other languages, what is your list of stopwords or where did you get it from? (I am very sure there are tons of sites in the native langauge that lists stopwords, but for admins that do not speak every language their application supports, they can be hard to find!)

What about folks doing searches within a field that may contain multiple languages? Have you created a file to include the stopwords all languages that your application supports? If you have not, should you?

(The documentation on how to change the stopword file parameter is at http://dev.mysql.com/doc/refman/5.0/en/fulltext-fine-tuning.html)

I did a quick web search and found a site that has a bunch of language stopwords:
http://www.ranks.nl/stopwords/

I hope this helps some folks…..

(The documentation on how to change the stopword file parameter is at http://dev.mysql.com/doc/refman/5.0/en/fulltext-fine-tuning.html)

Calling application on the database server from stored procedure or trigger

I do not know enough about the subject to answer this, but I know the folks that read this do.

From: http://mysql.meetup.com/137/boards/view/viewthread?thread=2280640

Hi,

I wonder is there any way to call an application which resides on the database server computer from stored procedure or trigger? Is the language of stored procedures and triggers only restricted by SQL statements and some arithmetic operations?

It is interesting that Oracle and MS SQL Server already have this functionality. In Oracle it is Java stored procedure, in MS SQL Server it is called “extended stored procedure” where you can call C++ routine out of your stored procedure. It is difficult overestimate the convenience of being able to do some processing triggerted by the record insertion event.

In this case the database could be used not only for “pulling” data out, but also being able to “push” data to some component.

You can go to the meetup message boards and post an answer. If you’d like to post anonymously, feel free to use the account:

admin@sheeri.net
and password
guest

Thanx!

I do not know enough about the subject to answer this, but I know the folks that read this do.

From: http://mysql.meetup.com/137/boards/view/viewthread?thread=2280640

Hi,

I wonder is there any way to call an application which resides on the database server computer from stored procedure or trigger? Is the language of stored procedures and triggers only restricted by SQL statements and some arithmetic operations?

It is interesting that Oracle and MS SQL Server already have this functionality. In Oracle it is Java stored procedure, in MS SQL Server it is called “extended stored procedure” where you can call C++ routine out of your stored procedure. It is difficult overestimate the convenience of being able to do some processing triggerted by the record insertion event.

In this case the database could be used not only for “pulling” data out, but also being able to “push” data to some component.

You can go to the meetup message boards and post an answer. If you’d like to post anonymously, feel free to use the account:

admin@sheeri.net
and password
guest

Thanx!

How Not to Check Passwords

So I found this piece of code today:

public final boolean isValidPassword(String password) { String inputHash = Crypto.hash(password); String correctHash = getPasswordHash(); return inputHash.equals(correctHash); }

I am not quite sure what the thought process was behind this — getPasswordHash is a method that simply retrieves a field from the database, so this method gets the password has from the database, hashes the password given, and then uses String.equals() to compare the two.

Why on earth would someone do this instead of just checking the password? I totally understand if the getPasswordHash() method salted the password, or something, but it does not…….

So I found this piece of code today:

public final boolean isValidPassword(String password) { String inputHash = Crypto.hash(password); String correctHash = getPasswordHash(); return inputHash.equals(correctHash); }

Why on earth would someone do this instead of just checking the password? I totally understand if the getPasswordHash() method salted the password, or something, but it does not…….

October Boston MySQL User Group Topic: Boolean Values and Bit Operators

Boston October MySQL User Group: see full event listings at:

http://mysql.meetup.com/137/calendar/5118339/

Tuesday, Oct. 10th at MIT, free pizza and soda (thanks to MySQL, AB and the MIT community). Please RSVP!!

To RSVP anonymously, please login to the Meetup site with the e-mail address “admin at sheeri dot com” and the password “guest”.

Plenty of free parking (you can park in MIT lots after 3 pm); 1 block south of the Kendall Square T stop.

————–

Most of the September Boston User Group was spent discussing an interesting problem with a large amount of data (5 million records). Basically, this data had about 40 boolean (or small set) fields that needed to be able to be searched against. Folks suggested:

1) Just leaving the table as is and using 1-character values
Pro: simple
Con: Indexes are bad for columns with low selectivity, searching will take a long time due to full table scans

2) Creating a “joining” table for each boolean value
Pro: Indexing for each boolean value can be used
Con: Complex — lots of tables, lots of joins for search

3) Using BIT(1) values or BIT(2) values and matching up booleans
Pro: Simple
Con: Difficult to write the search query, keeping in mind the search terms given below.

The biggest issue is the accuracy of indexes vs. size/amount of tables and joins. The person with the original problem (Chris) and I are doing a joint presentation, with real data on those three cases to figure out which is the best for his situation.

What have other folks done for boolean values? Please be specific about the amount of data, and the performance. Remember that this situation involves a lot of data and a lot of boolean fields, and searching across any or all boolean/small set fields is a core function. As well, fields may be null, and searching may include:

For boolean:
search for 0
search for 1
search for 0 or 1 (any value set)
search for NULL (any value not set)
search for 0 or NULL
search for 1 or NULL

For small sets:
search for ‘a’ (single value match)
search for ‘a’,’b’, and ‘c’ (multiple values will match)
search for ‘any value not null’ (anything not null)
search for ‘any value including null’ (anything null)

Any ideas? I will do some quick research if there’s another option that the September User Group did not come up with.

Boston October MySQL User Group: see full event listings at:

http://mysql.meetup.com/137/calendar/5118339/

Tuesday, Oct. 10th at MIT, free pizza and soda (thanks to MySQL, AB and the MIT community). Please RSVP!!

To RSVP anonymously, please login to the Meetup site with the e-mail address “admin at sheeri dot com” and the password “guest”.

Plenty of free parking (you can park in MIT lots after 3 pm); 1 block south of the Kendall Square T stop.

————–

1) Just leaving the table as is and using 1-character values
Pro: simple
Con: Indexes are bad for columns with low selectivity, searching will take a long time due to full table scans

2) Creating a “joining” table for each boolean value
Pro: Indexing for each boolean value can be used
Con: Complex — lots of tables, lots of joins for search

3) Using BIT(1) values or BIT(2) values and matching up booleans
Pro: Simple
Con: Difficult to write the search query, keeping in mind the search terms given below.

For boolean:
search for 0
search for 1
search for 0 or 1 (any value set)
search for NULL (any value not set)
search for 0 or NULL
search for 1 or NULL

Any ideas? I will do some quick research if there’s another option that the September User Group did not come up with.

Isolation & Concurrency

A few meetups ago, we had Jim Starkey come and speak. During that evening, he asked “Why is MVCC not serializable?” It was a “go and do your homework” kind of question.

So, in order to answer that question, let’s explore the concepts of isolation and concurrency, the “i” and “c” in ACID compliance. For this post, I will refer to “transactions”, by which I mean actual transactions, as well as statements where autocommit=0. In other words, every statement is its own transaction, unless it is within an explicit transaction, and for the purpose of this article, those statements are “transactions”. Of course, for many of the explanations below, if there’s only one statement in the transaction, there’s no problem. (As usual, if I make a mistake, please correct me. I will make sure all comments from all sources get to the article at http://www.sheeri.net/123 .)

Isolation levels define how separate the transaction data is from the actual data in the table. So if I begin a transaction, read a row from the table, change the row, read the row from the table, and commit, what do I see at each step? What does someone else who is also reading the same row see?

Isolation Level: Read Uncommitted
#	Step	Thread 1: changing data	Thread 2: reading data
0	Thread 1: Read row Thread 2: Read row	sees data from original row	sees data from original row
1	Thread 1: Change Row Thread 2: N/A	changes data	N/A
2	Thread 1: Read row Thread 2: Read row	sees new data	sees new data
3	Thread 1: Commit Thread 2: N/A	Commits data	N/A
4	Thread 1: Read row Thread 2: Read row	sees new data	sees new data

So a read uncommitted is the lowest level of isolation in that there is none. This is how statements that are implicit transactions act — you see the data right away, even if what you are doing is logically a transaction. This is also called a “dirty read” because information is seen right away. If the transaction rolled back, and ‘undid’ the change, then the original data would be back for all to see.

More isolated than read uncommitted read is “read committed”, which, as you can guess, means that thread 2 can only read committed changes.

The description in bold is the difference between read uncommitted and read commited:

Isolation Level: Read Committed
#	Step	Thread 1: changing data	Thread 2: reading data
0	Thread 1: Read row Thread 2: Read row	sees data from original row	sees data from original row
1	Thread 1: Change Row Thread 2: N/A	changes data	N/A
2	Thread 1: Read row Thread 2: Read row	sees new data	sees data from the original row
3	Thread 1: Commit Thread 2: N/A	Commits data	N/A
4	Thread 1: Read row Thread 2: Read row	sees new data	sees new data

However, this means that for the duration of thread 2‘s transaction, the data may change. This is not necessarily desirable, and is seen as a ‘dirty’ read because for the duration of thread 2’s transaction, the state of the data changes.

So if thread 1 commits a change, thread 2 does not see it until it reads the changed row in a transaction that started after the commit.

The part in bold is the difference between read uncommitted and repeatable read:

Isolation Level: Repeatable Read
#	Step	Thread 1: changing data	Thread 2: reading data
0	Thread 1: Read row Thread 2: Read row	sees data from original row	sees data from original row
1	Thread 1: Change Row Thread 2: N/A	changes data	N/A
2	Thread 1: Read row Thread 2: Read row	sees new data	sees data from the original row
3	Thread 1: Commit Thread 2: N/A	Commits data	N/A
4	Thread 1: Read row Thread 2: Read row	sees new data	sees data from original table
5	Thread 1: N/A Thread 2: New transaction	Thread 2 starts new transaction	N/A
6	Thread 1: Read row Thread 2: Read row	sees new data	sees new data

This sounds like a good level of isolation. However, there are many cases where a transaction will want the most recent available data. For instance, if two people who have a joint bank account take out money at the same time, having two systems start with the same balance, update the new balance and then write the new balance to the database will cause the bank to lose a lot of money, as only one of the two withdrawals will be reflected.

The serializable isolation level helps prevent this — it does not allow a read of data that is being written to. In other words, there is a serial order to transactions, even reads.

The part in bold is the difference between repeatable read and serializable:

Isolation Level: Serializable
#	Step	Thread 1: changing data	Thread 2: reading data
0	Thread 1: Read row Thread 2: Read row	sees data from original row	sees data from original row
1	Thread 1: Change Row Thread 2: N/A	changes data	N/A
2	Thread 1: Read row Thread 2: Read row	sees new data	is not allowed to start the transaction of reading
3	Thread 1: Commit Thread 2: N/A	Commits data	N/A
4	Thread 1: Read row Thread 2: Read row	sees new data	is allowed to start the read transaction, sees new data
5	Thread 1: N/A Thread 2: New transaction	Thread 2 starts new transaction	N/A
6	Thread 1: Read row Thread 2: Read row	sees new data	sees new data

I have not yet addressed how these restrictions are enforced, that’s what concurrency is.

Concurrency is how the different working environments occur at the same time. Folks who know about version control understand concurrency issues. One way to control concurrency is to use READ and WRITE locks. To achieve a read uncommitted isolation level, there are no locks. For read committed, a read lock on the data allows multiple threads to read the data, but no thread may write to the data; a write lock does not allow reading until the lock is released. For repeatable read, a read lock means that the thread always sees the same data even if it is changed on disk. For serializable, only one lock can be on the data at a time.

Those who deal with version control know that there is a better system than those locks. Imagine not being able to look at a code file while someone was changing it! In many version control systems, each user has a working copy where changes can be made, which is hidden from everyone else until commit. That is multi-version concurrency control (MVCC) — there are multiple versions, all out there at once concurrently. MySQL employes MVCC.

The serializable isolation level makes more sense using the MVCC model — before a software developer commits, they ususally update their files to see if other changes have been made. This does not quite carry over to databases.

Within a multi-statement transaction (for storage engines with support), the database gives a snapshot of the data to the transaction to work with, much like checking out a version. The physical database may be changing with other commits, but much like version control in software, you only see the data you ‘checked out’ for the duration of the transaction.

Therefore, all isolation levels up to repeatable read are easy to enforce, but serializable is more difficult. There is no way to “update” the snapshot of data within a transaction, to use what’s currently in the database; you can only look at the data as it was at the beginning of the transaction (repeatable read isolation level).

So that’s why MVCC in a database is not serializable.

To be clear: You can set the isolation level to be SERIALIZABLE in MySQL.

See:
http://dev.mysql.com/doc/refman/5.0/en/innodb-transaction-isolation.html

and

http://dev.mysql.com/doc/refman/5.0/en/innodb-transaction-isolation.html

A few meetups ago, we had Jim Starkey come and speak. During that evening, he asked “Why is MVCC not serializable?” It was a “go and do your homework” kind of question.

Choosing Datatypes for Fields

Twice recently I have had to import data from a flat file into a database — both sets of data were being imported to make better use of the data in the files. Importing a file is not that difficult using mysqlimport, particularly since in both cases the data was tab-delimited. I also could have used LOAD DATA INFILE.

The problem with mysqlimport and LOAD DATA INFILE is that you need to create the table beforehand. Often, though, I may not know the exact data types. Folks have written about PROCEDURE ANALYSE(), but they use it in the context of “you can check to see that your tables have the correct data types.”

Now, that’s always a good thing — say, quarterly or yearly, doing a schema review, including indexes, foreign keys and data types. However, where PROCEDURE ANALYSE() really comes in handy for me is when I am importing data.

Basically, I create a table with the correct number of fields, all with the datatype of tinytext. Then I run a SELECT * FROM tbl PROCEDURE ANALYSE()\G to find out what types my data really want to be.

Of course, you want to pay attention to warnings when importing, especially truncation warnings. Generally I use this method a few times on the data, and refine the column types a few times before I get it right.

My one complaint is that I wish PROCEDURE ANALYSE() would state the current column datatype as well as the “ideal” data type. It wastes a lot of time for me to make sure that the “ideal” data type is the one I already have, and it should not be that difficult for the function itself to do it.

Open Source Removes the Need for Warez . . . . ?

I was listening to Practical Web Design Magazine’s podcast on Free Web Designer’s Toolkit, and towards the end he said something along the lines of:

With all this great open source software, there’s no need to buy or crack expensive software anymore.

I thought it was an interesting statement. I do not believe WAREZ will ever go away, because many like to have the expensive software as a bragging right — or even boast that they have a cracked version.

However, it does keep the honest people honest. If you want to do some photo editing, you can use the GIMP. If you want to write a document, spreadsheet or presentation, you can use OpenOffice. And these programs are compatible; they can read and save in proprietary formats as well.

It is getting to the point where not only are there excellent open source programs for anything you want to do, there are also “good enough” open source (or closed, but freeware) programs for almost anything. For instance, I was looking into open source versions of money management software (such as Quicken) for corporations and I found some web-based versions, which is great for organizations that want to be able to share information without having multiple software licenses.

Stealing will never go away; however, with less stealing, companies will be able to focus more on the products. And perhaps they’ll realize that if they go open source, “stealing” becomes a different beast, depending on the pricing model. It’s much more difficult to steal a support contract, which may involve a serial # of a machine, than a software key. Or consulting support, for that matter.

What do you think?

I was listening to Practical Web Design Magazine’s podcast on Free Web Designer’s Toolkit, and towards the end he said something along the lines of:

With all this great open source software, there’s no need to buy or crack expensive software anymore.

What do you think?

One TRIGGER Fact, Optimizer Rewriting Stuff

One thing about TRIGGERs is that you can have 6 triggers per table:

BEFORE
AFTER

INSERT
UPDATE
DELETE

That was brought up at the User Group meeting.

In other news, I have been reading the MySQL Internals Manual. The most interesting chapter yet is Chapter 3, the optimizer. This can be read out of order, and actually contains lots of information that folks can use. For instance:

WHERE col1=col2 AND col2=’x’
transforms to
WHERE col1=’x’ AND col2=’x’

because constant matching is faster.

Tautologies and always-false statements are usually optimized out of the query so that something like WHERE 0=0 does not waste time by actually being tested.

A table with 0 or 1 rows is a constant value. Another constant table scenario is WHERE col1=[constant] for all the columns of any UNIQUE or PRIMARY KEYS, assuming they are defined as NOT NULL. This is a good reason to use NOT NULL for column definitions unless you really need NULL.

A constant table is what is used by the optimizer when you see “const” as the “join type” in the result of an EXPLAIN.

WHERE col1 in (1,2,3) is the exact same expression as WHERE col1=1 OR col1=2 OR col1=3

WHERE col1 BETWEEN 1 AND 3 is the exact same expression as WHERE col1>=1 AND col1< =3

The latter 2 examples are good to know, because the first of the two statements in each case are more clear.

One thing about TRIGGERs is that you can have 6 triggers per table:

BEFORE
AFTER

INSERT
UPDATE
DELETE

That was brought up at the User Group meeting.

WHERE col1=col2 AND col2=’x’
transforms to
WHERE col1=’x’ AND col2=’x’

because constant matching is faster.

Tautologies and always-false statements are usually optimized out of the query so that something like WHERE 0=0 does not waste time by actually being tested.

A constant table is what is used by the optimizer when you see “const” as the “join type” in the result of an EXPLAIN.

WHERE col1 in (1,2,3) is the exact same expression as WHERE col1=1 OR col1=2 OR col1=3

WHERE col1 BETWEEN 1 AND 3 is the exact same expression as WHERE col1>=1 AND col1< =3

The latter 2 examples are good to know, because the first of the two statements in each case are more clear.

Open Source Business Model “Ratings”

Back when I posted How “Open” Do You Have To Be To Be Open Source?, Mårten Mickos e-mailed me asking what I thought of various models of charging for open source, and how I thought they ‘rated’.

Or should we just say “whatever business model, as long as it works and users get what they want”?

Q: Why should businesses charge at all?
A: Because people work hard and deserve to get paid for that work, and folks should give up a part of their own hard-earned money because they get value out of that work.

Q: But why do we need to be paid at all?
A: There are expenses to meet — business expenses, and every employee has personal expenses.

In an ideal world, there would be a free exchange of ideas, services and goods. Unfortunately, this only works for small models, such as a family where everyone pitches in. Communism on a larger scale tends to fail, because someone has to “play Root” and control what gets doled out and when.

As someone with strong socialist tendencies, I tend to feel that we should all try to provide for each other when we can. This is why I like open source very much. Imagine a car dealer giving you a car for free, and then charging for repairs and maintenance. Sure, you could think the dealer would be more likely to give you a bad car, so he could make lots of money, but that’s not the mentality behind open source.

So what models are good for open source? Whatever can keep the company going strong and keeping the overall spirit of open source — helping each other — alive. The license that embedded software end-users buy helps pay for the development that everyone benefits from, even application software end-users who do not buy a support contract.

Being a relative newcomer to the stage, but having some business sense, what would I do if I were designing a business model for an open source company? I think the spirit of open source suggests some kind of “giving back” — but how does one determine how much to give back? Who would be in charge of it? Would there be a certain number of questions on any forum or mailing list one would have to answer correctly? Or a certain number of related blog posts? What about a banner on the website that says “powered by”?

Having worked in development (that’s the fundraising type of development, not the coding type of development) I can say that many people use volunteer goods and services and give nothing back. I like to give time and/or money to organizations I find useful, for me or for others. I would hope that those who can, give back, in the ways they can.

However, of course, Marten needs to eat. So I think in the spirit of open source, any of the following are acceptable, so long as the source remains open and the company keeps the needs of the users in mind:

1) Suggesting donations.
2) Requiring licenses for commercial use (if you make money, you have to pass some along to us).
3) Requiring licenses for special types of usage (ie, embedded).
4) Requiring that the technology be visible (ie, banner “powered by…” on the website and explicit mention of the open source technology where appropriate). This model can be combined with a free model — for instance, “if you tell people you use us, it’s free. Otherwise, pay us.” This method appeals to me the most, but of course it would not work for all technologies or software.
5) Charging for support*
6) Charging for consulting*
7) Charging for packaging (ie, Linux with a library of rpms built that virtually guarantees compatibility)
8) “Lite” or “Express” versions of software for free, with full-featured versions for pay — “freeware” or “shareware”. Trial versions that expire do not count.

MySQL has a great model, in my opinion, because you can buy a license if required, you can buy support, you can buy consulting, you can buy training, and you can buy packaging (MySQL Network gets you a more thoroughly tested binary). However, they offer free webinars, free software, free forums and mailing lists….and free access to their developers, who appear on the lists and forums and fix the bugs tracked.

As much as I do not like to part with my hard-earned cash, most medium-to-large businesses can afford $600/year per server (that they want help with) for the most basic MySQL support. I will not bother comparing database software licensing or purchase pricing; that has been done many times over.

Mostly, I think any model that is actually in the spirit of open source (trial versions that expire are not, although I cannot imagine those would be open source, given that you could then change the code that expires the features) is OK, and one that does not charge excessively….although that, of course, is hard to define as well.

* provided that these are not required to get a working setup in place. For instance, if you need support or consulting to get the product working in the first place, that’s akin to just charging for the software.

Or should we just say “whatever business model, as long as it works and users get what they want”?

Q: But why do we need to be paid at all?
A: There are expenses to meet — business expenses, and every employee has personal expenses.