What is an “unauthenticated user”?

Every so often we have a client worrying about unauthenticated users. For example, as part of the output of SHOW PROCESSLIST they will see:

+-----+----------------------+--------------------+------+---------+------+-------+------------------+
| Id  | User                 | Host               | db   | Command | Time | State | Info             |
+-----+----------------------+--------------------+------+---------+------+-------+------------------+
| 235 | unauthenticated user | 10.10.2.74:53216   | NULL | Connect | NULL | login | NULL             |
| 236 | unauthenticated user | 10.120.61.10:51721 | NULL | Connect | NULL | login | NULL             |
| 237 | user                 | localhost          | NULL | Query   | 0    | NULL  | show processlist |
+-----+----------------------+--------------------+------+---------+------+-------+------------------+

Who are these unauthenticated users, how do they get there, and why aren’t they authenticated?

The client-server handshake in MySQL is a 4-step process. Those familiar with mysql-proxy already know these steps, as there are four functions that a Lua script in mysql-proxy can override. The process is useful to know for figuring out exactly where a problem is when something breaks.

Step 1: The client sends a connect request to the server. There is no information in this request (as far as I can tell). However, it does mean that if you try to connect to the host and port of a mysqld server that is not available, you will get:

ERROR 2003 (HY000): Can't connect to MySQL server on '[host]' (111)

Step 2: The server assigns a connection and sends back a handshake, which includes the server’s mysqld version, the thread id, the server host and port, the client host and port, and a “scramble buffer” (for salting authentication, I believe).

It is during Step 2 where the connections show up in SHOW PROCESSLIST. They have not been authenticated yet, but they are connected. If there are issues with authentication, connections will be stuck at this stage. Most often stuck connections are due to DNS not resolving properly, which the skip-name-resolve option will help with.

Step 3: Client sends authentication information, including the username, the password (salted and hashed) and default database to use. If the client sends an incorrect packet, or does not send authentication information within connect_timeout seconds, the server considers the connection aborted and increments its Aborted_connects status variable.

Step 4: Server sends back whether the authentication was successful or not. If the authentication was not successful, mysqld increments its Aborted_connects status variable and sends back an error message:

ERROR 1045 (28000): Access denied for user 'user'@'host' (using password: [YES/NO])
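If you want to see whether handshakes are failing or timing out, a couple of quick checks help (a minimal sketch using standard status and system variables; skip-name-resolve itself is a server option set at startup):

-- how many connections died before or during authentication?
SHOW STATUS LIKE 'Aborted_connects';
-- how long the server waits for the client's authentication packet (Step 3):
SHOW VARIABLES LIKE 'connect_timeout';
-- skip-name-resolve goes in my.cnf, e.g.:
--   [mysqld]
--   skip-name-resolve

If Aborted_connects climbs steadily while connections sit in the "login" state, the DNS/timeout explanation above is the first thing to rule out.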

Hope this helps!

Log Buffer #72 — a Carnival of the Vanities for DBAs

Welcome to the 72nd edition of Log Buffer, the weekly review of database blogs.

Oracle OpenWorld (OOW) is over, and Lucas Jellema of the AMIS Technology blog notes the OOW Content Catalog has been updated with most of the presentations available for download.

On his way home from OOW, Chris Muir of the appropriately titled One Size Doesn’t Fit All blog notes how OOW and the Australian Oracle User Group Conference compare: the AUSOUG Perth conference had roughly 99% fewer attendees, from 45k down to 350.

Mark Rittman of Rittman Mead Consulting summarizes OOW’s impact on business intelligence and data warehousing in Reflections on Oracle’s BI Strategy. On his way home, Mark found time for A First Look at Oracle OLAP 11g, noting the pros, cons, gotchas and suggestions for improvement for many useful new features.

Microsoft SQL Server also has a new release in the works. Ted Malone in Agile Methods for the DB Dev is excited about SQL Server 2008 “Katmai” CTP 5 New Features and describes almost 20 of them.

Ian Barwick of PostgreSQL Notes talks about Converting tsearch2 to 8.3 now that the tsearch2 full text search engine has been integrated as a core PostgreSQL feature.

Patrick Barel of the Bar Solutions Weblog explains a new feature of Oracle 11g called Virtual Columns. While virtual data may be a new topic, using databases on virtual machines is an ongoing issue. Marco Russo of SQL BI gives his opinion on when to use virtual machines in SQL Server Virtualization.

Database professionals can be real characters, and set in their ways. Bad puns make good transitions, and Corrado Pandiani sheds light on MySQL’s rules for Charsets and Collations on Multicolumn Fulltext Indexes. Adam Douglas of Binary Expressions fixed some trouble with MySQL and French Characters not rendering properly.

Greg Sabino Mullane shows reasons for his Problems with pl/perl and UTF-8. In Tending the Garden, Selena Deckelmann goes through the very easy process of Automatic Character Set Conversion in PostgreSQL. Selena has also been busy organizing the development of ptop, an interactive, command-line tool for monitoring the current status of a PostgreSQL database. If you read this in time and are in the Portland, Oregon area you can join the ptop hackathon at noon (local time) tomorrow, Saturday November 24th, or you can read the ptop meeting summary from pdxpug.

While some of us are database tools, some of us prefer to contribute database tools. Baron Schwartz honors MySQL’s trademark by announcing that MySQL Toolkit is now Ma’atkit. Ma’at, pronounced “mott”, is the ancient Egyptian patron saint of truth, harmony and order. In addition, Baron proclaims “Ma’atkit Version 1297 Released!”

Hubert Lubaczewski notes the changes to the analyze.pgsql.logs.pl script of pgsql-tools in update 3 and update 4.

Hubert also notes how to find overlapping time ranges and how to find the number of ranges a time belongs to in time ranges in postgresql – part 2. Though written for PostgreSQL, both posts can easily be applied to another DBMS. In the same vein, Yves Trudeau shares the DBMS-independent graphical images of Unix memory usage in Generating graphs from vmstat output.

Jeromy McMahon posts sample SQL code for viewing Oracle extent segments for tablespaces, temporary spaces and sort segment space. The Cheap DBA gets Oracle specific with a Slick Shell Script for Reporting on Oracle Workload. Krister Axel of codeboxer.com has A really clean dynamic insert proc for PL/SQL ETL packages, including validation checking and exception handling. zillablog‘s Robert Treat treats us to a function for tracking plperl shared variables.

Jen M is Keeping IT simple by coding capacity measurements to show How Not to Outgrow Your DB Infra: A Simple Step. She follows up with more code to monitor a specific cache to resolve unexplainable slowness/resource leak in SQL Server.

This post began with a conference, and so it shall conclude. The Call For Proposals for PgCon 2008 is underway, and David Fetter lets us know that PgCon 2008 will be held May 22-23 at the University of Ottawa. This is different from Joshua Drake‘s call for volunteers for Command Prompt’s Postgresql Conference East 08, on March 28-29 at the University of Maryland. Neil Conway informs us of a Jim Gray Tribute, consisting of a general session and 9 half-hour technical sessions reviewing some of the 1998 Turing Award winner’s work.

In case this edition did not give you enough to read, Beth Breidenbach of Confessions of a Database Geek created an aggregate blog feed for posts relating to information quality.

Picking Up Where You Left Off……

I started this as a response to Keith Murphy’s post at http://www.paragon-cs.com/wordpress/?p=54, but it got long, so it deserves its own post. The basic context is figuring out how not to cause duplicate information if a large INSERT statement fails before finishing.

Firstly, the surefire way to make sure there are no duplicates if you have a unique (or primary) key is to use INSERT IGNORE INTO.
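For example (a minimal sketch against a hypothetical table; rows whose unique or primary key already exists are silently skipped instead of raising a duplicate-key error):

-- hypothetical table with a primary key on id
INSERT IGNORE INTO mytable (id, col1)
VALUES (1, 'already there'), (2, 'new row');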

Secondly, I just experimented with adding an index to an InnoDB table that had 1 million rows, and here’s what I got (please note, this is one experience only, the plural of “anecdote” is *not* “data”; also I did this in this particular order, so there may have been caching taking place):

Way #1:
– ALTER the table to add the new index. This was the slowest method, taking over 13 minutes.

Way #2:
– CREATE a new table with the same schema as the old except for adding the new index
– INSERT INTO newtable SELECT * FROM oldtable;
– ALTER TABLE oldtable RENAME somethingdifferent;
– ALTER TABLE newtable RENAME oldtable;

The ALTER TABLEs happen instantly. This was faster by a few seconds, which is statistically negligible given the 13+ minutes total time.
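For reference, here is a sketch of Way #2, assuming a hypothetical table oldtable and a new index on a hypothetical column col1 (CREATE TABLE … LIKE needs MySQL 4.1 or later):

CREATE TABLE newtable LIKE oldtable;             -- same schema, no data
ALTER TABLE newtable ADD INDEX idx_col1 (col1);  -- adding the index to an empty table is quick
INSERT INTO newtable SELECT * FROM oldtable;     -- this is where the 13+ minutes go
ALTER TABLE oldtable RENAME somethingdifferent;
ALTER TABLE newtable RENAME oldtable;
-- once you have verified the new table: DROP TABLE somethingdifferent;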

Way #3:
– mysqldump the table schema only (–no-data) into a file (tableschema.sql).

– mysqldump the table data only (-t) into another file (tabledata.sql).
– optionally pipe into awk to replace “^INSERT INTO” with “INSERT IGNORE INTO”

– edit the table schema file, adding the new index into the table definition
– optionally change the name of the table to something like newtable, making sure to change the DROP TABLE *and* CREATE TABLE statements.

– mysql < tableschema.sql (this will drop the old table unless you changed the name)
– mysql < tabledata.sql
– If you changed the table name in the DROP and CREATE statements, run ALTER TABLE oldtable RENAME somethingdifferent; and ALTER TABLE newtable RENAME oldtable;
– Delete the “somethingdifferent” table

This way took just over 10 minutes, 3 minutes faster than the other 2 ways, for a time savings of roughly 25%.

CAVEAT: MySQL helpfully moves references to a table to the new table name when you ALTER TABLE…RENAME. You will have to adjust your foreign keys, stored procedures, functions and triggers if you use anything other than Way #1.

CAVEAT #2: Make sure that the character set of the MySQL server is supported by the MySQL client and the operating system where you’re dumping the file to, otherwise special characters can end up falling victim to mojibake.

Top 10 MySQL Best Practices

So, O’Reilly’s ONLamp.com has published the “Top 10 MySQL Best Practices” at http://www.onlamp.com/pub/a/onlamp/2002/07/11/MySQLtips.html. Sadly, I find most “best practice” lists do not thoroughly explain the “why” enough for people to make their own decisions.

For instance, #3 is “Protect the MySQL installation directory from access by other users.” I was intrigued at what they would consider the “installation” directory. By reading the tip, they actually mean the data directory. They say nothing of the log directory, nor that innodb data files may be in different places than the standard myisam data directories.

They perpetuate a myth in #4, “Don’t store binary data in MySQL.” What they really mean is “don’t store large data in MySQL”, which they go into in the tip. While it’s true that there is very little benefit to having binary data in a database, they don’t go into what those benefits are. This means that people can’t make informed decisions, just “the best practice is this so I’m doing it.”

The benefit of putting binary data in MySQL is to be able to associate metadata and other data. For instance, “user 200 owns file 483”. If user 200 is gone from the system, how can you make sure file 483 is as well? There’s no referential integrity unless it’s in the database. While it’s true that in most cases people would rather sacrifice the referential integrity for things like faster database backups and easier partitioning of large data objects, I believe in giving people full disclosure so they can make their own informed decision.
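As a minimal sketch of that tradeoff (hypothetical tables; InnoDB assumed, since MyISAM does not enforce foreign keys):

CREATE TABLE users (
  user_id INT NOT NULL PRIMARY KEY
) ENGINE=InnoDB;

CREATE TABLE files (
  file_id  INT NOT NULL PRIMARY KEY,
  owner_id INT NOT NULL,
  contents LONGBLOB,
  KEY (owner_id),
  FOREIGN KEY (owner_id) REFERENCES users (user_id) ON DELETE CASCADE
) ENGINE=InnoDB;

-- removing user 200 now removes that user's files as well:
DELETE FROM users WHERE user_id = 200;

Storing the file outside the database means the application has to do that cleanup itself.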

#5 is my biggest pet peeve. “Stick to ANSI SQL,” with the goal being to be able to migrate to a different platform without having to rewrite the code. Does anyone tell Oracle folks not to use PL/SQL features like collections? Nobody says “SQL is a declarative language, PL/SQL is procedural, therefore you should never use it.” How about telling SQL Server folks not to use Transact-SQL statements like WAITFOR? MATCH…AGAINST is not standard SQL, so I should never use it?

Now, of course, if you’re selling a product to be run on different database platforms, then sure, you want to be platform agnostic. But you’d know that from the start. And if you have to migrate platforms you’re going to have to do lots of work anyway, because every major database has its own third-party additions and extensions in any case.

And why would *anyone* choose a specific database, and then *not* use those features? I think that it’s a good tip to stick to ANSI SQL if you *know* you want to, or if you have no idea about the DBMS you’re using.

If you want to see how this cripples MySQL, check out Visibone’s SQL chart at: http://www.visibone.com/sql/chart_1200.jpg — you can buy it here: http://sheeri.com/archives/104. I may post later on about my own personal MySQL Best Practices….

Making Queries 45-90 Times Faster!!

aka…..”when good queries go bad!”

So, today the developers were debugging why a script was running much longer than expected. They were inserting text into the database, and got to the point where they realized that double the amount of text meant the queries took double the amount of time.

You see, they were doing similar text inserts over and over, instead of using connection pooling and/or batching them. Apparently the other DBA explained that it was a limitation of MySQL, but either the developers didn’t convey what they were doing well, or the DBA didn’t think to mention batching.

I ran a simple test on a test server. I used the commandline to connect to a db server on the same machine (even though in QA and production the database lives on a separate machine) just to make a point:

Queries per connect:

+-------------------+----------+---------+----------------------------+---------+
| Type              | Connects | Queries | Length of data transmitted | Time    |
+-------------------+----------+---------+----------------------------+---------+
| One-off           | 1000     | 1       | 619 bytes                  | 12.232s |
| Single Connection | 1        | 1000    | 604 kilobytes              | 0.268s  |
| Batch             | 1        | 1       | 517 kilobytes              | 0.135s  |
+-------------------+----------+---------+----------------------------+---------+

So 1000 INSERTs using 1 connection is over 45 times faster than 1000 INSERTs using 1000 connections.
Using 1 batch INSERT statement is over 1.75 times faster than using 1 connection.
Using 1 batch INSERT statement is over 90 times faster than 1000 INSERTs using 1000 connections.

Note that while it’s faster to send a batch, if you don’t support sending 517 kilobytes to your database at once, you’ll want to break it up. That’s a small coding price to pay for 90x the database performance!!!
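How much you can send in a single statement is governed by the server’s max_allowed_packet setting, so a quick sanity check before batching (a minimal sketch):

SHOW VARIABLES LIKE 'max_allowed_packet';
-- if a batch INSERT would be larger than this value (in bytes), split it into several smaller batches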

For reference, the formats used:
One-off:
INSERT INTO foo (col1, col2…) VALUES (val1, val2…);

Single Connection:
INSERT INTO foo (col1, col2…) VALUES (val1, val2…);
INSERT INTO foo (col1, col2…) VALUES (val1a, val2a…);

Batch: INSERT INTO foo (col1, col2…) VALUES (val1, val2…), (val1a, val2a);

OurSQL Episode 22: Things To Avoid With MySQL Queries

Feature:

Things to Avoid in Queries
Subqueries and Correlated subqueries
http://dev.mysql.com/doc/refman/4.1/en/correlated-subqueries.html

Jan Kneschke’s post on Groupwise Maximum:
http://jan.kneschke.de/projects/mysql/groupwise-max

Calculated comparisons do not use indexes (see the sketch after these show notes)

INSERT IGNORE
REPLACE
INSERT…ON DUPLICATE KEY
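As a minimal sketch of the “calculated comparisons” point, assuming a hypothetical table t with an index on created_at: wrapping the indexed column in a function hides the index, while moving the calculation to the constant side of the comparison lets the index be used.

-- cannot use the index on created_at:
SELECT * FROM t WHERE TO_DAYS(created_at) = TO_DAYS(NOW());

-- can use the index on created_at:
SELECT * FROM t
WHERE created_at >= CURDATE()
  AND created_at < CURDATE() + INTERVAL 1 DAY;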

Feedback:

Email podcast@technocation.org

call the comment line at +1 617-674-2369

use Odeo to leave a voice mail through your computer:
http://odeo.com/sendmeamessage/Sheeri

or use the Technocation forums:
http://technocation.org/forum

Direct play this episode at:
http://technocation.org/content/oursql-episode-22%3A-things-avoid-mysql-queries

MySQL Queues, part II — groups of queues

I believe this is a huge optimization for a heavily implemented Web 2.0 idea.

This article makes simple work of groups of queues. An example of this would be “the most recent 10 people to view an article,” so each article has a queue of up to 10 items in it. This method eliminates the need for multiple SQL statements or using TRIGGERS to check to see if the queue is full.

I bow down to Baron Schwartz, aka Xarpb, for his article on how to implement a queue in SQL:

http://www.xaprb.com/blog/2007/01/11/how-to-implement-a-queue-in-sql/

I am very excited because this also works for groups of objects, and we’re about to implement something at work that needs this idea. The idea of “the most recent x things” or “the top x things” is huge, especially in social networking, and probably one of the most often sought after features.

The biggest issue is that in order to display, say, the most recent posts, a query has to find the time of all the posts and only get the most recent 10. This can be made easy by the logic that the 10 most recent posts are the last 10 rows in the table. Any additional logic is easily added, as in “the last 10 rows in the table that are viewable by this guest/login.”

What if you want to track the last 10 people to view the post? Aha, this gets trickier. Convention would say that when a person views a post, have an SQL transaction that adds the information (person x viewed post y at time z and any other info, such as browser type, IP, etc) and if there are more than 10 entries for that post, delete the oldest ones until you have 10 entries. This transaction could be done via the application code or via triggers in MySQL 5.0 and up.

However, both those methods use multiple SQL queries, and in the case that an article has been viewed fewer than 10 times, the queries are unnecessary. And given each article has a different popularity — some are viewed lots more than others — running multiple queries ends up being a waste of cycles for articles whose last 10 viewers change infrequently.

These commands were tested on MySQL 4.1.19-standard-log. I use REPLACE INTO because it’s shorter than INSERT…ON DUPLICATE KEY UPDATE, and yes, neither of those is standard SQL.

Let’s say you have a New Year’s Resolution to eat 5 servings of fruits and 5 servings of vegetables per day. The only thing that changes from Baron’s example is that we add a group field (called ‘kind’). The “fruit” field was renamed “food” and still contains the name of the item eaten.

As Baron does, I will use a MySQL-specific command. However, he used INSERT…ON DUPLICATE KEY UPDATE and I will use REPLACE, as its syntax is shorter.

use test;
CREATE TABLE q (
id int NOT NULL,
modulo int NOT NULL,
kind char(1) NOT NULL,
food varchar(10) NOT NULL,
PRIMARY KEY(id,kind),
UNIQUE KEY(modulo,kind)
);

The basic statement is below — I’ve added AS clauses to make the variables more clear. The modulus is, in this case, 5, but in the article case above would be 10. The “kind” is either “f” or “v”, these are your groups of queues. In this case they stand for “fruits” and “vegetables” but they might be numbers referring to articles. The “food” stands for the type of food eaten, but in the article scenario would represent the username or user id of the customer viewing the article.

REPLACE INTO q (id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1) AS id,
(COALESCE(MAX(id), -1) + 1) MOD 5 AS modulo,
'f' AS kind,
'apple' AS food
FROM q WHERE kind='f';

mysql> SELECT * FROM q order by kind,id;

id modulo kind food
0 0 f apple

As expected, 1 "fruit" row.

mysql> REPLACE INTO q(id, modulo, kind, food)
-> SELECT
-> (COALESCE(MAX(id), -1) + 1),
-> (COALESCE(MAX(id), -1) + 1) MOD 5,
-> 'f',
-> 'orange'
-> FROM q WHERE kind='f';
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0

mysql> SELECT * FROM q order by kind,id;

id modulo kind food
0 0 f apple
1 1 f orange

As expected, 2 "fruit" rows.


mysql> REPLACE INTO q(id, modulo, kind, food)
-> SELECT
-> (COALESCE(MAX(id), -1) + 1),
-> (COALESCE(MAX(id), -1) + 1) MOD 5,
-> 'v',
-> 'okra'
-> FROM q WHERE kind='v';
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0

mysql> SELECT * FROM q order by kind,id;

id modulo kind food
0 0 f apple
1 1 f orange
0 0 v okra

As expected, 2 "fruit" rows and 1 "vegetable" row. Now, let's quickly populate the table so the "fruit" group reaches its maximum of 5.

REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'v',
'squash'
FROM q WHERE kind='v';

REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'peach'
FROM q WHERE kind='f';

REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'cherries'
FROM q WHERE kind='f';

REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'pear'
FROM q WHERE kind='f';

REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'v',
'celery'
FROM q WHERE kind='v';

SELECT * FROM q order by kind,id;

id modulo kind food
0 0 f apple
1 1 f orange
2 2 f peach
3 3 f cherries
4 4 f pear
0 0 v okra
1 1 v squash
2 2 v celery

We have 5 values in the “fruit” group and 3 values in the “veggie” group. Now let’s see what happens when another fruit is added:

REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'banana'
FROM q WHERE kind='f';
Query OK, 2 rows affected (0.00 sec)
Records: 1 Duplicates: 1 Warnings: 0

Note that a duplicate has been found! This is because the modulo wrapped around. The id of “banana” is 5, and 5 modulo 5 = 0 – the same as 0 modulo 5, which was the modulo value previously taken by “apple”. So “apple” is pushed off the end of the queue.

SELECT * FROM q order by kind,id;

id modulo kind food
1 1 f orange
2 2 f peach
3 3 f cherries
4 4 f pear
5 0 f banana
0 0 v okra
1 1 v squash
2 2 v celery

To find the current list of all fruits, with the most recent fruit first, run:

SELECT * FROM q WHERE kind='f' ORDER BY id DESC;

id modulo kind food
5 0 f banana
4 4 f pear
3 3 f cherries
2 2 f peach
1 1 f orange

Let’s get back to the example of page views, though. We probably care about when the pages were viewed, so let’s add a timestamp:

ALTER TABLE q ADD COLUMN fed TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;

I ran the queries again, with some delays, so the timestamps wouldn’t all be the same.

SELECT * FROM q order by id,kind;

id modulo kind food fed
1 1 f orange 2007-01-15 14:48:25
2 2 f peach 2007-01-15 14:48:28
3 3 f cherries 2007-01-15 14:48:28
4 4 f pear 2007-01-15 14:48:31
5 0 f banana 2007-01-15 14:48:34
1 1 v squash 2007-01-15 14:48:28
2 2 v celery 2007-01-15 14:48:31
3 3 v beet 2007-01-15 14:48:31
4 4 v spinach 2007-01-15 14:48:34
5 0 v cucumber 2007-01-15 14:48:34

Or, what the query would be in a real system — find all fruits eaten and sort by time, most recent first:

SELECT food,fed FROM q WHERE kind='f' ORDER BY fed DESC;

banana 2007-01-15 14:48:34
pear 2007-01-15 14:48:31
peach 2007-01-15 14:48:28
cherries 2007-01-15 14:48:28
orange 2007-01-15 14:48:25

edit:
One edge case to be aware of — reaching the limit of the id field.

If your application does 100 of these per second, an UNSIGNED INT will last 1.36 years — not so great. You should use an UNSIGNED INT because you’re never going to have a negative number, and a signed int would only last just over 8 months if there were 100 REPLACEs per second.

(60 seconds/min, 60 min/hr, 24 hrs/day, 365 days/yr)
4294967295/60/60/24/365/100=1.3619251950

So, for scaling/known high traffic, use a BIGINT. However, in this case, do NOT use UNSIGNED, as all MySQL arithmetic is done with signed bigints or doubles. Not that it matters in this case; at 100 REPLACEs per second, you will wrap at 2.9 billion years:

9223372036854775807/60/60/24/365/100=2,924,712,086.7753601074

Let’s say your system does 10,000 of these REPLACEs per second, for eternity (our main database system, where we’re about to use this, averages 6,000 qps, not all writes, but it’s a good figure to use for our own numbers) — move the decimal place a couple of spots over and you’re down to running out of numbers in 29 million years.

That’s an OK limit for us. 🙂
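If you do go with BIGINT, widening the id column in the example table is a one-statement change (a sketch against the q table above; note that the ALTER rebuilds the table, so do it before the queue gets large):

ALTER TABLE q MODIFY id BIGINT NOT NULL;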

Statistics Gathering Script

I posted some code to the MySQL Forge at:

http://forge.mysql.com/snippets/view.php?id=52

Basically I gather some stats on my webservers in a very crude way. It runs a bunch of commands on the commandline (I run it every 5 minutes from a cron script). Please note that it adds to the load of the database server by running these commands, and it connects a few times to the database, so it actually increases things like “total # of connections” by the sheer fact that it runs.

This should run on most Unix machines, running most versions of MySQL. Please comment to this entry if you find something wrong or an incompatibility.

An explanation of the code follows:

This script will produce a string that looks like this:
(‘value’,’value’,’value’,123,’value’),
and all you have to do is add a line at the top: “INSERT INTO table (field1,field2,field3,field4,field5) VALUES”

and change the last comma to a semicolon. Then you will have something suitable for entering into a database, for later processing.

So here is an explanation of the lines of code, in order — the first few set variables:
#!/bin/sh
DB=db_here
USER=user_here
PASS=pass_here
HOST=`/bin/hostname | /bin/cut -f1 -d\.`

/bin/echo \(\'$HOST\',\'\

“echo” sets up the fact that you are printing all this out (later you >> to a file). \ is the escape character, and the line starts with “(“. Then we put the hostname in quotes, print a comma, and then an open quote. The final \ means “continue on the next line”, and it shows up on every line hereafter, so we can build a long string and then save it to a file.

`/bin/date +%Y-%m-%d\ %T`\',\

Putting backticks (`) around a command ensures that it gets run first, and then echo uses the output. Otherwise echo would simply print out “/bin/date +%Y…..”

This gets the date and time in MySQL’s date format (‘YYYY-MM-DD HH:MM:SS’), and puts in the ‘ to end the quoted field and a comma.

`/usr/bin/mysql -u $USER -p$PASS -e "\s"| /bin/awk '/Threads/ {print $2","$7","$20}'`,\

If you type \s at the database prompt, you will get some information that looks like this:


mysql> \s
--------------
mysql Ver 12.22 Distrib 4.0.20, for sun-solaris2.8 (sparc)
Connection id: 296486
Current database:
Current user: root@localhost
SSL: Not in use
Current pager: stdout
Using outfile: ''
Server version: 4.0.20-log
Protocol version: 10
Connection: Localhost via UNIX socket
Client characterset: latin1
Server characterset: latin1
UNIX socket: /tmp/mysql.sock
Uptime: 57 days 8 hours 6 min 25 sec
Threads: 1 Questions: 9331360 Slow queries: 32 Opens: 19372 Flush tables: 1 Open tables: 64 Queries per second avg: 1.884
--------------

This gets fields 2, 7 and 20 (fields are delimited by spaces) from the line marked “Threads” — field 2 is “1”, the number of threads; field 7 is “32”, which refers to the number of slow queries; and field 20 is “1.884”, which refers to “Queries per second avg”. So this line gets 3 values. Since they are numbers, they do not need to be quoted*.

`cat /proc/loadavg | /bin/awk '{print $1}'`,\

/proc has some great information in it, but you have to use cat to view the files. This gets the current 1-minute load average, the first of the load averages that “uptime” shows. Again, this is a number and does not need to be quoted.*

`cat /proc/meminfo | /bin/awk '/MemTotal|MemFree|SwapTotal|SwapFree/ {print $2","}'` \

This prints the 2nd field (and a comma separation) from /proc/meminfo from the lines that deal with total memory, free memory, total swap and free swap space. Again, these are numbers and do not require quoting.*

`/usr/bin/mysql -u $USER -p$PASS $DB -e "show table status\G" | /bin/awk '/InnoDB free:/ {print $4}' | /usr/bin/head -1`,\

This is a very hacky way to find how much space is left in the InnoDB tablespace. I use one set of InnoDB data files for all the InnoDB tables I have, so I can just take the free space from the first table, and that is OK. If you are using 1 file per table, you will want to get each value.

`/usr/bin/mysql -u $USER -p$PASS $DB -e "show status" | /bin/awk '/Bytes_received\W|Bytes_sent\W|Com_delete\W|Com_insert\W|Com_replace\W|Com_select\W|Com_update\W|Connections\W|Questions\W|Uptime\W/ {printf $2 ","}'`\

This gets lots of information from the SHOW STATUS command of MySQL. The Com_* commands are the counts, so Com_insert is the count of inserts (since the database instance started, so you can do Com_insert/Uptime to get inserts/second if you want).

Connections is the total number of connections (not to be confused with Threads, which is the current # of threads connected). Questions is total number of queries.
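For example, to pull the raw counters by hand and turn them into a rate (a minimal sketch; the GLOBAL keyword requires MySQL 5.0+, while on 4.x plain SHOW STATUS already returns the server-wide counters):

SHOW GLOBAL STATUS LIKE 'Com_insert';
SHOW GLOBAL STATUS LIKE 'Uptime';
-- inserts per second since startup = Com_insert / Uptime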

I wrote this in fits and starts, and above I parsed some output of “\s”. However, those numbers can be gotten from the “SHOW STATUS” variables Slow_queries and Threads_connected, and from Questions/Uptime. Because I am lazy, I have not changed my scripts, although you could decrease the amount of data MySQL sends and lessen the impact of this script with one less connection if you added it to this line.

There are TONS of variables in “SHOW STATUS” that are interesting and you might want to track. Decide what you like…and you can always add them in later!

`df -k /var | /bin/awk '/dev/ {print $2 "," $4}'`,\

This just looks at the space of /var, which is where my company keeps our mysql data files.

`df -k /var/log | /bin/awk '/dev/ {print $2 "," $4}'`,\

This just looks at the space of /var/log — where my company keeps our mysql logfiles.

`/usr/bin/mysql -u $USER -p$PASS $DB -e "SELECT COUNT(DISTINCT uid) FROM Sessions\G" | /bin/awk '/COUNT/ {print $3}' `,\

This gets information from a particular table in our system — it shows how many people are online. This is useful for us to gather because now we can see how the statistics we gather are related to how many people are logged in.

`/usr/bin/iostat -dk | /bin/awk '/sda/ {print $2","$5","$6}'`\),\

This requires the “sysstat” package, and this line retrieves the I/O (disk) transactions per second, the KB read and the KB written to the disk.

>> /home/care/`echo $HOST`perf.csv

And this is the last line — finally, at last, we save it to a file.

PLEASE feel free to modify this and make it cleaner……obviously you will want to apply it to your own situation and what you want to gather.

I run this as a local script on each database machine, because it gets information using the commandline. I save everything to a file and do a weekly import. Theoretically I could have it save directly to a reporting database, but I do not want any dependence on network, just in case. You may want to have the process more automatic.

Weekly I have a cron job to “rotate” the .csv file: copy the file locally, copy it to a remote server, and then delete all but the first line (the first line being “INSERT INTO tbl VALUES(…..”).

(for those interested, this is how I delete all but the first line. Oh so hacky….)
HOST=`/bin/hostname | /bin/cut -f1 -d\.`
head -1 /home/care/`echo $HOST`perf.csv > /tmp/temp; mv /tmp/temp /home/care/`echo $HOST`perf.csv;

And for those who care, the following is the first line of my file:

REPLACE INTO dbstats (serverName,statsTime,procs,slowQueries,queriespersec,serverLoad,MemTotal
,MemFree,SwapTotal,SwapFree,ibFree,bytesRec,bytesSent,deletes,inserts,replaces,selects,updates
,connections,queries,mysqlUptime,sizeKBvar,availKBvar,sizeKBvarlog,availKBvarlog,Sessions,tps,
kbRead,kbWrtn) VALUES


* While numbers do not need to be quoted, if there’s ever a chance that the number may end up blank instead of 0, it should be quoted, as MySQL does not handle “VALUES (‘foo’,,’bar’)” and throws an error.

The Trend of Managed Schemas: A Database is Not a Messaging System

This thread on the Boston MySQL User Group Board is getting interesting:
http://mysql.meetup.com/137/boards/view/viewthread?thread=2280640

(From the original poster:)

I think that nonequivalence comes from the past when the data sharing was a
rare situation. The data changes were always initiated by application and it
always knew about those changes. Now the situation is different. When the data
are shared between multiple remote applications you have to notify other
interested parties about those changes.

Currently databases are mostly used as “pull” components. If they had standard
“push” functionality they could compete with messaging systems with the advantages
of automatic data persistence and powerful query language.

(my response:)
Well, and that’s the problem — the paradigm *has* changed. MySQL is fast and reliable because it does NOT put things like messaging into their database, which Oracle and SQL Server do. A database is not a messaging system, it’s a database.

What effect would notification that there have been changes have on MVCC? I do wish there was a “pull” way to check if the data has changed.

The paradigm change of the application managing the schema causes this. I do not believe messaging is the correct way to handle this problem.

Consider the parallel to source code version control. Much like MVCC, you check out code, change it, and commit the code. Unlike many source code version control systems, though, MVCC (“data version control”) does not have the equivalent of an “update” command, except for doing another pull from the database. It would be great if there was an easy way to do a “diff” of what’s in the database versus what the application is changing, but that seems like it would be a programmatic thing (function or method), not a database thing.

And consider the database overhead and bandwidth….instead of just running queries, MySQL would have to somehow keep track of which thread has what data, and then notify every single thread that has that data, that it’s changed. The applications will have to be written to keep threads open longer, which will consume lots of resources. That’s lots more overhead for the database, and much more bandwidth, because there may be instances of the application that are using data that they do not care if it changed….so the messaging system would be wasting bandwidth, sending messages to instances that do not care. Although that could be mitigated by the application keeping a thread open when it cares about whether or not the data has changed.

Then again, I’m not fond of managed schema in the application…or at least, when the developers write that code. Seems to me it should be the DBA writing that code. It’s *very* useful for data consistency and integrity, which is a function of the DBA, not a developer.

What effects do you see the managed schema having on databases? Who should be responsible for writing a managed schema? Should a managed schema be used for database consistency within an application? Where is the line drawn between the application putting the required information into the database, and the database’s job of maintaining consistency and integrity?

It’s somewhat ironic, since for a long time MySQL advocated using the application to ensure the consistency and integrity (ie, before MySQL had a storage engine with foreign keys and transactions).

I often say that the biggest reason MySQL is a widely used database is because it is fast. A fast database can be complemented by an application that adds the features the database is missing; but a slow database that is full-featured cannot be made faster by an application. So it worries me when folks request very specialized systems such as a messaging server (or ANY “push” system) into the database, because that could be easily done with a “pull” mechanism, only using the bandwidth needed by the instances of the applications that care. Otherwise, it will end up adding Microsoft-level bloat to a really nice and fast program.

Stopwords in Multiple/Other Languages

Life is slowly returning to order once again. I am attempting to slog through almost 1,000 messages in my MySQL folder, most of which are list questions that have already been answered, so it does not take long to get through them. However, occasionally I find a question that has not been answered, or a gem of a question that I want to expose to a wider audience.

This question fell under both categories. Basically, someone wanted stopwords in other languages, and wondered if there was a place to get them. (English stopwords can be found at http://dev.mysql.com/doc/refman/5.0/en/fulltext-stopwords.html.)

I did a quick web search and found a site that has a bunch of language stopwords:
http://www.ranks.nl/stopwords/

Catalan stopwords
Czech stopwords
Danish stopwords
Dutch stopwords
French stopwords
English stopwords (default)
German stopwords
Hungarian stopwords
Italian stopwords
Norwegian stopwords
Polish stopwords
Portuguese stopwords
Spanish stopwords
Turkish stopwords

I hope this helps some folks…..

And now, a trickier question — if there are folks doing fulltext matching in other languages, what is your list of stopwords, or where did you get it from? (I am very sure there are tons of sites in each native language that list stopwords, but for admins who do not speak every language their application supports, they can be hard to find!)

What about folks doing searches within a field that may contain multiple languages? Have you created a file that includes the stopwords for all the languages your application supports? If you have not, should you?

(The documentation on how to change the stopword file parameter is at http://dev.mysql.com/doc/refman/5.0/en/fulltext-fine-tuning.html)
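As a rough sketch of what that looks like (the file path and table name below are just placeholders; ft_stopword_file is a server option, so it goes in my.cnf, the server must be restarted, and existing MyISAM fulltext indexes have to be rebuilt afterwards):

-- check the current setting:
SHOW VARIABLES LIKE 'ft_stopword_file';
-- the option itself is set at server startup, e.g. in my.cnf:
--   [mysqld]
--   ft_stopword_file = /path/to/combined_stopwords.txt
-- after restarting, rebuild the fulltext index(es):
--   REPAIR TABLE your_table QUICK;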
