Japanese Character Set

There are too many Japanese characters to be able to use one byte to handle all of them.

Hiragana — over 50 characters

Katakana — over 50 characters

Kanji — over 6,000 characters

So the Japanese Character set has to be multi-byte. JIS=Japan Industrial Standard, this specifies it.

JIS X 0208 in 1990, updated in 1997 — covers widely used characters, not all characters
JIS X 0213 in 2000, updated in 2004

There are also vendor defined Japanese charsets — NEC Kanji and IBM Kanji — these supplement JIS X 0208.

Cellphone specific symbols have been introduced, so the # of characters is actually increasing!

For JIS X 0208, there are multiple encodings — Shift_JIS (all characters are 2 bytes), EUC-JP (most are 2 bytes, some are 3 bytes), and Unicode (all characters are 3 bytes, this makes people not want to use UTF-8 for ). Shift_JIS is most widely used, but they are moving to Unicode gradually (Vista is using UTF-8 as the standard now). Each code mapping is different, with different hex values for the same character in different encodings.

Similarly, there are different encodings for the other charsets.

MySQL supports only some of these. (get the graph from the slides)

char_length() returns the length by # of characters, length() returns the length by # of bytes.

The connection charset and the server charset have to match otherwise…mojibake!

Windows — Shift_JIS is standard encoding, linux EUC-JP is standard. So conversion may be needed.

MySQL Code Conversion algorithm — UCS-2 facilitates conversion between encodings. MySQL converts mappings to and from UCS-2. If client and server encoding are the same, there’s no conversion. If the conversion fails (ie, trying to convert to latin1), the character is converted to ? and you get mojibake.

You can set a my.cnf paramater for “skip-character-set-client-handshake”, this forces the use of the server side charset (for the column(s) in question).

Issues:

Unicode is supposed to support worldwide characters.

UCS-2 is 2-byte fixed length, takes 2^16 = 65,536 characters. This is one Basic Multilingual Plane (BMP). Some Japanese (and Chinese) characters are not covered by UCS-2. Windows Visa supports JIS X 0213:2004 as a standard character set in Japan (available for Windows XP with the right )

UCS-4 is 4-byte fixed length, can encode 2^31 characters (~2 billion) This covers many BMP’s (128?)

UTF-16 is 2 or 4 byte length, all UCS-2 are mapped to 2 bytes, not all UCS-4 characters are supported — 1 million are. Supported UCS-4 characters are mapped to 4 bytes

UTF-8 from 1-6 bytes is fully compliant with UCS-4. This is out of date. 1-4 byte UTF-8 is fully compliant with UTF-16. From 1-3 bytes, UTF-8 is compliant with UCS-2.

MySQL interally handles all characters as UCS-2, UCS-4 is not supported. This is not enough. Plus, UCS-2 is not supported for client encoding. UTF-8 support is up to 3 bytes — this is not just a MySQL problem though.

CREATE TABLE t1 (c1 VARCHAR(30)) CHARSET=utf8;
INSERT INTO T1 VALUES (0x6162F0A0808B63646566); — this inserts ‘ab’ + 4-byte UTF-8 translation of cdef

SELECT c1,HEX(c1) from t1;
if you get ab,6162 back it means that the invalid character was truncated. MySQL does throw up a warning for this.

Possible workarounds — using VARBINARY/BLOB types. Can store any binary data, but this is always case-sensitive (and yes, Japanese characters do have case). FULLTEXT index is not supported, and application code may need to be modified to handle UTF-8 — ie, String.getBytes may need “UTF-8” parameter in it.

Alternatively, use UCS-2 for column encoding:

CREATE TABLE t1 (c1 VARCHAR(30)) CHARSET=ucs2;

INSERT INTO t1 VALUES (_utf8 0x6162F0A0808B63646566);

SELECT … now gives you ?? instead of truncating.

Another alternative: use Shift_JIS or EUC-JP. Code conversion of JIS X 0213:2004 characters is not currently supported.

Shift_JIS is the most widely used encoding, 1 or 2 byte encoding. All ASCII and 1/2 width katakana are 1-byte, the rest are 2-byte. If the first byte value is between 0x00 and 0x7F it’s ASCII 1 byte, 0XA0 – 0XDf is 1-byte, 1/2 width katakana. all the rest are 2-byte characters.

The 2nd byte might be in the ASCII graphic code area 0x40 for example.

0x5C is the escape sequence (backslash in the US). Some Shift_JIS characters contain 0x5C in the 2nd byte. If the charset is specified incorrectly, you’ll end up getting different values — for instance, hex value 0X5C6e will conver to hex value 0X0A. The backslash at the end of the string, hex value 0X5C, will be removed (truncated) if charset is specified incorrectly.

Native MySQL does not support FULLTEXT search in Japanese, Korean and Chinese (CJK issue).

Japanese words do not delimit by space, so it can’t work. 2 ways to do this: dictionary based indexing, dividing words using a pre-installed dictionary. Also N-gram indexing — divide text by N letters (n could be 1, 2, 3 etc). MySQL + Senna implements this, supported by Sumisho Computer Systems.

There are too many Japanese characters to be able to use one byte to handle all of them.

Hiragana — over 50 characters

Katakana — over 50 characters

Kanji — over 6,000 characters

So the Japanese Character set has to be multi-byte. JIS=Japan Industrial Standard, this specifies it.

JIS X 0208 in 1990, updated in 1997 — covers widely used characters, not all characters
JIS X 0213 in 2000, updated in 2004

There are also vendor defined Japanese charsets — NEC Kanji and IBM Kanji — these supplement JIS X 0208.

Cellphone specific symbols have been introduced, so the # of characters is actually increasing!

For JIS X 0208, there are multiple encodings — Shift_JIS (all characters are 2 bytes), EUC-JP (most are 2 bytes, some are 3 bytes), and Unicode (all characters are 3 bytes, this makes people not want to use UTF-8 for ). Shift_JIS is most widely used, but they are moving to Unicode gradually (Vista is using UTF-8 as the standard now). Each code mapping is different, with different hex values for the same character in different encodings.

Similarly, there are different encodings for the other charsets.

MySQL supports only some of these. (get the graph from the slides)

char_length() returns the length by # of characters, length() returns the length by # of bytes.

The connection charset and the server charset have to match otherwise…mojibake!

Windows — Shift_JIS is standard encoding, linux EUC-JP is standard. So conversion may be needed.

MySQL Code Conversion algorithm — UCS-2 facilitates conversion between encodings. MySQL converts mappings to and from UCS-2. If client and server encoding are the same, there’s no conversion. If the conversion fails (ie, trying to convert to latin1), the character is converted to ? and you get mojibake.

You can set a my.cnf paramater for “skip-character-set-client-handshake”, this forces the use of the server side charset (for the column(s) in question).

Issues:

Unicode is supposed to support worldwide characters.

UCS-2 is 2-byte fixed length, takes 2^16 = 65,536 characters. This is one Basic Multilingual Plane (BMP). Some Japanese (and Chinese) characters are not covered by UCS-2. Windows Visa supports JIS X 0213:2004 as a standard character set in Japan (available for Windows XP with the right )

UCS-4 is 4-byte fixed length, can encode 2^31 characters (~2 billion) This covers many BMP’s (128?)

UTF-16 is 2 or 4 byte length, all UCS-2 are mapped to 2 bytes, not all UCS-4 characters are supported — 1 million are. Supported UCS-4 characters are mapped to 4 bytes

UTF-8 from 1-6 bytes is fully compliant with UCS-4. This is out of date. 1-4 byte UTF-8 is fully compliant with UTF-16. From 1-3 bytes, UTF-8 is compliant with UCS-2.

MySQL interally handles all characters as UCS-2, UCS-4 is not supported. This is not enough. Plus, UCS-2 is not supported for client encoding. UTF-8 support is up to 3 bytes — this is not just a MySQL problem though.

CREATE TABLE t1 (c1 VARCHAR(30)) CHARSET=utf8;
INSERT INTO T1 VALUES (0x6162F0A0808B63646566); — this inserts ‘ab’ + 4-byte UTF-8 translation of cdef

SELECT c1,HEX(c1) from t1;
if you get ab,6162 back it means that the invalid character was truncated. MySQL does throw up a warning for this.

Possible workarounds — using VARBINARY/BLOB types. Can store any binary data, but this is always case-sensitive (and yes, Japanese characters do have case). FULLTEXT index is not supported, and application code may need to be modified to handle UTF-8 — ie, String.getBytes may need “UTF-8” parameter in it.

Alternatively, use UCS-2 for column encoding:

CREATE TABLE t1 (c1 VARCHAR(30)) CHARSET=ucs2;

INSERT INTO t1 VALUES (_utf8 0x6162F0A0808B63646566);

SELECT … now gives you ?? instead of truncating.

Another alternative: use Shift_JIS or EUC-JP. Code conversion of JIS X 0213:2004 characters is not currently supported.

Shift_JIS is the most widely used encoding, 1 or 2 byte encoding. All ASCII and 1/2 width katakana are 1-byte, the rest are 2-byte. If the first byte value is between 0x00 and 0x7F it’s ASCII 1 byte, 0XA0 – 0XDf is 1-byte, 1/2 width katakana. all the rest are 2-byte characters.

The 2nd byte might be in the ASCII graphic code area 0x40 for example.

0x5C is the escape sequence (backslash in the US). Some Shift_JIS characters contain 0x5C in the 2nd byte. If the charset is specified incorrectly, you’ll end up getting different values — for instance, hex value 0X5C6e will conver to hex value 0X0A. The backslash at the end of the string, hex value 0X5C, will be removed (truncated) if charset is specified incorrectly.

Native MySQL does not support FULLTEXT search in Japanese, Korean and Chinese (CJK issue).

Japanese words do not delimit by space, so it can’t work. 2 ways to do this: dictionary based indexing, dividing words using a pre-installed dictionary. Also N-gram indexing — divide text by N letters (n could be 1, 2, 3 etc). MySQL + Senna implements this, supported by Sumisho Computer Systems.

Comments are closed.