Tag Archive utf-8

Setting MySQL to Default to Unicode

When running unit tests in Django, I was getting a strange MySQL failure when attempting to insert non-ASCII Unicode characters into the database, for example:

[code light="true"]
Warning: Incorrect string value: '\xE2\x89\xA5 %' for column 'value' at row 1
[/code]

What is happening is that Django creates a new schema from scratch for testing. This new schema picks up the MySQL defaults. All my test tables ended up with Latin-1 encoding instead of UTF-8 encoding.
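The bytes quoted in that warning can be inspected outside the database. As a minimal sketch in Python (used here only for illustration, since the project is Django), decoding them as UTF-8 recovers the character that was being inserted, and that same character turns out to have no representation in Latin-1:

```python
# The byte sequence quoted in the MySQL warning.
raw = b"\xE2\x89\xA5 %"

# Interpreted as UTF-8, the first three bytes form a single character.
text = raw.decode("utf-8")
print(text)

# Latin-1 has no slot for that character, so encoding it fails.
try:
    text.encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print(fits_in_latin1)
```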

I needed to change mysqld to default to Unicode internally so that Django would run unit tests involving non-ASCII characters correctly.

In /etc/my.cnf I added the following:

[code lang="c" light="true"]
[client]
default-character-set=utf8

[mysql]
default-character-set=utf8

[mysqld]
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
character-set-server = utf8
[/code]
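A per-project alternative, which is not what was done above, is to declare the character set for the test database in the Django settings instead of on the server. The fragment below is only a sketch: it assumes a Django release that supports the nested `TEST` dictionary inside `DATABASES` (older releases used differently named keys), and the database name and user are placeholders.

```python
# settings.py fragment (sketch). NAME and USER are placeholder values.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "myproject",  # hypothetical database name
        "USER": "myuser",     # hypothetical user
        # Passed through to the MySQL driver for regular connections.
        "OPTIONS": {"charset": "utf8"},
        # Used when Django creates the throwaway test database.
        "TEST": {
            "CHARSET": "utf8",
            "COLLATION": "utf8_unicode_ci",
        },
    }
}
```

This keeps the fix in version control with the project, at the cost of leaving other schemas on the same server with the Latin-1 default.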

h/t stackoverflow


What is Unicode?

Internationalization and localization have been long-time interests of mine, and I’ve had the privilege of delving deep into them on a large project over the last year. I submitted the following comment on the Real-World Haskell site in the hope that it would be of value to future readers.

In response to Phil’s question, and for general information for those not familiar with Unicode:

Unicode consists of “code points”: numeric values that each represent a character or meta-character. This is purely an abstraction that dictates no specific digital representation.
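To make the abstraction concrete, here is a small sketch in Python (my choice of language for illustration; the original comment was written for a Haskell audience). `ord` and `chr` convert between a character and its code point without saying anything about how it is stored:

```python
ch = "\u2265"  # the character GREATER-THAN OR EQUAL TO

# ord() gives the code point as a plain integer.
code_point = ord(ch)
print(f"U+{code_point:04X}")

# chr() goes the other way, from number back to character.
round_trip = chr(code_point)
print(round_trip == ch)
```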

How those numeric values are encoded and stored digitally is a separate issue. The main three encoding systems follow.

UTF-32 uses 32-bit “characters” to store each code point. A pro is that each “character” corresponds exactly to a code point. A con is that for English text, this quadruples the amount of storage space required compared with ASCII, filling memory with a lot of zeroes.

UTF-16 uses 16-bit “characters” to encode code points. It requires either one or two “characters” per code point to cover the entire Unicode set, with the most common values requiring a single “character”. A pro is that it’s more space-efficient than UTF-32. A con is that it’s a variable-length encoding system.

UTF-8 uses 8-bit “characters” to encode code points. It’s designed to be a superset of ASCII. It requires between one and four bytes per code point to cover the entire Unicode set. A pro is that plain ASCII is UTF-8; no conversion is required. A con is that it’s a variable-length encoding, and is inefficient for Asian languages compared to UTF-16 (requiring three bytes rather than two for most Asian text).
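The size trade-offs described above can be checked directly. This is a minimal sketch in Python; the four sample characters are my own picks (an ASCII letter, an accented Latin letter, a Japanese character, and an emoji from outside the 16-bit range), and the helper name `encoded_sizes` is hypothetical.

```python
samples = ["A", "\u00e9", "\u65e5", "\U0001F600"]

def encoded_sizes(ch):
    """Return the number of bytes ch occupies in UTF-8, UTF-16 and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # explicit byte order, so no BOM is added
        len(ch.encode("utf-32-le")),
    )

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The Japanese sample takes three bytes in UTF-8 but two in UTF-16, and the emoji takes four bytes in all three encodings.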


Handy Unicode Tools

In my internationalization and localization work, I have to ensure that UTF-8 is being correctly generated and stored.

The hixie.ch UTF-8 decoder is very flexible on the input it accepts and shows the internals of the decoding.

The rishida.net conversion tool is useful for converting formats.
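For quick checks when a browser is not handy, a few lines of Python give a rough approximation of what such tools display. This is only a sketch, not a reimplementation of either site; the function name `describe` is hypothetical.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, UTF-8 bytes in hex) for each character."""
    rows = []
    for ch in text:
        utf8_hex = " ".join(f"{byte:02X}" for byte in ch.encode("utf-8"))
        rows.append((
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),  # fallback for unnamed code points
            utf8_hex,
        ))
    return rows

for row in describe("\u2265 %"):
    print(row)
```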

 


Importing UTF-8 into MySQL

The MySQL command line tool does not correctly handle UTF-8 encoded source files by default. If you use the ‘source’ command, Japanese turns into garbage. 🙁

To fix this, use the --default-character-set=utf8 option, e.g.

$ mysql -u username --default-character-set=utf8  -p

This should allow you to use the ‘source’ command to import Japanese or other non-ASCII text into MySQL without trouble.
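The option only helps if the file really is UTF-8, so it can be worth checking the file before importing it. The following is a minimal sketch in Python; the function name is hypothetical, and it reads the whole file into memory, which assumes the dump is reasonably small.

```python
def first_invalid_utf8_offset(path):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start  # offset of the first offending byte
    return None
```

A result of `None` means the file decodes cleanly as UTF-8; any other value points at the first byte to look at.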

