When running unit tests in Django, I was getting a strange MySQL failure when attempting to insert non-ASCII Unicode characters into the database, for example:
Warning: Incorrect string value: '\xE2\x89\xA5 %' for column 'value' at row 1
Django creates a new schema from scratch for testing, and that schema picks up the MySQL server defaults. As a result, all my test tables ended up with Latin-1 encoding instead of UTF-8.
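To see why a Latin-1 table rejects the value, it helps to decode the bytes quoted in the warning. Here is a minimal Python sketch; Python's `latin-1` codec stands in for MySQL's latin1 character set, which is only an approximation.

```python
# Bytes quoted in the MySQL warning.
raw = b"\xe2\x89\xa5 %"

# They are the UTF-8 encoding of U+2265 (greater-than-or-equal sign) followed by " %".
text = raw.decode("utf-8")
print(text)  # ≥ %

# Latin-1 has no slot for U+2265, so the value cannot be represented in that encoding.
try:
    text.encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print(fits_in_latin1)  # False
```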
I needed to change mysqld to default to UTF-8 internally so that Django would run unit tests involving Unicode correctly.
In /etc/my.cnf I added the following (server options like these belong in the [mysqld] section):
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
character-set-server = utf8
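An alternative to changing the server defaults is sketched here. It is an assumption, not what was done above. Django's `DATABASES` setting accepts a `TEST` dictionary whose `CHARSET` and `COLLATION` keys control how the test database is created. The database name and user below are hypothetical placeholders, and older Django releases spelled these settings differently, so check the documentation for your version.

```python
# settings.py fragment (hypothetical names; not the configuration used in this post).
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "myproject",   # hypothetical database name
        "USER": "username",    # hypothetical user
        "TEST": {
            # Character set and collation used when Django creates the test database.
            "CHARSET": "utf8",
            "COLLATION": "utf8_unicode_ci",
        },
    }
}
```

This keeps the fix inside the project rather than in the server configuration, which may matter when the tests run on machines whose my.cnf you do not control.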
Internationalization and localization have been long-time interests of mine, and I’ve had the privilege to delve deep into them on a large project over the last year. I submitted the following comment on the Real-World Haskell site in hopes that it would be of value for future readers.
In response to Phil’s question, and for general information for those not familiar with Unicode:
Unicode consists of “code points”: numeric values that each represent a character or meta-character. This is purely an abstraction that dictates no specific digital representation.
How those numeric values are encoded and stored digitally is a separate issue. The three main encoding systems follow.
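The split between a code point and its encoding can be illustrated with a short Python sketch. The euro sign is an arbitrary choice of example character.

```python
ch = "€"

# The code point is just a number, independent of any byte representation.
code_point = ord(ch)
print(f"U+{code_point:04X}")  # U+20AC

# The same code point yields different bytes under each encoding.
utf8 = ch.encode("utf-8")       # b'\xe2\x82\xac'
utf16 = ch.encode("utf-16-be")  # b'\x20\xac'
utf32 = ch.encode("utf-32-be")  # b'\x00\x00\x20\xac'
print(utf8.hex(), utf16.hex(), utf32.hex())  # e282ac 20ac 000020ac
```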
UTF-32 uses 32-bit “characters” to store each code point. A pro is that each “character” corresponds exactly to a code point. A con is that for English text, this quadruples the amount of storage space required, filling memory with a lot of zeroes.
UTF-16 uses 16-bit “characters” to encode code points. It requires either one or two “characters” per code point to cover the entire Unicode set, with the most common values requiring a single “character”. A pro is that it’s more space-efficient than UTF-32. A con is that it’s a variable-length encoding system.
UTF-8 uses 8-bit “characters” to encode code points. It’s designed to be a superset of ASCII. It requires between one and four bytes per code point to cover the entire Unicode set. A pro is that plain ASCII is UTF-8; no conversion is required. A con is that it’s a variable-length encoding, and it is less space-efficient than UTF-16 for East Asian languages (requiring three bytes rather than two for most such text).
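The space trade-offs above can be checked with a small Python sketch. The sample strings are arbitrary, and the `-le` codec variants are used so that no byte-order mark inflates the counts.

```python
samples = {
    "English": "hello",
    "Japanese": "日本語",
    "Emoji": "\U0001F600",  # a code point outside the 16-bit range
}

def encoded_sizes(text):
    """Return byte counts for text in UTF-8, UTF-16, and UTF-32 (no byte-order mark)."""
    return tuple(len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for label, text in samples.items():
    print(label, encoded_sizes(text))
# English (5, 10, 20)
# Japanese (9, 6, 12)
# Emoji (4, 4, 4)
```

The emoji row shows the variable-length side of UTF-16: a single code point takes two 16-bit “characters”.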
In my internationalization and localization work, I have to ensure that UTF-8 is being correctly generated and stored. Two online tools have been useful for that.
The hixie.ch UTF-8 decoder is very flexible about the input it accepts and shows the internals of the decoding.
The rishida.net conversion tool is useful for converting formats.
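For a quick local check along the same lines, a strict decode in Python will reject malformed UTF-8. This is only a sketch; `describe_utf8` is a hypothetical helper name and is not related to either tool above.

```python
def describe_utf8(data: bytes) -> str:
    """List the code points in data, or report where strict UTF-8 decoding fails."""
    try:
        text = data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        return f"invalid UTF-8 at byte offset {exc.start}: {exc.reason}"
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

print(describe_utf8(b"\xe2\x89\xa5"))  # U+2265
print(describe_utf8(b"\xe2\x89"))      # reports an error: the sequence is truncated
```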
The MySQL command line tool does not correctly handle UTF-8 encoded source files by default. If you use the ‘source’ command, Japanese turns into garbage. 🙁
To fix this, use the --default-character-set=utf8 option, e.g.
$ mysql -u username --default-character-set=utf8 -p
This should allow you to use the ‘source’ command to import Japanese or any other non-ASCII text into MySQL without trouble.
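The kind of garbage described above is what you get when UTF-8 bytes are read as a single-byte encoding. Here is a minimal Python sketch of the effect, with Python's `latin-1` codec again standing in as an approximation of a Latin-1 client connection:

```python
original = "日本語"

# The source file holds UTF-8 bytes: three bytes for each of these characters.
utf8_bytes = original.encode("utf-8")

# A reader that assumes Latin-1 turns every byte into its own character.
garbled = utf8_bytes.decode("latin-1")
print(len(original), len(garbled))  # 3 9

# No information is lost at this stage: reversing the mistake restores the text.
recovered = garbled.encode("latin-1").decode("utf-8")
print(recovered == original)  # True
```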