When running unit tests in Django, I was getting a strange MySQL failure when attempting to insert non-ASCII Unicode characters into the database, for example:
Warning: Incorrect string value: '\xE2\x89\xA5 %' for column 'value' at row 1
What is happening is that Django creates a new schema from scratch for testing. This new schema picks up the MySQL defaults. All my test tables ended up with Latin-1 encoding instead of UTF-8 encoding.
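To see why a Latin-1 table rejects this value, it helps to decode the bytes quoted in the warning. The following is a minimal sketch in Python; the variable names are illustrative only.

```python
# The bytes MySQL complained about, copied from the warning above.
raw = b"\xE2\x89\xA5 %"

# Decoded as UTF-8, the first three bytes form a single character.
text = raw.decode("utf-8")
print(text)                # ≥ %
print(hex(ord(text[0])))   # 0x2265

# Latin-1 has no slot for that character, so encoding fails.
try:
    text.encode("latin-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False
print(fits_latin1)         # False
```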
I needed to change mysqld to default to Unicode internally so that Django would run unit tests involving non-ASCII text correctly.
In /etc/my.cnf I added the following:
[code lang="c" light="true"]
[mysqld]
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
character-set-server = utf8
[/code]
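A possible alternative, not used above, is to tell Django which character set to use when it creates the test database. This is only a sketch: the database name and user are hypothetical placeholders, and the `TEST` dictionary with its `CHARSET` and `COLLATION` keys assumes Django 1.7 or later with the MySQL backend.

```python
# settings.py (fragment). NAME and USER are hypothetical placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "myproject",
        "USER": "myuser",
        # Applied only to the database Django creates for tests.
        "TEST": {
            "CHARSET": "utf8",
            "COLLATION": "utf8_unicode_ci",
        },
    }
}
```

This keeps the setting in the project rather than on the server, which may matter when the tests run on machines whose my.cnf cannot be changed.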
Internationalization and localization have been long-time interests of mine, and I’ve had the privilege to delve deep into them on a large project for the last year. I submitted the following comment on the Real-World Haskell site in hopes that it would be of value for future readers.
In response to Phil’s question, and for general information for those not familiar with Unicode:
Unicode consists of “code points”: numeric values that each represent a character or meta-character. This is purely an abstraction that dictates no specific digital representation.
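As a small illustration of a code point being nothing more than a number, here is a sketch using Python's built-in `ord()` and `chr()`. The character chosen (U+2265) is just an example.

```python
# A code point is just a number; ord() and chr() convert in each direction.
ge = "\u2265"                  # GREATER-THAN OR EQUAL TO, written by code point
code_point = ord(ge)
print(code_point)              # 8805
print(f"U+{code_point:04X}")   # U+2265
print(chr(0x2265) == ge)       # True
```

Nothing here says how the number is stored; that is the job of the encodings.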
How those numeric values are encoded and stored digitally is a separate issue. The three main encodings follow.
UTF-32 uses 32-bit “characters” to store each code point. A pro is that each “character” corresponds exactly to one code point. A con is that for English text, this quadruples the storage required compared with ASCII, filling memory with a lot of zeroes.
UTF-16 uses 16-bit “characters” to encode code points. Each code point takes either one or two “characters”, with the most common values requiring only one. A pro is that it’s more space-efficient than UTF-32. A con is that it’s a variable-length encoding.
UTF-8 uses 8-bit “characters” to encode code points. It’s designed to be a superset of ASCII. Each code point takes between one and four bytes. A pro is that plain ASCII is already valid UTF-8; no conversion is required. A con is that it’s a variable-length encoding, and it is less space-efficient than UTF-16 for Asian languages (three bytes rather than two for most Asian text).
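The size trade-offs described above can be checked directly. This sketch uses Python's standard codecs; the sample characters and the helper name `encoded_sizes` are my own choices for illustration.

```python
samples = {
    "A": "A",              # U+0041, plain ASCII
    "zhong": "\u4e2d",     # U+4E2D, a common CJK character
    "grin": "\U0001F600",  # U+1F600, outside the 16-bit range
}

def encoded_sizes(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32."""
    # The big-endian variants are used so no byte-order mark is counted.
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return tuple(len(ch.encode(enc)) for enc in encodings)

for name, ch in samples.items():
    print(name, encoded_sizes(ch))
# A (1, 2, 4)
# zhong (3, 2, 4)
# grin (4, 4, 4)
```

The last row shows UTF-16 needing two 16-bit units for a single code point, which is what makes it variable-length.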
In my internationalization and localization work, I have to ensure that UTF-8 is being correctly generated and stored.
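One simple check, sketched here in Python, is to attempt a strict decode of the stored bytes. The helper `is_valid_utf8` is a hypothetical name for illustration, not part of any library.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")   # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b"\xe2\x89\xa5"))   # True: the three bytes of U+2265
print(is_valid_utf8(b"\xe2\x89"))       # False: truncated sequence
print(is_valid_utf8(b"\xa5"))           # False: stray continuation byte
```

A check like this only confirms the bytes are well-formed; it cannot tell whether they are the characters that were intended.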
The hixie.ch UTF-8 decoder is very flexible on the input it accepts and shows the internals of the decoding.
The rishida.net conversion tool is useful for converting formats.