Setting MySQL to Default to Unicode

When running unit tests in Django, I was getting a strange MySQL failure when attempting to insert non-ASCII Unicode characters into the database, for example:

[code light="true"]
Warning: Incorrect string value: '\xE2\x89\xA5 %' for column 'value' at row 1
[/code]

What is happening is that Django creates a new schema from scratch for testing. This new schema picks up the MySQL defaults. All my test tables ended up with Latin-1 encoding instead of UTF-8 encoding.
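The bytes quoted in the warning can be checked directly. A minimal sketch in Python (the variable names are illustrative and not part of the original setup) suggesting that they are the UTF-8 encoding of a single character which Latin-1 has no way to represent:

```python
# The byte sequence quoted in the MySQL warning, minus the trailing " %".
raw = b"\xe2\x89\xa5"

# Decoding as UTF-8 yields one character: U+2265, the "greater than or equal to" sign.
char = raw.decode("utf-8")

# Latin-1 only covers code points 0-255, so this character cannot be stored in it.
try:
    char.encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print(hex(ord(char)), fits_in_latin1)
```

This matches the symptom: a Latin-1 column receives bytes it cannot map to any of its characters.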

I needed to change mysqld to default to UTF-8 internally so that Django's unit tests involving non-ASCII characters would run correctly.

In /etc/my.cnf I added the following:

[code lang="c" light="true"]
[mysqld]
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
character-set-server = utf8
[/code]

h/t stackoverflow
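If changing the server defaults is not an option, Django can also be told which character set to use when it creates the test database. This is a sketch of a settings fragment, not the approach I actually used. It assumes a Django version that supports the `TEST` sub-dictionary in `DATABASES` (older releases used `TEST_CHARSET` and `TEST_COLLATION` instead). The database name and user are placeholders:

```python
# settings.py fragment (hypothetical project; NAME and USER are placeholders).
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "myproject",
        "USER": "myuser",
        # Applied only when the test runner creates the test database.
        "TEST": {
            "CHARSET": "utf8",
            "COLLATION": "utf8_unicode_ci",
        },
    }
}
```

This only affects the test schema; existing tables keep whatever encoding they were created with.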

What is Unicode?

Internationalization and localization have been long-time interests of mine, and I’ve had the privilege to delve deep into them with a large project for the last year. I submitted the following comment on the Real-World Haskell site in hopes that it would be of value for future readers.

In response to Phil’s question, and for general information for those not familiar with Unicode:

Unicode consists of “code points”: numeric values that each represent a character or meta-character. This is purely an abstraction; it dictates no specific digital representation.
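To make the abstraction concrete, here is a small Python sketch (Python chosen only for illustration; the variable names are mine) that moves between a character and its code point without involving any byte encoding at all:

```python
import unicodedata

# A code point is just an integer assigned to a character.
code_point = ord("≥")          # character -> number
name = unicodedata.name("≥")   # the official Unicode name for that character
same_char = chr(0x2265)        # number -> character

print(f"U+{code_point:04X} {name}")
```

Nothing here says how the number 0x2265 would be laid out in memory or on disk; that is the job of an encoding.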

How those numeric values are encoded and stored digitally is a separate issue. The three main encodings follow.

UTF-32 uses 32-bit “characters” to store each code point. A pro is that each “character” corresponds exactly to one code point. A con is that English text takes four times the storage it would in ASCII, filling memory with a lot of zeroes.
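A rough illustration of both points in Python, using the standard `utf-32-be` codec (big-endian, chosen here so that no byte-order mark is prepended); the sample string is arbitrary:

```python
text = "hello"

ascii_bytes = text.encode("ascii")
utf32_bytes = text.encode("utf-32-be")  # explicit byte order, so no BOM is added

# Every code point occupies exactly four bytes: 5 bytes become 20.
print(len(ascii_bytes), len(utf32_bytes))

# The first character, "h" (0x68), is stored as three zero bytes plus 0x68.
first_unit = utf32_bytes[:4].hex()

# Fixed width means the n-th character sits at byte offset 4 * n.
third = utf32_bytes[4 * 2 : 4 * 3].decode("utf-32-be")
```

The constant-time indexing on the last line is the payoff for all those zero bytes.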

UTF-16 uses 16-bit “characters” to encode code points. Each code point requires either one or two “characters”, with the most common values requiring a single “character”. A pro is that it’s more space-efficient than UTF-32. A con is that it’s a variable-length encoding.
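The one-or-two rule can be sketched by hand. The helper below, `utf16_units`, is a hypothetical name of my own; it applies the standard surrogate-pair arithmetic for code points above U+FFFF and does no validation of its input:

```python
def utf16_units(code_point):
    """Return the list of 16-bit units UTF-16 uses for one code point."""
    if code_point < 0x10000:
        # Common case: the code point fits in a single 16-bit unit.
        return [code_point]
    # Otherwise split the 20-bit offset into a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


single = utf16_units(0x2265)   # "≥" needs one unit
pair = utf16_units(0x1F600)    # an emoji outside the 16-bit range needs two

# Cross-check against Python's own codec.
codec_bytes = "\U0001F600".encode("utf-16-be")
print([hex(u) for u in pair], codec_bytes.hex())
```

Any code that indexes UTF-16 text by “character” has to account for the second case.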

UTF-8 uses 8-bit “characters” to encode code points. It’s designed to be a superset of ASCII. Each code point requires between one and four bytes. A pro is that plain ASCII is already valid UTF-8; no conversion is required. A con is that it’s a variable-length encoding, and it is less efficient than UTF-16 for Asian languages (requiring three bytes rather than two for most Asian text).
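The trade-off is easy to measure. A short Python comparison, with sample strings picked arbitrarily for illustration:

```python
samples = {"English": "hello", "Japanese": "日本語"}

# For each sample, record (UTF-8 byte count, UTF-16 byte count).
sizes = {
    label: (len(text.encode("utf-8")), len(text.encode("utf-16-le")))
    for label, text in samples.items()
}

# Plain ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_is_utf8 = "hello".encode("ascii") == "hello".encode("utf-8")

for label, (utf8_len, utf16_len) in sizes.items():
    print(f"{label}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

English text comes out at half the size in UTF-8, while the three Japanese characters take nine bytes in UTF-8 against six in UTF-16.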