What is Unicode?

Internationalization and localization have been long-time interests of mine, and I’ve had the privilege of delving deep into them on a large project over the last year. I submitted the following comment on the Real-World Haskell site in the hope that it would be of value to future readers.

In response to Phil’s question, and for general information for those not familiar with Unicode:

Unicode consists of “code points”: numeric values that each represent a character or meta-character. This is purely an abstraction that dictates no specific digital representation.
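As a small illustration of code points as plain numbers, here is a sketch in Python (my choice of language for the examples; the sample characters are arbitrary). The built-in `ord` and `chr` functions convert between a one-character string and its code point, with no encoding involved.

```python
# Code points are just integers; no byte representation is implied here.
# The sample characters are arbitrary: A, e-acute, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# ord() gives the code point of a one-character string.
code_points = [ord(ch) for ch in samples]

# The conventional notation is "U+" followed by at least four hex digits.
labels = [f"U+{cp:04X}" for cp in code_points]
print(labels)  # ['U+0041', 'U+00E9', 'U+20AC', 'U+1F600']

# chr() goes the other way, from code point back to character.
round_trip = [chr(cp) for cp in code_points]
print(round_trip == samples)  # True
```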

How those numeric values are encoded and stored digitally is a separate issue. The three main encoding systems follow.

UTF-32 uses 32-bit “characters” to store each code point. A pro is that each “character” corresponds exactly to one code point. A con is that for English text this quadruples the storage required compared with ASCII, filling memory with a lot of zeroes.
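A quick way to see the storage cost is to encode a short English string both ways. This is only a sketch: the string "Hello" is an arbitrary example, and the `utf-32-be` codec is used so that no byte order mark is added to the count.

```python
text = "Hello"  # arbitrary English sample

ascii_bytes = text.encode("ascii")
utf32_bytes = text.encode("utf-32-be")  # big-endian, no byte order mark

print(len(ascii_bytes))  # 5
print(len(utf32_bytes))  # 20, four bytes per code point

# Three of every four bytes are zero for ASCII-range text.
zero_count = utf32_bytes.count(0)
print(zero_count)  # 15
```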

UTF-16 uses 16-bit “characters” to encode code points. Each code point takes either one or two “characters”, with the most common values requiring just one. A pro is that it’s more space-efficient than UTF-32 for most text. A con is that it’s a variable-length encoding, so a “character” no longer corresponds one-to-one to a code point.
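The one-or-two behaviour can be observed by splitting encoded text into 16-bit units. The helper name `utf16_units` below is made up for this sketch, and the two sample characters are arbitrary: one that fits in a single unit and one that needs a pair.

```python
import struct

def utf16_units(text):
    """Return the 16-bit units of text in UTF-16 (hypothetical helper)."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return list(struct.unpack(f">{len(data) // 2}H", data))

# The euro sign (U+20AC) fits in a single 16-bit unit.
single = utf16_units("\u20ac")
print([hex(u) for u in single])  # ['0x20ac']

# An emoji (U+1F600) is above U+FFFF, so it takes two units.
pair = utf16_units("\U0001F600")
print([hex(u) for u in pair])  # ['0xd83d', '0xde00']
```

Note that Python's own `len("\U0001F600")` is 1, because Python strings count code points rather than 16-bit units.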

UTF-8 uses 8-bit “characters” (bytes) to encode code points. It’s designed to be a superset of ASCII, and requires between one and four bytes per code point. A pro is that plain ASCII is already valid UTF-8; no conversion is required. A con is that it’s a variable-length encoding, and it is less efficient than UTF-16 for Asian languages (requiring three bytes rather than two for most Asian text).
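These properties are easy to check with a short sketch. The sample characters and the Japanese word used here are arbitrary choices, picked so that each UTF-8 length from one to four bytes shows up once.

```python
# One sample per UTF-8 length: A, e-acute, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)  # [1, 2, 3, 4]

# Plain ASCII text yields identical bytes under both encodings.
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True

# Three Japanese characters ("nihongo"): 3 bytes each in UTF-8, 2 in UTF-16.
cjk = "\u65e5\u672c\u8a9e"
utf8_size = len(cjk.encode("utf-8"))
utf16_size = len(cjk.encode("utf-16-be"))  # no byte order mark
print(utf8_size, utf16_size)  # 9 6
```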
