A Gentle Introduction to Unicode

Scott Atwood, <satwood@paypal.com>

Overview

What is Unicode?

Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

Why Unicode?

Terminology

Character
Abstract units of a writing system
Glyph
The graphical representation of a character
Character Set
A mapping between abstract characters and abstract numbers
Encoding
A concrete representation of the numbers in a character set
Script
A collection of written symbols used together
Code Point
A numeric value in a character set's range, usually assigned to a single character (written like U+0041 in Unicode)
Code Unit
The smallest fixed-size unit of an encoding (8 bits in UTF-8, 16 bits in UTF-16); one encoded character may span several code units
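
To make the difference between code points and code units concrete, here is a small Python sketch. The sample string is an arbitrary choice for illustration, and the variable names are hypothetical. It counts the same text three ways, using only the built-in `str.encode`.

```python
# Sample text (chosen for illustration): "e" followed by a combining
# acute accent, then one character outside the Basic Multilingual Plane.
text = "e\u0301\U0001F600"

# Code points: the abstract numbers, one per element of a Python str.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# Code units: the fixed-size pieces each encoding actually stores.
utf8_units = len(text.encode("utf-8"))            # 8-bit units
utf16_units = len(text.encode("utf-16-be")) // 2  # 16-bit units
utf32_units = len(text.encode("utf-32-be")) // 4  # 32-bit units

print(code_points)
print(utf8_units, utf16_units, utf32_units)
```

A reader would likely see two glyphs here (an accented e and an emoji), yet the string holds three code points, and the number of code units depends on the encoding.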

Character Sets and Encodings

UTF-8

Unicode Range        UTF-8 Encoded Bytes
U+0000-U+007F        0xxxxxxx
U+0080-U+07FF        110xxxxx 10xxxxxx
U+0800-U+FFFF        1110xxxx 10xxxxxx 10xxxxxx
U+10000-U+10FFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
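
The bit layout in the table above can be sketched as a hand-written encoder. This is a minimal illustration, not a replacement for a real codec; the function name `utf8_encode` is hypothetical, and the rejection of surrogate code points is an addition the table does not show (well-formed UTF-8 excludes U+D800-U+DFFF).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, one branch per row of the table."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    # Assumption beyond the table: surrogates are not valid in UTF-8.
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")

    if code_point <= 0x7F:        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    return bytes([                # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


# Compare with Python's built-in codec for the euro sign, U+20AC.
print(utf8_encode(0x20AC).hex(), "\u20ac".encode("utf-8").hex())
```

Each leading byte announces how many bytes follow, and every continuation byte starts with `10`, which is what lets a decoder resynchronize in the middle of a stream.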

UTF-16

Unicode Range        Scalar Value                  UTF-16 Encoded Data
U+0000-U+D7FF        xxxxxxxxxxxxxxxx              xxxxxxxxxxxxxxxx
U+D800-U+DFFF        N/A                           N/A
U+E000-U+FFFF        xxxxxxxxxxxxxxxx              xxxxxxxxxxxxxxxx
U+10000-U+10FFFF     000uuuuu xxxxxxxx xxxxxxxx    110110wwwwxxxxxx 110111xxxxxxxxxx
where wwww = uuuuu - 1
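
The last row of the table, the surrogate pair, is the part that tends to confuse. A minimal Python sketch of it follows; the function name `utf16_encode` is hypothetical, and the result is a list of 16-bit code units with no byte order applied.

```python
from typing import List


def utf16_encode(code_point: int) -> List[int]:
    """Return the UTF-16 code units (as integers) for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not scalar values")

    if code_point <= 0xFFFF:
        # One code unit, identical to the scalar value.
        return [code_point]

    uuuuu = code_point >> 16  # top five bits of the scalar value
    wwww = uuuuu - 1          # fits in four bits because uuuuu <= 0x10
    high = 0xD800 | (wwww << 6) | ((code_point >> 10) & 0x3F)  # 110110wwwwxxxxxx
    low = 0xDC00 | (code_point & 0x3FF)                        # 110111xxxxxxxxxx
    return [high, low]


# U+1F600 needs a surrogate pair.
print([f"{unit:04X}" for unit in utf16_encode(0x1F600)])
```

Subtracting 1 from the five-bit `uuuuu` is the same as subtracting 0x10000 from the whole scalar value, which is how the calculation is often described elsewhere.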

UTF-32

Normalization

Script Complexities

Unicode supports scripts which may violate your assumptions. For example:

Unicode in C/C++

Mojibake

Q & A

Any questions?