UTF-8

UTF-8 is a variable-length encoding of Unicode. Characters 0 through 127, that is, the ASCII character set, are encoded in one byte each, exactly as they would be in ASCII. In ASCII, the byte with value 65 represents the letter A. In UTF-8, the byte with the value 65 also represents the letter A. There is a one-to-one identity mapping from ASCII characters to UTF-8 bytes. Thus, pure ASCII files are also acceptable UTF-8 files.
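The ASCII compatibility described above is easy to check for yourself. The following is a minimal sketch in Python; the sample string and variable names are illustrative assumptions, not anything taken from a specification.

```python
# Encode the same ASCII-only text as ASCII and as UTF-8, then compare the bytes.
text = "A plain ASCII sentence."  # illustrative sample, chosen arbitrarily

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# The first byte is the letter A in both encodings.
print(ascii_bytes[0], utf8_bytes[0])

# The two byte sequences are identical, so the ASCII file is also a UTF-8 file.
print(ascii_bytes == utf8_bytes)
```

Because the two byte sequences are the same, a tool that expects UTF-8 can read a pure ASCII file without any conversion step.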

UTF-8 represents the characters from 128 to 2,047, a range that covers the most common non-ideographic scripts, in two bytes each. Characters from 2,048 to 65,535, mostly from Chinese, Japanese, and Korean, are represented in three bytes each. Characters with code points above 65,535 are represented in four bytes each. For a file that’s mostly Latin text, this effectively halves the file size from what it would be in UCS-2, because each ASCII character takes one byte instead of two. However, for a file that’s primarily Japanese, Chinese, Korean, or one of the languages of the Indian subcontinent, the file size can grow by 50%, because each character takes three bytes instead of two. For most other living languages, whose characters take two bytes in either encoding, the file size is close to the same as it would be in UCS-2.
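The byte counts and size comparisons above can be sketched in a few lines of Python. The helper names `utf8_length` and `sizes` and the sample strings are hypothetical, introduced only for illustration. Python has no codec named UCS-2, so this sketch assumes that `utf-16-be` (which writes no byte-order mark) is an acceptable stand-in; the two agree for text that contains no characters above 65,535.

```python
def utf8_length(code_point: int) -> int:
    """Bytes UTF-8 uses for one code point, following the ranges in the text."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 127:
        return 1
    if code_point <= 2047:
        return 2
    if code_point <= 65535:
        return 3
    return 4


def sizes(text: str) -> tuple[int, int]:
    """Return (size in UTF-8, size at two bytes per character) in bytes."""
    # Assumption: utf-16-be stands in for UCS-2, valid only for text
    # with no characters above 65,535.
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


# Cross-check the range table against Python's own UTF-8 encoder.
for ch in ["A", "é", "я", "日", "😀"]:
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} takes {utf8_length(ord(ch))} byte(s)")

latin = "Hello, world"       # 12 ASCII characters
japanese = "こんにちは世界"    # 7 characters, each in the three-byte range

print(sizes(latin))     # UTF-8 is half the two-byte size
print(sizes(japanese))  # UTF-8 is one and a half times the two-byte size
```

Real documents mix ranges (markup, spaces, and digits are ASCII even in Japanese text), so measured ratios usually fall somewhere between these two extremes.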

UTF-8 is probably the most broadly supported encoding of Unicode. For instance, it’s how Java .class files store strings (in a slightly modified form), it’s the native encoding of the BeOS, and it’s the default encoding an XML processor assumes unless told otherwise by a byte-order mark or an encoding declaration. Chances are pretty good that if a program tells you it’s saving Unicode, it’s really saving UTF-8.
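The XML default mentioned above can be observed with any conforming parser. As a rough sketch, the following uses the `xml.etree.ElementTree` module from the Python standard library; the element name and text are made-up sample data.

```python
import xml.etree.ElementTree as ET

# A document with no byte-order mark and no encoding declaration,
# stored as UTF-8 bytes. The é occupies two bytes.
doc_bytes = "<greeting>café</greeting>".encode("utf-8")

# With nothing to say otherwise, the parser treats the bytes as UTF-8.
root = ET.fromstring(doc_bytes)
print(root.text)
```

If the same text were saved in a different encoding such as Latin-1, the document would need an encoding declaration for the parser to read it correctly.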
