16.11. Reading or Writing Unicode Characters
Problem
You want to read Unicode-encoded characters from a file, database, or form, or you want to write Unicode-encoded characters.
Solution
Use utf8_encode( ) to convert single-byte ISO-8859-1 encoded characters to UTF-8:
print utf8_encode('Kurt Gödel is swell.');
Use utf8_decode( ) to convert UTF-8 encoded characters to single-byte ISO-8859-1 encoded characters:
print utf8_decode("Kurt G\xc3\xb6del is swell.");
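Both functions are deprecated as of PHP 8.2. The following is a minimal sketch of the same two conversions using mb_convert_encoding( ), assuming the mbstring extension is loaded. The variable names are illustrative, and the input is written with byte escapes so the result does not depend on the encoding the script file itself is saved in:

```php
<?php
// ISO-8859-1 input: the single byte 0xF6 is "ö" in that encoding.
$latin1 = "Kurt G\xf6del is swell.";

// ISO-8859-1 -> UTF-8, the conversion utf8_encode() performs.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// UTF-8 -> ISO-8859-1, the conversion utf8_decode() performs
// for characters that exist in ISO-8859-1.
$roundTrip = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// "Kurt G" is six bytes long, so the encoded "ö" starts at offset 6.
print bin2hex(substr($utf8, 6, 2)) . "\n"; // c3b6
print ($roundTrip === $latin1 ? "round trip ok" : "mismatch") . "\n";
```

The hex output shows the one-byte ISO-8859-1 character becoming the two-byte UTF-8 sequence used in the utf8_decode( ) example above.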
Discussion
A single byte can hold one of 256 values, so a single-byte encoding can represent at most 256 characters. Codes 0 through 127 are standardized as ASCII: control characters, letters and numbers, and punctuation. The characters that codes 128 through 255 map to, however, depend on the encoding. One such encoding is ISO-8859-1, which includes the characters necessary for writing most Western European languages, such as the ö in Gödel or the ñ in pestaña. Many languages, though, require more than 256 characters, and a character set that can express more than one language requires even more. This is where Unicode saves the day; its UTF-8 encoding can represent more than a million characters.
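To illustrate how the meaning of codes 128 through 255 depends on the encoding, here is a small sketch that reads the same byte under two single-byte encodings. ISO-8859-15 is not discussed in this recipe; it is chosen here only as a second encoding for comparison, and the example assumes the mbstring extension is available:

```php
<?php
// The same single byte, 0xA4, read under two different single-byte encodings.
$byte = "\xa4";

// In ISO-8859-1, 0xA4 is the generic currency sign (U+00A4).
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

// In ISO-8859-15, 0xA4 is the euro sign (U+20AC).
$asLatin9 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-15');

print bin2hex($asLatin1) . "\n"; // c2a4
print bin2hex($asLatin9) . "\n"; // e282ac
```

Converting both readings to UTF-8 yields different byte sequences because each source encoding maps 0xA4 to a different Unicode character.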
This increased functionality comes at the cost of space. ASCII characters are stored in just one byte; UTF-8 encoded characters need up to four bytes. Table 16-2 shows the byte representations of UTF-8 encoded characters.
Table 16-2. UTF-8 byte representation
Character code range | Bytes used | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|
 | 1 | | | | |
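The space cost described above can be checked directly. This sketch, with sample characters chosen only for illustration, measures one UTF-8 encoded character from each length class. The characters are written as byte escapes so the script does not depend on the encoding of the file it is saved in:

```php
<?php
// One sample character per UTF-8 length class, keyed by Unicode code point.
$samples = [
    'U+0041 (A)'           => "A",
    'U+00F6 (o-umlaut)'    => "\xc3\xb6",
    'U+20AC (euro sign)'   => "\xe2\x82\xac",
    'U+1D11E (G clef)'     => "\xf0\x9d\x84\x9e",
];

foreach ($samples as $label => $char) {
    // strlen() counts bytes, not characters.
    printf("%s: %d byte(s), hex %s\n", $label, strlen($char), bin2hex($char));
}
```

Because strlen( ) counts bytes, it reports 1, 2, 3, and 4 for these four strings even though each holds a single character.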