Unicode
Java uses the Unicode character encoding. (Java 1.3 uses Unicode Version 2.1. Support for Unicode 3.0 will be included in Java 1.4 or another future release.) Unicode is a 16-bit character encoding established by the Unicode Consortium, which describes the standard as follows (see http://unicode.org):
The Unicode Standard defines codes for characters used in the major languages written today. Scripts include the European alphabetic scripts, Middle Eastern right-to-left scripts, and scripts of Asia. The Unicode Standard also includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, etc. ... In all, the Unicode Standard provides codes for 49,194 characters from the world’s alphabets, ideograph sets, and symbol collections.
In the canonical form of Unicode encoding, which is what the Java char and String types use, every character occupies two bytes. The Unicode characters \u0020 to \u007E are equivalent to the ASCII and ISO8859-1 (Latin-1) characters 0x20 through 0x7E. The Unicode characters \u00A0 to \u00FF are identical to the ISO8859-1 characters 0xA0 to 0xFF. Thus, there is a trivial mapping between Latin-1 and Unicode characters. A number of other portions of the Unicode encoding are based on preexisting standards, such as ISO8859-5 (Cyrillic) and ISO8859-8 (Hebrew), though the mappings between these standards and Unicode may not be as trivial as the Latin-1 mapping.
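As a minimal sketch of this correspondence (the class name UnicodeDemo is our own, chosen for illustration), the following program shows that the numeric value of a Java char written with a \uXXXX escape equals the corresponding Latin-1 code point:

    public class UnicodeDemo {
        public static void main(String[] args) {
            // A Java char holds a 16-bit Unicode value; literals may
            // use the \uXXXX escape notation.
            char a = '\u0041';       // U+0041 == ASCII 0x41 == 'A'
            char eAcute = '\u00E9';  // U+00E9 == Latin-1 0xE9 == 'é'

            // The char's numeric value is the Latin-1 code point.
            System.out.println((int) a == 0x41);       // prints: true
            System.out.println((int) eAcute == 0xE9);  // prints: true

            // A String is a sequence of these two-byte chars.
            String s = "caf\u00E9";
            System.out.println(s.length());            // prints: 4
        }
    }

Because the low range of Unicode coincides with Latin-1, converting a Latin-1 byte to a Java char requires no table lookup at all; the byte value simply becomes the character's code.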
Note that Unicode support may be limited on many platforms. One of ...