Unicode
Plain strings are converted into Unicode strings either explicitly, with the unicode built-in, or implicitly, when you pass a plain string to a function that expects Unicode. In either case, the conversion is done by an auxiliary object known as a codec (for coder-decoder). A codec can also convert Unicode strings to plain strings, either explicitly, with the encode method of Unicode strings, or implicitly.
To identify a codec, pass the codec name to unicode or encode. When you pass no codec name, and for implicit conversion, Python uses a default encoding, normally 'ascii'. You can change the default encoding in the startup phase of a Python program, as covered in The site and sitecustomize Modules; see also setdefaultencoding in The sys Module. However, such a change is not a good idea for most “serious” Python code: it might too easily interfere with code in the standard Python libraries or third-party modules, written to expect the normal 'ascii'.
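As a minimal sketch of explicit conversion with a named codec, the snippet below round-trips a short string. It is written with the decode and encode methods and with b'...' and u'...' literals so that it runs on both Python 2 and Python 3; in Python 2, unicode(raw, 'utf-8') is equivalent to raw.decode('utf-8'). The variable names and sample data are illustrative only.

```python
# Plain (byte) string holding the UTF-8 encoding of "cafe" with an accented e.
raw = b'caf\xc3\xa9'

# Plain string -> Unicode string, naming the codec explicitly.
text = raw.decode('utf-8')

# Unicode string -> plain string, using a different codec.
latin = text.encode('latin-1')

# With the 'ascii' codec (the normal default in Python 2), the same
# bytes cannot be decoded, because they fall outside the ASCII range.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

Naming the codec at each conversion, as here, avoids any dependence on the default encoding.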
Every conversion has a parameter errors, a string specifying how conversion errors are to be handled. The default is 'strict', meaning any error raises an exception. When errors is 'replace', the conversion replaces each character that causes an error with '?' in a plain-string result and with u'\ufffd' in a Unicode result. When errors is 'ignore', the conversion silently skips characters that cause errors. When errors is 'xmlcharrefreplace', the conversion replaces each character that causes an error with the XML character reference ...
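The effect of each errors value can be seen by encoding one Unicode string with the 'ascii' codec. This is a small sketch with made-up sample data, again using method calls and literals that work on both Python 2 and Python 3; the errors argument is passed positionally after the codec name.

```python
text = u'caf\xe9'   # Unicode string with one non-ASCII character

# 'strict' (the default) raises an exception on the first error.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'replace' substitutes '?' in a plain-string result.
replaced = text.encode('ascii', 'replace')

# 'ignore' silently drops the offending character.
ignored = text.encode('ascii', 'ignore')

# 'xmlcharrefreplace' substitutes a numeric XML character reference.
xml_refs = text.encode('ascii', 'xmlcharrefreplace')

# In the other direction, 'replace' yields U+FFFD in the Unicode result.
decoded = b'caf\xe9'.decode('ascii', 'replace')
```

Here replaced is 'caf?', ignored is 'caf', xml_refs is 'caf&#233;', and decoded ends with u'\ufffd'.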