(An Unofficial) Python FAQ Wiki

putting the community back in "maintained by the community"

What does 'UnicodeError: ASCII [decoding,encoding] error: ordinal not in range(128)' mean?

This error means that Python tried to convert between a byte string and a Unicode string using its default encoding, 7-bit ASCII, and found a byte or character outside that range. There are a couple of ways to fix or work around the problem.
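As a minimal sketch of what triggers the message, the snippet below forces the conversion in both directions with the ASCII codec named explicitly. The sample values and the helper name describe_failure are made up for illustration. It uses the .decode() and .encode() methods rather than unicode(), so it should run on both Python 2.6+ and Python 3.

```python
# Hypothetical sample data: the word "cafe" with an acute accent on the e,
# once as UTF-8 encoded bytes and once as Unicode text.
raw = b"caf\xc3\xa9"
text = u"caf\xe9"


def describe_failure(func, *args):
    """Call func(*args); return the UnicodeError message, or None on success."""
    try:
        func(*args)
    except UnicodeError as exc:
        return str(exc)
    return None


# Decoding a byte >= 0x80 with the ASCII codec fails ...
decode_error = describe_failure(raw.decode, "ascii")
# ... and so does encoding a character above U+007F with it.
encode_error = describe_failure(text.encode, "ascii")

print(decode_error)
print(encode_error)
```

Both messages end in "ordinal not in range(128)", which is the part of the error quoted in the question.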

If your programs must handle data in arbitrary character set encodings, the environment the application runs in will generally identify the encoding of the data it is handing you. You need to convert the input to Unicode data using that encoding. For example, a program that handles email or web input will typically find character set encoding information in Content-Type headers. This can then be used to properly convert input data to Unicode. Assuming the string referred to by value is encoded as UTF-8:

value = unicode(value, "utf-8")

will return a Unicode object. If the data is not correctly encoded as UTF-8, the above call will raise a UnicodeError exception.
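For the email or web case, the encoding name can be pulled out of the Content-Type header before decoding. The following is only a sketch under a few assumptions: the function name decode_with_declared_charset and its fallback parameter are hypothetical, the header value is assumed to be available as a plain string, and .decode(charset) is used as the method form of unicode(value, charset).

```python
from email.message import Message


def decode_with_declared_charset(raw, content_type, fallback="utf-8"):
    """Decode raw bytes using the charset named in a Content-Type value.

    `fallback` is an assumption of this sketch; it is used only when the
    header declares no charset at all.
    """
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset() or fallback
    # Raises UnicodeError if the bytes are not valid in that encoding,
    # and LookupError if the charset name is not known to Python.
    return raw.decode(charset)


# Hypothetical input: Latin-1 bytes labelled as such by the sender.
body = decode_with_declared_charset(b"caf\xe9", "text/plain; charset=iso-8859-1")
```

If the declared charset does not match the data, the UnicodeError still surfaces here, at the boundary where the input arrives, rather than somewhere deeper in the program.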

If you want to convert only those strings that contain non-ASCII data, and leave pure ASCII strings as they are, you can first try decoding with the ASCII codec and create a Unicode object only if that fails:

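try:
    # check only; the decoded result is discarded
    unicode(value, "ascii")
except UnicodeError:
    # non-ASCII data: decode it as UTF-8
    value = unicode(value, "utf-8")
else:
    # value was valid ASCII data; leave it as a byte string
    pass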
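The same check can be wrapped in a small helper so it is applied consistently. This is a sketch only: the name to_unicode_if_needed is hypothetical, the UTF-8 default is an assumption carried over from the example above, and .decode() is used as the method form of unicode().

```python
def to_unicode_if_needed(value, encoding="utf-8"):
    """Return value unchanged if it is pure ASCII, else decode it.

    `encoding` defaults to UTF-8 purely as an assumption of this sketch.
    """
    try:
        # Check only; the decoded result is thrown away.
        value.decode("ascii")
    except UnicodeError:
        # Non-ASCII bytes: produce a Unicode object instead.
        return value.decode(encoding)
    # Pure ASCII: hand back the original byte string.
    return value


ascii_result = to_unicode_if_needed(b"plain")
accented_result = to_unicode_if_needed(b"caf\xc3\xa9")
```

Note that the return type then depends on the data: byte strings for pure ASCII input, Unicode objects otherwise. Callers have to be prepared for both.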

It's possible to change the default encoding by calling sys.setdefaultencoding() in a file called sitecustomize.py, which Python imports automatically at startup if it finds one on the module search path. However, this isn't recommended because changing the Python-wide default encoding may cause third-party extension modules to fail.

Note that on Windows, there is an encoding known as "mbcs", which uses an encoding specific to your current locale. In many cases, and particularly when working with COM, this may be an appropriate default encoding to use.

CATEGORY: programming

Comments

Given that ASCII is a subset of UTF-8, and that unicode(b, "ascii") returns a Unicode object, what's the point of trying to decode from ASCII first?

The last paragraph could be interpreted as a recommendation to change the default encoding if you're doing lots of COM work. Maybe the text could be tweaked to make it clear that "appropriate default encoding" doesn't refer to sitecustomize...

"You need to convert the input to Unicode data using that encoding."

Some text on the recommended "decode on the way in, encode on the way out" pattern would be nice.
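A rough sketch of that pattern, in case it helps: bytes are decoded once at the point where they enter the program, all internal work happens on Unicode objects, and the result is encoded once when it leaves. The function names shout and process and the choice of encodings are made up for illustration.

```python
def shout(text):
    """Stand-in for the program's real work; it sees Unicode only."""
    return text.upper()


def process(raw_in, in_encoding="utf-8", out_encoding="utf-8"):
    """Decode at the input boundary, encode at the output boundary."""
    text = raw_in.decode(in_encoding)    # decode on the way in
    result = shout(text)                 # work with Unicode internally
    return result.encode(out_encoding)   # encode on the way out


# Hypothetical input: UTF-8 bytes in, Latin-1 bytes out.
converted = process(b"caf\xc3\xa9", "utf-8", "iso-8859-1")
```

Because the core logic never touches byte strings, no implicit ASCII conversion happens inside it, so the error discussed above can only occur at the two explicit conversion points.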