
Charset & Encoding in Python

Background

Character set and encoding system are different concepts, but they are often confused with each other.

A character set is just a standardized set of characters or symbols. For example, the English alphabet "a" to "z" can be a character set. An encoding system, on the other hand, is a standardized way to transform a sequence of characters (of a given character set) into a sequence of 0s and 1s.

Character set

ASCII

One of the simplest standardized character sets is ASCII, which contains 128 symbols. It includes all the letters, digits, and punctuation you see on a PC keyboard. ASCII is designed for languages that use the Latin alphabet only; it cannot be used for Chinese characters.

Unicode

Unicode's character set includes ALL written symbols of human languages. It covers the tens of thousands of Chinese characters, math symbols, and the characters of other languages. In fact, ASCII is a subset of Unicode.
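The subset relationship can be checked directly. A quick sketch (the calls below behave the same in Python 2 and 3; ord() returns a character's code point):

```python
# The first 128 Unicode code points coincide with ASCII.
assert ord("A") == 65  # same value as the ASCII code for "A"
# For these characters, ASCII and UTF-8 even produce the same single byte:
assert "A".encode("ascii") == "A".encode("utf-8")
```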

Encoding System

What's Character Encoding?

Any text has to go through encoding/decoding in order to be properly stored in a file or displayed on screen. Suppose your language is Chinese: your computer needs a way to translate the character set of your language's writing system into a sequence of 1s and 0s. This transformation is called character encoding.

An encoding system implicitly defines a character set, because it needs to define which characters it is designed to handle.
In the early days of computing, these two concepts were not clearly distinguished, and were simply called a charset or an encoding system. For example, ASCII does not really separate the concepts, since it is very simple, dealing with only 128 characters (including invisible "control characters"). Another example: HTML has the declaration <meta charset="...">; the syntax contains the word "charset", but it is actually about encoding, not the character set.
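The distinction can be shown concretely: one and the same character (an element of the character set) maps to different byte sequences under different encoding systems. A Python 3 sketch:

```python
# Same character, two encoding systems: the charset element is identical,
# but the encoded byte sequences differ.
ch = "汉"                  # a single Unicode character (Python 3 str)
print(ch.encode("utf-8"))  # b'\xe6\xb1\x89' -- three bytes under UTF-8
print(ch.encode("gbk"))    # b'\xba\xba'     -- two bytes under GBK
```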

Unicode's Encoding System

Unicode defines several encoding systems. Each character in Unicode is given a unique ID. This ID is an integer, called the character's code point. UTF-8 is the encoding that is most widely used across different fields.
As the following table shows, the code point of the Chinese character 汉 is 0000 6C49, which matches the pattern in the third row. After encoding with UTF-8, the byte string is E6 B1 89.

code point of "汉"   --->  0000 6C49

range of code point |     pattern of utf-8 encoding system
         HEX        |             BIN
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

        6    C    4    9
     0110 1100 0100 1001 
 --------------------------
     0110   110001   001001
 1110XXXX 10XXXXXX 10XXXXXX 
 11100110 10110001 10001001 
    E   6    B   1    8   9
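The bit shuffling above can be reproduced in code. This Python 3 sketch implements only the third row of the table (the three-byte pattern), not a general UTF-8 encoder:

```python
# Build the 3-byte UTF-8 form of U+6C49 by hand, following the
# 1110xxxx 10xxxxxx 10xxxxxx pattern from the table above.
cp = 0x6C49
b1 = 0b11100000 | (cp >> 12)           # top 4 bits of the code point
b2 = 0b10000000 | ((cp >> 6) & 0x3F)   # middle 6 bits
b3 = 0b10000000 | (cp & 0x3F)          # low 6 bits
manual = bytes([b1, b2, b3])
print(manual.hex())                    # e6b189
assert manual == "汉".encode("utf-8")  # matches the built-in encoder
```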

The result of the above procedure can be verified in the Python interpreter:

>>> c = u"汉"
>>> c
u'\u6c49'
>>> c.encode("utf-8")
'\xe6\xb1\x89'

Encoding in Python

In Python 3.x, str is Unicode by default. The encoding issues discussed below mainly appear in Python 2.x.
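In Python 3 the split is explicit: str holds Unicode text, bytes holds encoded data, and mixing them raises an error instead of silently decoding with ASCII. A brief sketch:

```python
text = "汉"                  # str: Unicode text
data = text.encode("utf-8")  # bytes: b'\xe6\xb1\x89'
assert isinstance(text, str) and isinstance(data, bytes)
assert data.decode("utf-8") == text
# Unlike Python 2, str and bytes never compare equal:
assert text != data
```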

Encoding of source file

If there are special characters (Chinese, uncommon symbols, or other non-ASCII characters) in a Python source file, a declaration of the encoding system should be placed at the head of the file. Otherwise, the interpreter does not know which encoding system to use for the unexpected characters, because its default source encoding is ASCII.

# coding:utf8
# coding=utf-8
# -*- coding:utf-8 -*-      recommended, better looking

The declaration of the encoding system conventionally follows the shebang line, and must contain the keyword "coding" and the name of an encoding system (utf-8, gbk, ...).

Note: the encoding system in the declaration should be consistent with the actual encoding of the source file. Generally, a common IDE will take care of this, but users of plain text editors should pay extra attention.

Encoding in console

There is a chance that print will raise the exception below.

>>> var = u"中国"
>>> print var 
Traceback (most recent call last):
  File "", line 1, in 
UnicodeEncodeError: 'ascii' codec can't encode characters in position 835-844: ordinal not in range(128)

Such a case usually stems from a wrong system environment variable or console configuration. Presumably, print performs some encoding conversion according to these environment parameters.

You could put the following statements in your .bash_profile:

export LANG="en_US.UTF-8"
export LC_ALL=en_US.UTF-8 # For OS X

If you do not want to change the system settings, you could also convert the encoding explicitly in your code.

# e.g. gb2312 is the encoding system of the console
if not isinstance(var, unicode): 
    print var.decode("utf-8").encode("gb2312")

Encoding inside the source code

Both the str and unicode datatypes are subclasses of basestring. Actually, str is just a byte sequence, encoded from Unicode characters. String literals that contain Unicode characters must be prefixed with u.
The str and unicode datatypes can be converted to each other through str.decode() and unicode.encode().

#-*- coding: utf-8 -*-

u_str = u"I ♥ U"                       # unicode string, start with “u” or “ur”
print(u_str)                           # I ♥ U

u_str = u"汉"                          # len(u_str) == 1 '\u6c49'
utf8_str = "汉"                        # len(utf8_str) == 3  '\xe6\xb1\x89'

u_str = utf8_str.decode("utf-8")       # utf-8 ---> unicode
utf8_str = u_str.encode("utf-8")       # unicode ---> utf-8

u_str = unicode(utf8_str, "utf-8")     # utf-8 ---> unicode
utf8_str = str(u_str)                  # unicode ---> utf-8, only if sys.setdefaultencoding("utf-8") was called

u_str = u_str.decode("utf-8")          # meaningless; may raise an exception
utf8_str = utf8_str.encode("utf-8")    # meaningless; may raise an exception
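For comparison, Python 3 removes the two meaningless directions outright. A sketch:

```python
# Python 3: str has no .decode() and bytes has no .encode(),
# so the confusing round-trips above cannot even be written.
u_str = "汉"
utf8_bytes = u_str.encode("utf-8")
assert not hasattr(u_str, "decode")       # str -> no decode method
assert not hasattr(utf8_bytes, "encode")  # bytes -> no encode method
assert utf8_bytes.decode("utf-8") == u_str
```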

Encoding in I/O

When reading from a file or any other object that returns a byte sequence, an encoding system should be chosen for decoding. On the other hand, before writing or sending a byte sequence to a target (function, file, or socket), the encoding step should also be considered.

#-*- coding: utf-8 -*-
import codecs  # recommend using codecs module

with codecs.open("test.txt", encoding="utf-8") as f:
    u = f.read()
    print type(u) # <type 'unicode'>

with codecs.open("test.txt", "a", encoding="utf-8") as f:
    u = u"汉"
    f.write(u) # write unicode into file

with codecs.open("test.txt", "a", encoding="utf-8") as f:
    s = u"汉".encode("gbk")
    f.write(s.decode("gbk"))  # decode the gbk bytes to unicode first; codecs then encodes to utf-8

json.loads(json_str, encoding="utf-8") # load a json string while specifying the encoding system
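In Python 3, json.loads no longer takes an encoding argument; it accepts str directly, and (since 3.6) UTF-8 encoded bytes as well. A sketch, using a small literal document for illustration:

```python
import json

raw = '{"name": "汉"}'.encode("utf-8")  # UTF-8 encoded bytes
obj = json.loads(raw)                   # bytes accepted since Python 3.6
assert obj["name"] == "汉"
obj2 = json.loads(raw.decode("utf-8"))  # or decode explicitly first
assert obj2 == obj
```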
