We all know how a Python script gets run: the Python interpreter reads the source file, compiles it into bytecode, and runs it. Mostly we write code in ASCII, but what if we need a string containing unicode characters, like s = "你好哇"? The default encoding of source code is ASCII, and the Python interpreter will throw an error since it doesn't know how to parse it. That's why the magic comment # coding=utf-8 shows up on the first line: it tells the interpreter that the source code is encoded in UTF-8, so please parse it in that encoding. PEP 263 explains it very well, and a magic comment like # -*- coding: utf-8 -*- also works.

You may be full of questions now, e.g., what is an encoding? What are unicode and UTF-8? It's a long story. Briefly, an encoding is the way a computer maps characters to the bytes in memory; please read this post to learn more. For UTF-8 and unicode, I just want to make it clear that Unicode is a character set standard, and UTF-8 is just one format for encoding Unicode characters as bytes. There are other encodings too, including GBK, which is widely used on the Windows platform. Python programmers in China always get mad dealing with encodings on different platforms, and this post will discuss some practical topics around that.
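To see the difference concretely, the same characters become different byte sequences under different encodings, and decoding with the wrong codec produces gibberish. A quick demo:

# coding=utf-8
u = u"你好"                            # a unicode string
print repr(u.encode("utf-8"))          # '\xe4\xbd\xa0\xe5\xa5\xbd' -- 3 bytes per character
print repr(u.encode("gbk"))            # '\xc4\xe3\xba\xc3' -- 2 bytes per character
print u.encode("utf-8").decode("gbk")  # 浣犲ソ -- wrong codec, gibberish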

Output to the terminal

First, let's introduce the encodings at play in different places:

  1. Source code encoding: as we have discussed, it can be specified with # coding=utf-8.
  2. File system encoding: how a text file gets encoded on the file system.
  3. Terminal encoding: the encoding used by command-line tools and terminal emulators like iTerm or Terminal (you can inspect these from Python, as shown below).
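Here is a quick check of these encodings from inside Python; the exact values vary by platform and terminal:

# coding=utf-8
import sys
import locale
print sys.getdefaultencoding()       # 'ascii' in Python 2
print sys.stdout.encoding            # terminal encoding, e.g. 'UTF-8' or 'cp936'
print locale.getpreferredencoding()  # locale default used for files, e.g. 'UTF-8'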

So, what’s the output of the following code snippet?

s = "你好哇"
print s

It depends. First, to make it understandable to the interpreter, a magic comment needs to be added, e.g., # coding=utf-8. Then, on the Mac platform, it will work as expected and output 你好哇. But on Windows, the terminal will print gibberish like “浣犲ソ鍝” when the system encoding is GBK. That's because a GBK terminal cannot correctly decode bytes that were encoded in UTF-8. So, how to make it correct? You may try to change # coding=utf-8 to # coding=gbk and hope it will work. However, the following error could leave you totally lost:

SyntaxError: 'gbk' codec can't decode bytes in position 13-14:
illegal multibyte sequence

Emm… what's that all about? Here is another encoding concept: the source file encoding. Depending on your text editor, the source file may be saved as UTF-8, ANSI, etc. Here's the story: you told the interpreter that the source file is encoded in GBK with # coding=gbk, but the interpreter actually got a source file encoded in UTF-8, so it failed to parse some characters, e.g., "你好哇". One solution is to save the file in GBK encoding with your favorite editor and keep the magic comment # coding=gbk. Make sure the encoding specified by the magic comment is the same as the source file encoding.

There is another solution to this problem. Change the magic comment to # coding=utf-8 and make sure the file encoding is also UTF-8. We can first get the unicode string of "你好哇" with u = s.decode("utf-8") and then encode it with GBK: s_gbk = u.encode("gbk"):

# coding=utf-8
s = "你好哇"
u = s.decode("utf-8")
s_gbk = u.encode("gbk")
print s_gbk

Also, you can directly print a unicode string to the terminal: the interpreter will encode it with the terminal's encoding, i.e., sys.stdout.encoding, before writing it out.
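For example, the snippet below prints correctly on both a UTF-8 and a GBK terminal, because the encoding is chosen at print time:

# coding=utf-8
u = u"你好哇"  # a unicode string, not bytes
print u        # encoded with sys.stdout.encoding automatically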

Input from the terminal

We can use sys.argv to get the command-line arguments and raw_input to get the user's input. All of these are byte strings in the terminal's encoding, i.e., sys.stdout.encoding, which is UTF-8 on Mac/Linux and cp936 on Windows. You can treat cp936 as the same thing as GBK, though strictly GBK is an extension of it. Here is the point: you need to make sure the encodings are the same if you want to perform string manipulations, e.g., concatenation:

# coding=utf-8
import sys
encoding = sys.stdout.encoding
name = raw_input(u"你叫啥子?".encode(encoding))
print u"你好哇!" + name.decode(encoding)

name.decode(encoding) returns a unicode string, and we can concatenate unicode strings freely. Here I use sys.stdout.encoding to make it portable across platforms.
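To see why the decode step matters: when a byte string and a unicode string are mixed, Python 2 implicitly decodes the bytes with the ASCII codec, which fails on non-ASCII input. A minimal demonstration:

# coding=utf-8
name = "张三"            # a UTF-8 byte string
print u"你好哇!" + name  # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 ...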

Here is a more verbose version of the greeting program:

# coding=utf-8
import sys
encoding = sys.stdout.encoding
name = raw_input(u"你叫啥子?".encode(encoding))
greetings = "你好哇!" + name.decode(encoding).encode('utf-8')
print greetings.decode('utf-8').encode(encoding)

Here, 你好哇! is a byte string encoded in UTF-8, so we need name to be encoded in UTF-8 as well so that the two can be concatenated into a UTF-8 string; then another transcoding step converts it into stdout's encoding for printing.

File IO

File IO is similar to terminal IO; it just happens in a different place. So the key point is still the string encoding versus the file encoding. The read() and write() functions are dumb: they just return or write raw byte strings, and it's the programmer's duty to decide which encoding should be used to encode/decode them. Let's see an example:

# coding=utf-8
s = '你好哇'
with open('hello.txt', 'w') as f:
    f.write(s)
with open('hello.txt') as f:
    print f.read().decode('utf-8')

String s is encoded in UTF-8, and we need to decode it after reading it from the file. On Mac/Linux you will still get the right output if you remove .decode('utf-8'), since the default locale is UTF-8, but it will output gibberish on Windows, where the locale is cp936.
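A common way to avoid the manual encode/decode calls is codecs.open, which wraps the file with a given encoding so you read and write unicode directly; a minimal sketch:

# coding=utf-8
import codecs
with codecs.open('hello.txt', 'w', encoding='utf-8') as f:
    f.write(u'你好哇')  # write unicode; codecs encodes it for you
with codecs.open('hello.txt', encoding='utf-8') as f:
    print f.read()      # read() already returns unicode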

A practical problem

My friend and I crawled some English name data, and one property of each name is its pronunciation in the International Phonetic Alphabet (IPA). For example, the IPA for the name “Sean” is “ʃɔn”, which contains non-ASCII characters. So, if I dump the name data directly, like:

import json

data = {'name': 'Sean', 'ipa': 'ʃɔn'}
with open('data.json', 'w') as f:
    json.dump(data, f)

the ʃɔn gets written in unicode-escape form: {"ipa": "\u0283\u0254n", "name": "Sean"}, which is not human-readable. There is a parameter of dump called ensure_ascii, which defaults to True. As the Python doc puts it:

If ensure_ascii is true (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences, and the result is a str instance consisting of ASCII characters only. If ensure_ascii is false, some chunks written to fp may be unicode instances. This usually happens because the input contains unicode strings or the encoding parameter is used. Unless fp.write() explicitly understands unicode (as in codecs.getwriter()) this is likely to cause an error.

So we can make the output contain the unicode characters as follows:

with open('data-u.json', 'w') as f:
    json.dump(data, f, ensure_ascii=False)

Here’s the output:

{"ipa": "ʃɔn", "name": "Sean"}

json.load() will automatically decode the strings into unicode, so both {"ipa": "\u0283\u0254n", "name": "Sean"} and {"ipa": "ʃɔn", "name": "Sean"} can be parsed successfully.
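A quick check, assuming the data-u.json file written above:

# coding=utf-8
import json
with open('data-u.json') as f:
    data = json.load(f)
print type(data['ipa'])  # <type 'unicode'>
print data['ipa']        # ʃɔn, on a UTF-8 terminal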

Summary

This post introduces some encoding-related topics in Python:

  1. We need to specify the source encoding with a magic comment like # coding=utf-8 and keep the actual encoding of the source file consistent with it, so that the Python interpreter can understand the non-ASCII characters in the source code.
  2. To transcode between two encodings, we first decode the string to unicode and then encode it into the other encoding.
  3. Make sure the encoding of the string and that of the terminal are the same if you want to print it correctly in the terminal.
  4. Strings must be in the same encoding or format (e.g., all UTF-8 or all unicode) before they can be manipulated together.
  5. To make a JSON file human-readable, specify ensure_ascii as False.

More thoughts on it: I tried several ways to organize this post and removed some obscure parts of the Python unicode introduction, but it still reads like a list of knowledge points. Python 2 receives a lot of criticism for its complicated string-unicode transcoding. No doubt it is a defect, and in Python 3 all string literals are unicode. You can also use from __future__ import unicode_literals to make string literals unicode in Python 2. However, some functions may only accept byte strings instead of unicode, so you still need to care about the encoding. Understanding the encoding mechanism is important for learning Python, especially when dealing with multiple languages and internationalization.
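For example, with the future import, a plain literal is already unicode:

# coding=utf-8
from __future__ import unicode_literals
s = "你好哇"   # a unicode literal now, no u prefix needed
print type(s)  # <type 'unicode'>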