On this page: unicode, str, hexadecimals, '\x', '\u', '\U', u'...', hex(), ord(), .encode(), .decode(), codecs module, codecs.open()
Code Point Representation in Python 2.7
In computing, every character is assigned a unique number, called a code point. For example, the capital 'M' has the code point 77. That number can be written in different ways, depending on the base:
Letter | Base-10 (decimal) | Base-2 (binary, in 2 bytes) | Base-16 (hexadecimal, in 2 bytes) |
M | 77 | 00000000 01001101 | 004D |
In Python 2, the str 'M' can also be written in hexadecimal form by escaping the character with '\x'. Hence 'M' and '\x4D' are two spellings of the same string; both are of type str.
|
>>> print '\x4D'
M
>>> '\x4D'
'M'
>>> '\x4D' == 'M'
True
>>> type('\x4D')
<type 'str'>
| |
You can of course look up the code point for any character online, but Python has built-in functions you can use. ord() converts a character to its corresponding ordinal integer, and hex() converts an integer to its hexadecimal representation.
|
>>> ord('M')
77
>>> hex(77)
'0x4d'
>>> hex(ord('o'))
'0x6f'
>>> hex(ord('m'))
'0x6d'
| |
The hexadecimal code point for 'o' is 6F and for 'm' it is 6D, so the string 'Mom' written in hexadecimal is '\x4D\x6F\x6D'. Note that every character needs to be escaped with '\x'.
CAUTION: these hexadecimal strings are still of the str type: they are not Unicode. Unicode strings are always prefixed with u'...', which is explained below.
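As a small illustration of the escaping rule, the sketch below builds the '\x' spelling of an ASCII string from ord(). The helper name to_hex_escapes is hypothetical, not something Python provides, and the code is written so that it also runs unchanged under Python 3.

```python
def to_hex_escapes(text):
    # Spell each character as a literal backslash, 'x', and two hex digits.
    return ''.join('\\x%02X' % ord(ch) for ch in text)

escaped = to_hex_escapes('Mom')
print(escaped)                      # \x4D\x6F\x6D

# The escaped literal and the plain literal are the same three-character string.
same = ('\x4D\x6F\x6D' == 'Mom')
print(same)                         # True
```

The helper only makes sense for characters whose code point fits in two hex digits, which is the case for every ASCII character.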
Unicode vs. str type
In Python 2, Unicode text gets its own type, distinct from str: unicode. A Unicode string literal is always marked with the u'...' prefix. Its characters can be written in three ways: as ordinary 8-bit characters, as 16-bit escapes starting with the lowercase '\u' prefix, or as 32-bit escapes starting with the uppercase '\U' prefix:
Escape sequence | Meaning | Example |
none | Unicode character, 8-bit | u'M' |
\uxxxx | Unicode character with 16-bit hex value xxxx | u'\u004D' |
\Uxxxxxxxx | Unicode character with 32-bit hex value xxxxxxxx | u'\U0000004D' |
First of all, the session below shows that 'M' and u'M' are different objects. The == operator -- which tests equality of value -- returns True, but the is operator -- which tests the identity of objects in memory -- returns False. They may print the same and compare equal, but they are of two different types: the former is a string (str) while the latter is a Unicode string (unicode).
|
>>> print u'M'
M
>>> u'M'
u'M'
>>> u'M' == 'M'
True
>>> u'M' is 'M'
False
>>> type(u'M')
<type 'unicode'>
| |
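The three Unicode spellings, however, all produce the same value. This is not surprising: there is one Unicode standard after all, and the literals are just written differently. That the is operator also returns True here is a detail of how the interpreter reuses small strings; rely on == when comparing strings.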
|
>>> print u'\u004D'
M
>>> u'\u004D'
u'M'
>>> u'\U0000004D'
u'M'
>>> u'M' == u'\U0000004D'
True
>>> u'M' is u'\U0000004D'
True
| |
Below are the Unicode strings 'Mom' and 'Mom and Dad'. Note that each 16-bit Unicode character is escaped, while 8-bit Unicode characters don't need to be. You can mix the two forms within one string:
|
>>> u'\u004D\u006F\u006D'
u'Mom'
>>> u'\u004D\u006F\u006D and Dad'
u'Mom and Dad'
| |
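To check which code points a Unicode string actually contains, one option is to run ord() over its characters. This is only a sketch; the variable names are illustrative.

```python
mom = u'\u004D\u006F\u006D'

# One hexadecimal code point per character.
code_points = [hex(ord(ch)) for ch in mom]
print(code_points)                  # ['0x4d', '0x6f', '0x6d']

# Escaped and unescaped characters can be mixed freely in one literal.
mixed_ok = (u'\u004Dom' == u'Mom' == mom)
print(mixed_ok)                     # True

# ord() also works for characters beyond the 8-bit range.
korean_cp = hex(ord(u'\uaf43'))
print(korean_cp)                    # 0xaf43
```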
Conversion
.encode() and .decode() are the pair of methods used to convert between the unicode and str types. Be careful about the direction: .encode() turns a Unicode string into a regular string, and .decode() works the other way. Many people find this counter-intuitive. Called without an argument, both methods use Python 2's default ASCII encoding, so the calls below succeed only because 'M' is an ASCII character. In addition to the two methods, the type names double as type conversion functions, as usual:
|
>>> u'M'.encode()
'M'
>>> 'M'.decode()
u'M'
>>> unicode('M')
u'M'
>>> str(u'M')
'M'
| |
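The examples above work because 'M' is plain ASCII. For other characters it is safer to name the encoding explicitly. The sketch below uses 'utf-8' and a single Korean character; b'...' is used for the byte string so that the code behaves the same under Python 2, where it is an ordinary str, and Python 3.

```python
text = u'\uaf43'                    # one Unicode character

raw = text.encode('utf-8')          # Unicode -> bytes (a str in Python 2)
print(repr(raw))
print(len(raw))                     # 3: UTF-8 needs three bytes here

back = raw.decode('utf-8')          # bytes -> Unicode
print(back == text)                 # True

# ASCII cannot represent this character, so encoding fails.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)                     # False
```

The same UnicodeEncodeError is what an argument-less .encode() or str() raises in Python 2 once the text contains a non-ASCII character.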
Reading and Writing Unicode Files
The usual open() function we have been using for file I/O handles text as the str type only. To read and write files in a Unicode encoding, we therefore need a function that can decode and encode for us: the codecs module provides its own codecs.open(), which lets the user specify the encoding. The example below uses a file containing Korean text in UTF-8 encoding:
|
>>> import codecs
>>> f = codecs.open('Korean-UTF8.txt', encoding='utf-8')
>>> lines = f.readlines()
>>> f.close()
>>> lines[0]
u'\uaf43 \r\n'
>>> print lines[0],
꽃
>>> lines[5]
u'\ud558\ub098\uc758 \ubab8\uc9d3\uc5d0 \uc9c0\ub098\uc9c0 \uc54a\uc558\ub2e4. \r\n'
>>> print lines[5],
하나의 몸짓에 지나지 않았다.
>>>
| |
Internally, the strings are stored as Unicode strings; print displays the characters in the more recognizable form. Note that printing will work only if you have the Korean fonts installed on your machine.
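The difference between the two open functions can be seen on a throwaway file. The sketch below writes the UTF-8 bytes of one line by hand, then reads them back both ways; the temporary directory and the file name sample.txt are stand-ins, not the Korean-UTF8.txt file used above.

```python
import codecs
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'sample.txt')   # hypothetical file name

try:
    # Create the file from raw UTF-8 bytes.
    with open(path, 'wb') as f:
        f.write(b'\xea\xbd\x83 \r\n')

    # Plain open() in binary mode returns the undecoded bytes.
    with open(path, 'rb') as f:
        raw_lines = f.readlines()

    # codecs.open() decodes while reading and returns Unicode strings.
    with codecs.open(path, encoding='utf-8') as f:
        lines = f.readlines()
finally:
    shutil.rmtree(tmp_dir)

print(lines[0] == u'\uaf43 \r\n')            # True
print(lines[0].rstrip() == u'\uaf43')        # True: trailing space and line ending stripped
```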
For writing, pass 'w' as the mode argument to codecs.open(). After that, you can use the usual .write() and .writelines() methods, as long as you supply them with Unicode strings:
|
>>> f = codecs.open('foo.txt', 'w', encoding='utf-8')
>>> f.write(u'Hello world!\n')
>>> f.close()
| |
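A round trip is a quick way to confirm what ended up on disk. This sketch assumes a temporary directory for the foo.txt file and uses with blocks so the files are closed automatically.

```python
import codecs
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'foo.txt')

try:
    # Write Unicode strings; codecs encodes them to UTF-8 on the way out.
    with codecs.open(path, 'w', encoding='utf-8') as f:
        f.write(u'Hello world!\n')
        f.writelines([u'\uaf43\n', u'Mom and Dad\n'])

    # Read the file back and decode it again.
    with codecs.open(path, encoding='utf-8') as f:
        written = f.read()
finally:
    shutil.rmtree(tmp_dir)

print(written == u'Hello world!\n\uaf43\nMom and Dad\n')   # True
```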
As a matter of fact, codecs can handle many encodings, not just UTF-8: you can use 'ascii' for ASCII, 'latin_1' for ISO-8859-1, etc. The list of standard encodings available in Python 2 can be found in the "Standard Encodings" section of the codecs module documentation.
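To see that the encoding argument really changes what is written, the sketch below saves the same Unicode string under two encodings and compares the file sizes. The sample word and the file names are made up for the illustration.

```python
import codecs
import os
import shutil
import tempfile

text = u'caf\xe9'                   # four characters; the last is U+00E9
tmp_dir = tempfile.mkdtemp()
sizes = {}

try:
    for encoding in ('latin_1', 'utf-8'):
        path = os.path.join(tmp_dir, encoding + '.txt')
        with codecs.open(path, 'w', encoding=encoding) as f:
            f.write(text)
        sizes[encoding] = os.path.getsize(path)
finally:
    shutil.rmtree(tmp_dir)

print(sizes['latin_1'])             # 4: one byte per character
print(sizes['utf-8'])               # 5: U+00E9 takes two bytes in UTF-8
```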