Original poster | Posted on 2005-9-6 20:59:55
Post by jhuangjiahua
UTF-8 is a variable-length encoding
UTF-16 is a fixed-length encoding
RFC 2781 - UTF-16, an encoding of ISO 10646
2.1 Encoding UTF-16
Encoding of a single character from an ISO 10646 character value to
UTF-16 proceeds as follows. Let U be the character number, no greater
than 0x10FFFF.
1) If U < 0x10000, encode U as a 16-bit unsigned integer and
terminate.
2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF,
U' must be less than or equal to 0xFFFFF. That is, U' can be
represented in 20 bits.
3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and
0xDC00, respectively. These integers each have 10 bits free to
encode the character value, for a total of 20 bits.
4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order
bits of W1 and the 10 low-order bits of U' to the 10 low-order
bits of W2. Terminate.
Graphically, steps 2 through 4 look like:
U' = yyyyyyyyyyxxxxxxxxxx
W1 = 110110yyyyyyyyyy
W2 = 110111xxxxxxxxxx
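
To make the quoted steps concrete, here is a minimal Python sketch of them. The function name utf16_encode is my own, not something from the RFC, and the sketch only follows the four steps quoted above: it does not reject the reserved range 0xD800-0xDFFF. It shows that a character below 0x10000 takes one 16-bit unit while anything above takes two, so UTF-16 is not fixed-length.

```python
def utf16_encode(u: int) -> list:
    """Encode one character value as a list of 16-bit UTF-16 code units."""
    if not 0 <= u <= 0x10FFFF:
        raise ValueError("character value must be no greater than 0x10FFFF")
    # Step 1: values below 0x10000 fit in a single 16-bit unit.
    if u < 0x10000:
        return [u]
    # Step 2: U' = U - 0x10000 fits in 20 bits.
    u_prime = u - 0x10000
    # Steps 3 and 4: the high 10 bits go into W1, the low 10 bits into W2.
    w1 = 0xD800 | (u_prime >> 10)
    w2 = 0xDC00 | (u_prime & 0x3FF)
    return [w1, w2]


# U+4E2D is below 0x10000: one code unit.
print([hex(w) for w in utf16_encode(0x4E2D)])   # ['0x4e2d']
# U+1D11E is above 0xFFFF: two code units (a surrogate pair).
print([hex(w) for w in utf16_encode(0x1D11E)])  # ['0xd834', '0xdd1e']
```

For U+1D11E, U' is 0x0D11E, the high 10 bits are 0x034 and the low 10 bits are 0x11E, which gives W1 = 0xD834 and W2 = 0xDD1E.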