Programming with Unicode Documentation
Release 2011
Victor Stinner
August 22, 2015

Contents

1 About this book
  1.1 License
  1.2 Thanks to
  1.3 Notations
2 Unicode nightmare
3 Definitions
  3.1 Character
  3.2 Glyph
  3.3 Code point
  3.4 Character set (charset)
  3.5 Character string
  3.6 Byte string
  3.7 UTF-8 encoded strings and UTF-16 character strings
  3.8 Encoding
  3.9 Encode a character string
  3.10 Decode a byte string
  3.11 Mojibake
  3.12 Unicode: a Universal Character Set (UCS)
4 Unicode
  4.1 Unicode Character Set
  4.2 Categories
  4.3 Statistics
  4.4 Normalization
5 Charsets and encodings
  5.1 Encodings
  5.2 Popularity
  5.3 Encodings performances
  5.4 Examples
  5.5 Handle undecodable bytes and unencodable characters
  5.6 Other charsets and encodings
6 Historical charsets and encodings
  6.1 ASCII
  6.2 ISO 8859 family
  6.3 CJK: Asian encodings
  6.4 Cyrillic
7 Unicode encodings
  7.1 UTF-8
  7.2 UCS-2, UCS-4, UTF-16 and UTF-32
  7.3 UTF-7
  7.4 Byte order marks (BOM)
  7.5 UTF-16 surrogate pairs
8 How to guess the encoding of a document?
  8.1 Is ASCII?
  8.2 Check for BOM markers
  8.3 Is UTF-8?
  8.4 Libraries
9 Good practices
  9.1 Rules
  9.2 Unicode support levels
  9.3 Test the Unicode support of a program
  9.4 Get the encoding of your inputs
  9.5 Switch from byte strings to character strings
10 Operating systems
  10.1 Windows
  10.2 Mac OS X
  10.3 Locales
  10.4 Filesystems (filenames)
11 Programming languages
  11.1 C language
  11.2 C++
  11.3 Python
  11.4 PHP
  11.5 Perl
  11.6 Java
  11.7 Go and D
12 Libraries
  12.1 Qt library
  12.2 The glib library
  12.3 iconv library
  12.4 ICU libraries
  12.5 libunistring
13 Unicode issues
  13.1 Security vulnerabilities
14 See also

Chapter 1. About this book

The book is written in reStructuredText (reST) syntax and compiled by Sphinx.

I started writing it on 25 September 2010.

1.1 License

This book is distributed under the CC BY-SA 3.0 license.

1.2 Thanks to

Reviewers: Alexander Belopolsky, Antoine Pitrou, Feth Arezki and Nelle Varoquaux, Natal Ngétal.

1.3 Notations

• 0bBBBBBBBB: 8-bit unsigned number written in binary, first digit is the most significant. For example, 0b10000000 is 128.
• 0xHHHH: number written in hexadecimal, e.g. 0xFFFF is 65535.
• 0xHH 0xHH ...: byte sequence with bytes written in hexadecimal, e.g. 0xC3 0xA9 (2 bytes) is the character é (U+00E9) encoded to UTF-8.
• U+HHHH: Unicode character with its code point written in hexadecimal. For example, U+20AC is the “euro sign” character, code point 8,364. Big code points are written with more than 4 hexadecimal digits, e.g. U+10FFFF is the biggest (unallocated) code point of Unicode Character Set 6.0: 1,114,111.
• A—B: range including start and end. Examples:
  – 0x00—0x7F is the range 0 through 127 (128 bytes)
  – U+0000—U+00FF is the range 0 through 255 (256 characters)
• {U+HHHH, U+HHHH, ...}: a character string. For example, {U+0041, U+0042, U+0043} is the string “ABC” (3 characters).

Chapter 2. Unicode nightmare

Unicode is the nightmare of many developers (and users) for different, and sometimes good, reasons.

In the 1980s, only a few people read documents in languages other than their mother tongue and English. A computer supported only a small number of languages; the user configured his region to support the languages of nearby countries. Memory and disks were expensive, so all applications were written to use byte strings with 8-bit encodings: one byte per character was a good compromise.

Today, with the Internet and globalization, we all read and exchange documents from everywhere around the world (even if we don’t understand everything). The problem is that documents rarely indicate their language (encoding), and displaying a document with the wrong encoding leads to a well-known problem: mojibake.

It is difficult to get, or worse, guess the encoding of a document. Except for encodings of the UTF family (coming from the Unicode standard), there is no reliable algorithm for that. We have to rely on statistics to guess the most probable encoding, which is what most Internet browsers do.

Unicode support by operating systems, programming languages and libraries varies a lot. In general, the support is basic or non-existent. Each operating system manages Unicode differently. For example, Windows stores filenames as Unicode, whereas UNIX and BSD operating systems use bytes.

Mixing documents stored as bytes is possible, even if they use different encodings, but leads to mojibake. Because libraries and programs also ignore encode and decode warnings or errors, writing a single character with a diacritic (any non-ASCII character) is sometimes enough to get an error.
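The mixing problem described above can be reproduced in a few lines. This is a minimal sketch, assuming Python 3; the variable names are illustrative and the choice of ISO 8859-1 as the second encoding is an assumption made for the example.

```python
# The same character, é (U+00E9), stored as bytes with two different encodings.
utf8_part = "\u00e9".encode("utf-8")          # 0xC3 0xA9
latin1_part = "\u00e9".encode("iso-8859-1")   # 0xE9

# Concatenating byte strings works, whatever their encodings are...
mixed = utf8_part + latin1_part

# ...but no single encoding decodes the result correctly.
# Decoded as ISO 8859-1, the UTF-8 part turns into mojibake.
as_latin1 = mixed.decode("iso-8859-1")

# Decoded as UTF-8, the ISO 8859-1 part is an invalid sequence.
try:
    as_utf8 = mixed.decode("utf-8")
except UnicodeDecodeError:
    as_utf8 = None

print(repr(as_latin1), as_utf8)
```

Here the first decoding silently yields wrong characters, while the second one fails outright: the two outcomes the paragraph above refers to as mojibake and errors.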
Full Unicode support is complex because the Unicode charset is bigger than any other charset. For example, ISO 8859-1 contains 256 code points including 191 characters, whereas Unicode version 6.0 contains 248,966 assigned code points. The Unicode standard is also more than just a charset: it explains how to display characters (e.g. left-to-right for English and right-to-left for Persian), how to normalize a character string (e.g. precomposed characters versus the decomposed form), etc.

This book explains how to sympathize with Unicode, and how you should modify your program to avoid most, or all, issues related to encodings and Unicode.

Chapter 3. Definitions

3.1 Character

3.2 Glyph

3.3 Code point

A code point is an unsigned integer. The smallest code point is zero. Code points are usually written in hexadecimal, e.g. “0x20AC” (8,364 in decimal).

3.4 Character set (charset)

A character set, abbreviated charset, is a mapping between code points and characters. The mapping has a fixed size. For example, most 7-bit encodings have 128 entries, and most 8-bit encodings have 256 entries. The biggest charset is the Unicode Character Set 6.0 with 1,114,112 entries.

In some charsets, code points are not all contiguous. For example, the cp1252 charset maps code points from 0 through 255, but it has only 251 entries: the code points 0x81, 0x8D, 0x8F, 0x90 and 0x9D are not assigned.

Examples from the ASCII charset: the digit five (“5”, U+0035) is assigned to the code point 0x35 (53 in decimal), and the uppercase letter “A” (U+0041) to the code point 0x41 (65).

The biggest code point depends on the size of the charset. For example, the biggest code point of the ASCII charset is 127 (2^7 − 1).

Charset examples:

Charset                 Code point   Character
ASCII                   0x35         5 (U+0035)
ASCII                   0x41         A (U+0041)
ISO-8859-15             0xA4         € (U+20AC)
Unicode Character Set   0x20AC       € (U+20AC)

3.5 Character string

A character string, or “Unicode string”, is a string where each unit is a character.
Depending on the implementation, each character can be any Unicode character, or only a character in the range U+0000—U+FFFF, a range called the Basic Multilingual Plane (BMP). There are 3 different implementations of character strings:

• array of 32-bit unsigned integers (the UCS-4 encoding): full Unicode range
• array of 16-bit unsigned integers (UCS-2): BMP only
• array of 16-bit unsigned integers with surrogate pairs (UTF-16): full Unicode range

UCS-4 uses twice as much memory as UCS-2, but it supports all Unicode characters. UTF-16 is a compromise between UCS-2 and UCS-4: characters in the BMP range use one UTF-16 unit (16 bits), characters outside this range use two UTF-16 units (a surrogate pair, 32 bits). This variable width is both the advantage and the main disadvantage of this kind of character string.

The length of a character string implemented using UTF-16 is the number of UTF-16 units, and not the number of characters, which is confusing. For example, the U+10FFFF character is encoded as two UTF-16 units: {U+DBFF, U+DFFF}. If the character string only contains characters of the BMP range, the length is the number of characters. Getting the nth character or the length in characters using UTF-16 has a complexity of O(n), whereas it has a complexity of O(1) for UCS-2 and UCS-4 strings.

The Java language, the Qt library and Windows 2000 implement character strings with UTF-16. The C and Python languages use UTF-16 or UCS-4 depending on the size of the wchar_t type (16 or 32 bits) for C, and on the compilation mode (narrow or wide) for Python. Windows 95 uses UCS-2 strings.

See also: UCS-2, UCS-4 and UTF-16 encodings, and surrogate pairs.

3.6 Byte string

A byte string is a character string encoded to an encoding. It is implemented as an array of 8-bit unsigned integers. It can be called by its encoding. For example, a byte string encoded to ASCII is called an “ASCII encoded string”, or simply an “ASCII string”.

The character range supported by a byte string depends on its encoding, because an encoding is associated with a charset. For example, an ASCII string can only store characters in the range U+0000—U+007F.
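As a minimal sketch of these two paragraphs, assuming Python 3 (where `str` is a character string and `bytes` is a byte string), the following encodes one character string to several encodings. The sample text and variable names are illustrative only.

```python
text = "caf\u00e9"  # "café": 4 characters, the last one is U+00E9

# The same character string gives different byte strings per encoding.
latin1_bytes = text.encode("iso-8859-1")  # 1 byte per character
utf8_bytes = text.encode("utf-8")         # é takes 2 bytes: 0xC3 0xA9

# A byte string is an array of 8-bit unsigned integers.
utf8_values = list(utf8_bytes)

# ASCII only covers U+0000—U+007F, so U+00E9 cannot be encoded.
try:
    ascii_bytes = text.encode("ascii")
except UnicodeEncodeError:
    ascii_bytes = None

print(len(latin1_bytes), len(utf8_bytes), utf8_values, ascii_bytes)
```

Nothing in `latin1_bytes` or `utf8_bytes` records which encoding produced them, which is the point the next paragraph builds on.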
The encoding is not stored explicitly in a byte string. If the encoding is not documented or attached to the byte string, it has to be guessed, which is a difficult task. If a byte string is decoded from the wrong encoding, it will not be displayed correctly, leading to a well-known issue: mojibake.

The same problem occurs if two byte strings encoded to different encodings are concatenated. Never concatenate byte strings encoded to different encodings! Use character strings, instead of byte strings, to avoid mojibake issues.

PHP5 only supports byte strings. In the C language, “strings” are usually byte strings, which are implemented as the char* type (or const char*).

See also: The char* type of the C language and the mojibake issue.

3.7 UTF-8 encoded strings and UTF-16 character strings

A UTF-8 string is a particular case, because UTF-8 is able to encode all Unicode characters [1]. But a UTF-8 string is not a Unicode string, because the string unit is a byte and not a character: you can get an individual byte of a multibyte character.

[1] A UTF-8 encoder should not encode surrogate characters (U+D800—U+DFFF).
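The unit-versus-character distinction can be checked directly. The sketch below assumes Python 3, whose `str` type indexes by character and whose `bytes` type indexes by byte; it also counts the UTF-16 units of U+10FFFF by encoding explicitly to UTF-16 (big endian, no BOM), since Python 3 does not expose UTF-16 units on `str` itself.

```python
char = "\u00e9"               # é, one character
utf8 = char.encode("utf-8")   # 0xC3 0xA9, two bytes

# Indexing a UTF-8 string gives a byte, not a character.
first_byte = utf8[0]

# A lone byte of a multibyte character is not valid UTF-8 on its own.
try:
    utf8[:1].decode("utf-8")
    half_is_valid = True
except UnicodeDecodeError:
    half_is_valid = False

# U+10FFFF is one character but two 16-bit UTF-16 units (a surrogate pair).
biggest = chr(0x10FFFF)
utf16 = biggest.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

# Python 3's UTF-8 encoder refuses a lone surrogate character.
try:
    "\ud800".encode("utf-8")
    surrogate_encoded = True
except UnicodeEncodeError:
    surrogate_encoded = False

print(len(char), len(utf8), hex(first_byte), [hex(u) for u in units])
```

The computed units match the {U+DBFF, U+DFFF} pair given in section 3.5, and the last check is one implementation's behaviour consistent with the footnote, not a statement about every UTF-8 encoder.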
