Programming with Unicode Documentation
Release 2011
Victor Stinner
August 22, 2015

Contents

1 About this book
   1.1 License
   1.2 Thanks to
   1.3 Notations
2 Unicode nightmare
3 Definitions
   3.1 Character
   3.2 Glyph
   3.3 Code point
   3.4 Character set (charset)
   3.5 Character string
   3.6 Byte string
   3.7 UTF-8 encoded strings and UTF-16 character strings
   3.8 Encoding
   3.9 Encode a character string
   3.10 Decode a byte string
   3.11 Mojibake
   3.12 Unicode: a Universal Character Set (UCS)
4 Unicode
   4.1 Unicode Character Set
   4.2 Categories
   4.3 Statistics
   4.4 Normalization
5 Charsets and encodings
   5.1 Encodings
   5.2 Popularity
   5.3 Encodings performances
   5.4 Examples
   5.5 Handle undecodable bytes and unencodable characters
   5.6 Other charsets and encodings
6 Historical charsets and encodings
   6.1 ASCII
   6.2 ISO 8859 family
   6.3 CJK: Asian encodings
   6.4 Cyrillic
7 Unicode encodings
   7.1 UTF-8
   7.2 UCS-2, UCS-4, UTF-16 and UTF-32
   7.3 UTF-7
   7.4 Byte order marks (BOM)
   7.5 UTF-16 surrogate pairs
8 How to guess the encoding of a document?
   8.1 Is ASCII?
   8.2 Check for BOM markers
   8.3 Is UTF-8?
   8.4 Libraries
9 Good practices
   9.1 Rules
   9.2 Unicode support levels
   9.3 Test the Unicode support of a program
   9.4 Get the encoding of your inputs
   9.5 Switch from byte strings to character strings
10 Operating systems
   10.1 Windows
   10.2 Mac OS X
   10.3 Locales
   10.4 Filesystems (filenames)
11 Programming languages
   11.1 C language
   11.2 C++
   11.3 Python
   11.4 PHP
   11.5 Perl
   11.6 Java
   11.7 Go and D
12 Libraries
   12.1 Qt library
   12.2 The glib library
   12.3 iconv library
   12.4 ICU libraries
   12.5 libunistring
13 Unicode issues
   13.1 Security vulnerabilities
14 See also

1 About this book

The book is written in reStructuredText (reST) syntax and compiled by Sphinx.

I started writing it on 25 September 2010.

1.1 License

This book is distributed under the CC BY-SA 3.0 license.

1.2 Thanks to

Reviewers: Alexander Belopolsky, Antoine Pitrou, Feth Arezki, Nelle Varoquaux and Natal Ngétal.

1.3 Notations

• 0bBBBBBBBB: 8-bit unsigned number written in binary, most significant digit first. For example, 0b10000000 is 128.
• 0xHHHH: number written in hexadecimal, e.g. 0xFFFF is 65,535.
• 0xHH 0xHH ...: byte sequence with bytes written in hexadecimal, e.g. 0xC3 0xA9 (2 bytes) is the character é (U+00E9) encoded to UTF-8.
• U+HHHH: Unicode character with its code point written in hexadecimal. For example, U+20AC is the "euro sign" character, code point 8,364. Big code points are written with more than 4 hexadecimal digits, e.g. U+10FFFF is the biggest (unallocated) code point of the Unicode Character Set 6.0: 1,114,111.
• A—B: range including start and end. Examples:
   – 0x00—0x7F is the range 0 through 127 (128 bytes)
   – U+0000—U+00FF is the range 0 through 255 (256 characters)
• {U+HHHH, U+HHHH, ...}: a character string. For example, {U+0041, U+0042, U+0043} is the string "ABC" (3 characters).

2 Unicode nightmare

Unicode is the nightmare of many developers (and users) for various, and sometimes good, reasons.

In the 1980s, only a few people read documents in languages other than their mother tongue and English. A computer supported only a small number of languages; users configured their region to support the languages of neighboring countries. Memory and disks were expensive, so all applications were written to use byte strings with 8-bit encodings: one byte per character was a good compromise.

Today, with the Internet and globalization, we all read and exchange documents from everywhere around the world (even if we don't understand everything). The problem is that documents rarely indicate their language (encoding), and displaying a document with the wrong encoding leads to a well-known problem: mojibake.

It is difficult to get, or worse, to guess the encoding of a document. Except for encodings of the UTF family (coming from the Unicode standard), there is no reliable algorithm for that. We have to rely on statistics to guess the most probable encoding, which is what most Internet browsers do.

Unicode support by operating systems, programming languages and libraries varies a lot. In general, the support is basic or non-existent. Each operating system manages Unicode differently. For example, Windows stores filenames as Unicode, whereas UNIX and BSD operating systems use bytes.

Mixing documents stored as bytes is possible, even if they use different encodings, but it leads to mojibake. Because libraries and programs also often ignore encoding and decoding warnings or errors, writing a single character with a diacritic (any non-ASCII character) is sometimes enough to get an error.
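The two failure modes described above can be reproduced in a few lines. The following is only a minimal sketch, assuming Python 3 and its built-in "utf-8" and "iso-8859-1" codecs; the variable names are illustrative and do not come from the book.

```python
# The same word stored as bytes in two different encodings.
utf8_bytes = "caf\u00e9".encode("utf-8")         # 0x63 0x61 0x66 0xC3 0xA9
latin1_bytes = "caf\u00e9".encode("iso-8859-1")  # 0x63 0x61 0x66 0xE9

# Decoding with the wrong encoding raises no error: it silently
# produces mojibake (two wrong characters instead of one e-acute).
mojibake = utf8_bytes.decode("iso-8859-1")

# Mixing both byte strings yields data that is no longer valid UTF-8,
# so a strict decoder fails on the single non-ASCII character.
mixed = utf8_bytes + b" " + latin1_bytes
try:
    mixed.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

Here the wrong decoder goes unnoticed, while the strict UTF-8 decoder stops at the lone 0xE9 byte.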
Full Unicode support is complex because the Unicode charset is bigger than any other charset. For example, ISO 8859-1 contains 256 code points including 191 characters, whereas Unicode version 6.0 contains 248,966 assigned code points. The Unicode standard is more than just a charset: it also explains how to display characters (e.g. left-to-right for English and right-to-left for Persian), how to normalize a character string (e.g. precomposed characters versus the decomposed form), etc.

This book explains how to sympathize with Unicode, and how you should modify your program to avoid most, or all, issues related to encodings and Unicode.

3 Definitions

3.1 Character

3.2 Glyph

3.3 Code point

A code point is an unsigned integer. The smallest code point is zero. Code points are usually written in hexadecimal, e.g. "0x20AC" (8,364 in decimal).

3.4 Character set (charset)

A character set, abbreviated charset, is a mapping between code points and characters. The mapping has a fixed size. For example, most 7-bit encodings have 128 entries, and most 8-bit encodings have 256 entries. The biggest charset is the Unicode Character Set 6.0 with 1,114,112 entries.

In some charsets, code points are not all contiguous. For example, the cp1252 charset maps code points from 0 through 255, but it has only 251 entries: the code points 0x81, 0x8D, 0x8F, 0x90 and 0x9D are not assigned.

Examples from the ASCII charset: the digit five ("5", U+0035) is assigned to the code point 0x35 (53 in decimal), and the uppercase letter "A" (U+0041) to the code point 0x41 (65).

The biggest code point depends on the size of the charset. For example, the biggest code point of the ASCII charset is 127 (2^7 − 1).

Charset examples:

Charset                 Code point   Character
ASCII                   0x35         5 (U+0035)
ASCII                   0x41         A (U+0041)
ISO-8859-15             0xA4         € (U+20AC)
Unicode Character Set   0x20AC       € (U+20AC)

3.5 Character string

A character string, or "Unicode string", is a string where each unit is a character.
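As a small illustration of a string whose units are characters, here is a sketch that assumes Python 3.3 or later, where the built-in str type is indexed by code point. It reuses the characters of the charset table in section 3.4; the variable names are only illustrative.

```python
# A character string of three characters: euro sign, digit five, letter A.
text = "\u20ac5A"

# Each unit of the string is one character, identified by its code point
# in the Unicode Character Set.
code_points = [ord(ch) for ch in text]

# In the ISO-8859-15 charset, the same euro sign has the code point 0xA4.
euro_from_iso_8859_15 = bytes([0xA4]).decode("iso-8859-15")
```

The string has a length of 3, whatever the size of the code points it contains.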
Depending on the implementation, each character can be any Unicode character, or only a character in the range U+0000—U+FFFF, a range called the Basic Multilingual Plane (BMP). There are 3 different implementations of character strings:

• array of 32-bit unsigned integers (the UCS-4 encoding): full Unicode range
• array of 16-bit unsigned integers (UCS-2): BMP only
• array of 16-bit unsigned integers with surrogate pairs (UTF-16): full Unicode range

UCS-4 uses twice as much memory as UCS-2, but it supports all Unicode characters. UTF-16 is a compromise between UCS-2 and UCS-4: characters in the BMP range use one UTF-16 unit (16 bits), whereas characters outside this range use two UTF-16 units (a surrogate pair, 32 bits). This advantage is also the main disadvantage of this kind of character string.

The length of a character string implemented using UTF-16 is the number of UTF-16 units, and not the number of characters, which is confusing. For example, the U+10FFFF character is encoded as two UTF-16 units: {U+DBFF, U+DFFF}. If the character string only contains characters of the BMP range, the length is the number of characters. Getting the nth character or the length in characters using UTF-16 has a complexity of O(n), whereas it has a complexity of O(1) for UCS-2 and UCS-4 strings.

The Java language, the Qt library and Windows 2000 implement character strings with UTF-16. The C and Python languages use UTF-16 or UCS-4 depending on the size of the wchar_t type (16 or 32 bits) for C, and on the compilation mode (narrow or wide) for Python. Windows 95 uses UCS-2 strings.

See also: UCS-2, UCS-4 and UTF-16 encodings, and surrogate pairs.

3.6 Byte string

A byte string is a character string encoded to an encoding. It is implemented as an array of 8-bit unsigned integers. It can be called by its encoding. For example, a byte string encoded to ASCII is called an "ASCII encoded string", or simply an "ASCII string".

The character range supported by a byte string depends on its encoding, because an encoding is associated with a charset. For example, an ASCII string can only store characters in the range U+0000—U+007F.
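The ASCII case can be checked directly. This is a minimal sketch, assuming Python 3, where the bytes type plays the role of the byte string and the "ascii" codec is strict by default; the names are illustrative.

```python
# Encode a character string to an ASCII string: an array of 8-bit integers.
ascii_string = "ABC".encode("ascii")
units = list(ascii_string)

# U+007F is the last character an ASCII string can store.
last_ascii = "\u007f".encode("ascii")

# U+00E9 is outside the range U+0000-U+007F, so the strict encoder refuses it.
try:
    "\u00e9".encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False
```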
The encoding is not stored explicitly in a byte string. If the encoding is not documented or attached to the byte string, it has to be guessed, which is a difficult task. If a byte string is decoded from the wrong encoding, it will not be displayed correctly, leading to a well-known issue: mojibake.

The same problem occurs if two byte strings encoded to different encodings are concatenated. Never concatenate byte strings encoded to different encodings! Use character strings, instead of byte strings, to avoid mojibake issues.

PHP5 only supports byte strings. In the C language, "strings" are usually byte strings, which are implemented as the char* type (or const char*).

See also: The char* type of the C language and the mojibake issue.

3.7 UTF-8 encoded strings and UTF-16 character strings

A UTF-8 string is a particular case, because UTF-8 is able to encode all Unicode characters [1]. But a UTF-8 string is not a Unicode string, because the string unit is a byte and not a character: you can get an individual byte of a multibyte character.

[1] A UTF-8 encoder should not encode surrogate characters (U+D800—U+DFFF).
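The difference between string units and characters can be made concrete with the examples used in the definitions. The sketch below assumes Python 3.3 or later; utf16_units is a hypothetical helper written for this illustration, built on the standard "utf-16-be" codec and the struct module.

```python
import struct


def utf16_units(text):
    """Return the 16-bit UTF-16 units of text as a list of integers."""
    data = text.encode("utf-16-be")  # big endian, no byte order mark
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


# U+00E9 is one character, but two units (bytes) in a UTF-8 string.
utf8_string = "\u00e9".encode("utf-8")
first_byte = utf8_string[0]  # 0xC3 alone is not a character

# U+10FFFF is one character, but two units in a UTF-16 string.
surrogate_pair = utf16_units("\U0010ffff")

# A character of the BMP range uses a single UTF-16 unit.
bmp_units = utf16_units("\u20ac")
```

Indexing the UTF-8 string returns the lead byte 0xC3, which is meaningless on its own; in the same way, each half of the surrogate pair is not a character.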

