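When I parse R code containing non-native characters on Windows, those characters seem to be turned into their Unicode escape representation, for example: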
    Encoding('ğ')
    # [1] "UTF-8"
    parse(text="'ğ'")
    # expression('<U+011F>')
    parse(text="'ğ'", encoding='UTF-8')
    # expression('<U+011F>')
    deparse(parse(text="'ğ'")[1])
    # [1] "expression(\"<U+011F>\")"
    eval(parse(text="'ğ'"))
    # [1] "<U+011F>"
Since my locale is Simplified Chinese, I can parse code containing Chinese characters without this problem, for example:
    parse(text="'你好'")
    # expression('你好')
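My question is: how can I preserve characters such as the letter ğ in this example? Or, at least, how can I "reconstruct" the original characters after I have parsed the expression?

My session info: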
    > sessionInfo()
    R version 2.15.2 (2012-10-26)
    Platform: i386-w64-mingw32/i386 (32-bit)

    locale:
    [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
    [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
    [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
    [4] LC_NUMERIC=C
    [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base
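The root of the problem is (quoting the R Installation and Administration manual): "R supports all the character sets that the underlying OS can handle. These are interpreted according to the current locale". Unfortunately, Windows has no locale supporting UTF-8.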
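Now, the good news is that Rgui apparently supports UTF-8 (scroll down to 2.7.0 > Internationalization). However, the R parser only handles characters that the current locale supports. So the solution that worked for me is to temporarily change the R locale with Sys.setlocale() for parsing, and afterwards convert the result to UTF-8 with iconv():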
    > Sys.getlocale()
    [1] "LC_COLLATE=Greek_Greece.1253;LC_CTYPE=Greek_Greece.1253;LC_MONETARY=Greek_Greece.1253;LC_NUMERIC=C;LC_TIME=Greek_Greece.1253"
    > orig.locale <- Sys.getlocale("LC_CTYPE")
    > parse(text="'你好'")
    expression('<U+4F60><U+597D>')
    > Sys.setlocale(locale="Chinese")
    [1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936"
    > a <- parse(text="'你好'")
    > a
    expression('你好')
    > Sys.setlocale(locale="Turkish")
    [1] "LC_COLLATE=Turkish_Turkey.1254;LC_CTYPE=Turkish_Turkey.1254;LC_MONETARY=Turkish_Turkey.1254;LC_NUMERIC=C;LC_TIME=Turkish_Turkey.1254"
    > b <- parse(text="'ğ'")
    > b
    expression('ğ')
    > Sys.setlocale(locale=orig.locale)
    [1] "LC_COLLATE=Greek_Greece.1253;LC_CTYPE=Greek_Greece.1253;LC_MONETARY=Greek_Greece.1253;LC_NUMERIC=C;LC_TIME=Greek_Greece.1253"
    > a
    [1] expression('ΔγΊΓ')
    > b
    [1] expression('π')
    > ai <- iconv(a, from="CP936", to="UTF-8")
    > ai
    [1] "你好"
    > bi <- iconv(b, from="CP1254", to="UTF-8")
    > bi
    [1] "ğ"
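The switch-parse-restore steps above could be wrapped in a small helper so the original locale is restored even if parsing fails. This is only a sketch: `with_ctype_locale` is a hypothetical name, not a base R function, and it changes only LC_CTYPE rather than every category as the session above does. The locale name "Turkish" and the code page "CP1254" are taken from that session and are assumed to be valid on the Windows machine in use.

```r
# Hypothetical helper (not part of base R): evaluate `expr` while LC_CTYPE is
# temporarily set to `locale`, then restore the previous setting on exit.
with_ctype_locale <- function(locale, expr) {
  old <- Sys.getlocale("LC_CTYPE")
  # Registered first, so the old locale comes back even if `expr` errors
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  # Sys.setlocale() returns "" (with a warning) if the request cannot be honored
  new <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  if (!nzchar(new)) stop("locale not available: ", locale)
  # `expr` is a lazy promise, so it is evaluated here, under the new locale
  force(expr)
}

# Usage on Windows: parse and evaluate under a Turkish locale, and convert the
# resulting string to UTF-8 before the original locale is restored.
if (.Platform$OS.type == "windows") {
  b <- with_ctype_locale("Turkish", {
    s <- eval(parse(text = "'ğ'"))
    iconv(s, from = "CP1254", to = "UTF-8")
  })
}
```

Converting inside the helper call means the string is already UTF-8 by the time the session returns to its original locale, so there is no need to remember afterwards which code page each result came from.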
Hope this helps!