windows – How to preserve multi-byte characters after parsing

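When I parse R code under Windows that contains characters not native to my locale, those characters seem to be turned into their Unicode escape representation, for example: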
Encoding('ğ')
# [1] "UTF-8"
parse(text="'ğ'")
# expression('<U+011F>')
parse(text="'ğ'",encoding='UTF-8')
# expression('<U+011F>')
deparse(parse(text="'ğ'")[1])
# [1] "expression(\"<U+011F>\")"
eval(parse(text="'ğ'"))
# [1] "<U+011F>"

Since my locale is Simplified Chinese, I can parse code containing Chinese characters without this problem, for example:

parse(text="'你好'")
# expression('你好')

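My question is: how can I preserve characters such as the letter ğ in this example? Or, at least, how can I "reconstruct" the original characters after I have parsed the expression?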
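
One possible way to "reconstruct" the characters, sketched under the assumption that the escapes always take the `<U+XXXX>` form shown above, is to decode each marker's hexadecimal code point with `intToUtf8()`. The helper name `restore_unicode_escapes` is hypothetical, and the function would also rewrite any text that literally contains such a marker.

```r
# Hypothetical helper: turn "<U+XXXX>" markers back into UTF-8 characters.
restore_unicode_escapes <- function(x) {
  pattern <- "<U\\+[0-9A-Fa-f]{4,8}>"
  m <- gregexpr(pattern, x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    # Drop the leading "<U+" and the trailing ">" to keep only the hex digits.
    hex <- substring(tokens, 4, nchar(tokens) - 1)
    # Decode each hexadecimal code point into a one-character UTF-8 string.
    vapply(hex, function(h) intToUtf8(strtoi(h, base = 16L)),
           character(1), USE.NAMES = FALSE)
  })
  x
}

# Example: apply it to the deparsed text of a parsed expression.
restored <- restore_unicode_escapes("expression('<U+011F>')")
utf8ToInt(restore_unicode_escapes("<U+011F>"))  # 287, i.e. U+011F
```

The result is marked as UTF-8, so whether it prints correctly still depends on what the console can display.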

My session info:

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
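
The root of the problem is (quoting the R Installation and Administration manual): "R supports all the character sets that the underlying OS can handle. These are interpreted according to the current locale." Unfortunately, Windows has no locale supporting UTF-8.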
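
To see what the current session's locale can represent, `l10n_info()` reports whether the locale is multi-byte and whether it is UTF-8. This is only a small diagnostic sketch. The manual text quoted above describes Windows as of R 2.15, and later R builds on newer Windows versions may well report something different.

```r
# Inspect the character-set capabilities of the current locale.
info <- l10n_info()
cat("LC_CTYPE:         ", Sys.getlocale("LC_CTYPE"), "\n")
cat("Multi-byte locale:", info$MBCS, "\n")
cat("UTF-8 locale:     ", info[["UTF-8"]], "\n")
```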

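Now, the good news is that Rgui apparently supports UTF-8 (scroll down to 2.7.0 > Internationalization). However, the R parser only works with characters supported in the current locale. So the solution that worked for me is to temporarily change the R locale with Sys.setlocale() for parsing, and later convert the result to UTF-8 with iconv():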

> Sys.getlocale()
[1] "LC_COLLATE=Greek_Greece.1253;LC_CTYPE=Greek_Greece.1253;LC_MONETARY=Greek_Greece.1253;LC_NUMERIC=C;LC_TIME=Greek_Greece.1253"
> orig.locale <- Sys.getlocale("LC_CTYPE")
> parse(text="'你好'")
expression('<U+4F60><U+597D>')
> Sys.setlocale(locale="Chinese")
[1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936"
> a <- parse(text="'你好'")
> a
expression('你好')
> Sys.setlocale(locale="Turkish")
[1] "LC_COLLATE=Turkish_Turkey.1254;LC_CTYPE=Turkish_Turkey.1254;LC_MONETARY=Turkish_Turkey.1254;LC_NUMERIC=C;LC_TIME=Turkish_Turkey.1254"
> b <- parse(text="'ğ'")
> b
expression('ğ')
> Sys.setlocale(locale=orig.locale)
[1] "LC_COLLATE=Greek_Greece.1253;LC_CTYPE=Greek_Greece.1253;LC_MONETARY=Greek_Greece.1253;LC_NUMERIC=C;LC_TIME=Greek_Greece.1253"
> a
[1] expression('ΔγΊΓ')
> b
[1] expression('π')
> ai <- iconv(a,from="CP936",to="UTF-8")
> ai
[1] "你好"
> bi <- iconv(b,from="CP1254",to="UTF-8")
> bi
[1] "ğ"
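
The odd-looking output for `a` and `b` under the Greek locale is the same bytes read through a different code page. A minimal sketch of this, assuming the platform's iconv knows the code page names `CP1254`, `CP1253` and `CP936`:

```r
# In CP1254 (Turkish), the letter ğ is the single byte 0xF0.
b_byte <- rawToChar(as.raw(0xF0))
as_turkish <- iconv(b_byte, from = "CP1254", to = "UTF-8")
as_greek   <- iconv(b_byte, from = "CP1253", to = "UTF-8")
utf8ToInt(as_turkish)  # 287, i.e. U+011F (ğ)
utf8ToInt(as_greek)    # 960, i.e. U+03C0 (π)

# In CP936 (Simplified Chinese), 你好 is the four bytes C4 E3 BA C3.
a_bytes <- rawToChar(as.raw(c(0xC4, 0xE3, 0xBA, 0xC3)))
as_chinese <- iconv(a_bytes, from = "CP936", to = "UTF-8")
utf8ToInt(as_chinese)  # 20320 22909, i.e. U+4F60 U+597D
```

This is why `iconv()` has to be told the code page that was active when the text was parsed, not the one in effect when the result is printed.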

Hope this helps!
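
For reuse, the save, switch and restore steps from the session above could be wrapped in a small function that always restores the locale, even if parsing fails. This is a sketch, not a tested recipe: `with_ctype_locale` is a hypothetical name, and it assumes that changing only `LC_CTYPE` is enough for the parser, whereas the session above changed all categories.

```r
# Hypothetical helper: evaluate `expr` while LC_CTYPE is temporarily `locale`.
with_ctype_locale <- function(locale, expr) {
  old <- Sys.getlocale("LC_CTYPE")
  # Restore the previous locale on exit, whether or not `expr` succeeds.
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  # Sys.setlocale() returns "" (with a warning) if the request is refused.
  new <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  if (!nzchar(new)) stop("locale not available on this system: ", locale)
  # `expr` is a lazy argument, so it is only evaluated here, under the new locale.
  force(expr)
}

# Portable demonstration using the "C" locale, which should exist everywhere.
inside <- with_ctype_locale("C", Sys.getlocale("LC_CTYPE"))
```

On Windows, a call such as `with_ctype_locale("Turkish", parse(text = "'ğ'"))` would correspond to the manual steps in the session. The result would still be in that locale's code page and would need the same `iconv()` conversion afterwards.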
