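UTF-8 characters are destroyed when processed with the JSON library (this may be similar to "Problem with decoding unicode JSON in perl", but setting binmode only creates another problem).
I have reduced the problem to the following example: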
    (hlovdal) localhost:/tmp/my_test>cat my_test.pl
    #!/usr/bin/perl -w

    use strict;
    use warnings;
    use JSON;
    use File::Slurp;
    use Getopt::Long;
    use Encode;

    my $set_binmode = 0;
    GetOptions("set-binmode" => \$set_binmode);

    if ($set_binmode) {
        binmode(STDIN,  ":encoding(UTF-8)");
        binmode(STDOUT, ":encoding(UTF-8)");
        binmode(STDERR, ":encoding(UTF-8)");
    }

    sub check {
        my $text = shift;
        return "is_utf8(): " . (Encode::is_utf8($text) ? "1" : "0")
             . ",is_utf8(1): " . (Encode::is_utf8($text, 1) ? "1" : "0") . ". ";
    }

    my $my_test   = "hei på deg";
    my $json_text = read_file('my_test.json');
    my $hash_ref  = JSON->new->utf8->decode($json_text);

    print check($my_test), "\$my_test = $my_test\n";
    print check($json_text), "\$json_text = $json_text";
    print check($$hash_ref{'my_test'}), "\$\$hash_ref{'my_test'} = " . $$hash_ref{'my_test'} . "\n";
    (hlovdal) localhost:/tmp/my_test>
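When running the test, the text is for some reason crunched into ISO-8859-1. Setting binmode sort of solves it, but then causes double encoding of the other strings.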
    (hlovdal) localhost:/tmp/my_test>cat my_test.json
    { "my_test" : "hei på deg" }
    (hlovdal) localhost:/tmp/my_test>file my_test.json
    my_test.json: UTF-8 Unicode text
    (hlovdal) localhost:/tmp/my_test>hexdump -c my_test.json
    0000000   {       "   m   y   _   t   e   s   t   "       :       "   h
    0000010   e   i       p 303 245       d   e   g   "       }  \n
    000001e
    (hlovdal) localhost:/tmp/my_test>
    (hlovdal) localhost:/tmp/my_test>perl my_test.pl
    is_utf8(): 0,is_utf8(1): 0. $my_test = hei på deg
    is_utf8(): 0,is_utf8(1): 0. $json_text = { "my_test" : "hei på deg" }
    is_utf8(): 1,is_utf8(1): 1. $$hash_ref{'my_test'} = hei p� deg
    (hlovdal) localhost:/tmp/my_test>perl my_test.pl --set-binmode
    is_utf8(): 0,is_utf8(1): 0. $my_test = hei pÃ¥ deg
    is_utf8(): 0,is_utf8(1): 0. $json_text = { "my_test" : "hei pÃ¥ deg" }
    is_utf8(): 1,is_utf8(1): 1. $$hash_ref{'my_test'} = hei på deg
    (hlovdal) localhost:/tmp/my_test>
What is causing this, and how can it be solved?
This is on a newly installed and up-to-date Fedora 15 system.
    (hlovdal) localhost:/tmp/my_test>perl --version | grep version
    This is perl 5, version 12, subversion 4 (v5.12.4) built for x86_64-linux-thread-multi
    (hlovdal) localhost:/tmp/my_test>rpm -q perl-JSON
    perl-JSON-2.51-1.fc15.noarch
    (hlovdal) localhost:/tmp/my_test>locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=
    (hlovdal) localhost:/tmp/my_test>
Update: Adding use utf8 does not solve it; the characters are still not handled correctly (although slightly differently from before):
    (hlovdal) localhost:/tmp/my_test>perl my_test.pl
    is_utf8(): 1,is_utf8(1): 1. $my_test = hei p� deg
    is_utf8(): 0,is_utf8(1): 1. $$hash_ref{'my_test'} = hei p� deg
    (hlovdal) localhost:/tmp/my_test>perl my_test.pl --set-binmode
    is_utf8(): 1,is_utf8(1): 1. $my_test = hei på deg
    is_utf8(): 0,is_utf8(1): 1. $$hash_ref{'my_test'} = hei på deg
    (hlovdal) localhost:/tmp/my_test>
As perlunifaq states:
Can I use Unicode in my Perl sources?
Yes, you can! If your sources are UTF-8 encoded, you can indicate that
with the use utf8 pragma.
    use utf8;
This doesn't do anything to your input, or to your output. It only
influences the way your sources are read. You can use Unicode in string
literals, in identifiers (but they still have to be "word characters"
according to \w), and even in custom delimiters.
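To make the quoted point concrete, here is a small sketch of what the pragma changes. It assumes the script file itself is saved as UTF-8; the variable names are made up for illustration. The same two-letter literal is read once as raw bytes and once as characters:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8, so "å" is stored as the
# two bytes 0xC3 0xA5 on disk.

# Without the pragma, the literal is read byte by byte.
my $bytes = do { no utf8; "på" };

# With the pragma (it is lexically scoped), the literal is decoded
# into characters while the source is being read.
my $chars = do { use utf8; "på" };

printf "without use utf8: length %d\n", length $bytes;   # 3
printf "with use utf8:    length %d\n", length $chars;   # 2
```

Nothing here touches STDIN or STDOUT, which is why the pragma alone does not change how file input or printed output is encoded.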
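Solution
You saved the program as UTF-8, but forgot to tell Perl. Add use utf8;.
Also, you are making the program too complicated. The JSON functions DWYM (do what you mean). To inspect the contents of your variables, use Devel::Peek.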
    use utf8; # for the following line
    my $my_test = 'hei på deg';

    use Devel::Peek qw(Dump);
    use File::Slurp qw(read_file);
    use JSON qw(decode_json);

    # decode_json expects UTF-8 encoded bytes and returns character strings
    my $hash_ref = decode_json(read_file('my_test.json'));

    Dump $hash_ref; # Perl character strings
    Dump $my_test;  # Perl character string
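For completeness, here is a minimal sketch of how the pieces might fit together so that both strings print the same way. Some assumptions: the file is saved as UTF-8, the JSON text is inlined as bytes in place of reading my_test.json so the example runs on its own, and the core module JSON::PP is used because it ships with Perl (decode_json from the JSON module should behave the same way here).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # string literals below are UTF-8 in the source
use JSON::PP qw(decode_json);          # assumption: stand-in for JSON's decode_json

# Encode character strings exactly once, at the output boundary.
binmode(STDOUT, ':encoding(UTF-8)');

my $my_test = 'hei på deg';            # a character string, thanks to use utf8

# Assumption: stands in for the raw bytes of my_test.json (å = 0xC3 0xA5).
my $json_bytes = qq({ "my_test" : "hei p\xC3\xA5 deg" }\n);

# decode_json takes UTF-8 bytes and returns Perl character strings.
my $hash_ref = decode_json($json_bytes);

print "\$my_test = $my_test\n";
print "\$\$hash_ref{'my_test'} = $$hash_ref{'my_test'}\n";
print "strings are equal\n" if $my_test eq $$hash_ref{'my_test'};
```

With the output layer in place, anything still held as raw bytes (such as the undecoded $json_text in the question) would be encoded a second time when printed, which matches the double encoding seen with --set-binmode. Decode such strings first, for example with Encode::decode('UTF-8', ...), before printing them.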