Take the contact page of the US Embassy in the Czech Republic as an example (http://prague.usembassy.gov/contact.html). All I want to pull back is the address:
Address: Tržiště 15 118 01 Praha 1 – Malá Strana Czech Republic
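Firefox displays this correctly using the character encoding UTF-8, which is the same charset the web page declares in its header. But when I try to pull it back with Perl and write it to a file, the encoding still looks messed up, even though I use decoded_content on the UserAgent response or Encode::decode.
I have tried running a regex over the data to check whether the error only creeps in when the data is printed (i.e. whether it is correct internally in Perl), but the error seems to lie in how Perl handles the encoding.
Here is my code: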
#!/usr/bin/perl
require Encode;
require LWP::UserAgent;
use utf8;

my $ua = LWP::UserAgent->new;
$ua->timeout(30);
$ua->env_proxy;

my $output_file;
$output_file = "C:/Documents and Settings/ian/Desktop/utf8test.txt";
open (OUTPUTFILE,">$output_file") or die("Could not open output file $output_file: $!" );
binmode OUTPUTFILE,":utf8";
binmode STDOUT,":utf8";

# US embassy in Czech Republic webpage
$url = "http://prague.usembassy.gov/contact.html";
$ua_response = $ua->get($url);
if (!$ua_response->is_success) { die "Couldn't get data from $url";}

print 'CONTENT TYPE: '.$ua_response->content_charset."\n";
print OUTPUTFILE 'CONTENT TYPE: '.$ua_response->content_charset."\n";

my $content_not_decoded;
my $content_ua_decoded;
my $content_Endode_decoded;
my $content_double_decoded;

$ua_response->content =~ /<p><b>Address(.*?)<\/p>/;
$content_not_decoded = $1;
$ua_response->decoded_content =~ /<p><b>Address(.*?)<\/p>/;
$content_ua_decoded = $1;
Encode::decode_utf8($ua_response->content) =~ /<p><b>Address(.*?)<\/p>/;
$content_Endode_decoded = $1;
Encode::decode_utf8($ua_response->content) =~ /<p><b>Address(.*?)<\/p>/;
$content_double_decoded = $1;

# get the content without decoding
print 'UNDECODED CONTENT:'.$content_not_decoded."\n";
print OUTPUTFILE 'UNDECODED CONTENT:'.$content_not_decoded."\n";

# print the decoded content
print 'DECODED CONTENT:'.$content_ua_decoded."\n";
print OUTPUTFILE 'DECODED CONTENT:'.$content_ua_decoded."\n";

# use Encode to decode the content
print 'ENCODE::DECODED CONTENT:'.$content_Endode_decoded."\n";
print OUTPUTFILE 'ENCODE::DECODED CONTENT:'.$content_Endode_decoded."\n";

# try both!
print 'DOUBLE-DECODED CONTENT:'.$content_double_decoded."\n";
print OUTPUTFILE 'DOUBLE-DECODED CONTENT:'.$content_double_decoded."\n";

# check for #-digit character in the strings (to guard against the error coming in the print statement)
if ($content_not_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR\n";
    print OUTPUTFILE "AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR\n";
}
if ($content_ua_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR\n";
    print OUTPUTFILE "AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR\n";
}
if ($content_Endode_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR\n";
    print OUTPUTFILE "AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR\n";
}
if ($content_double_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR\n";
    print OUTPUTFILE "AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR\n";
}

close (OUTPUTFILE);
exit;
Here is the output on the terminal:
CONTENT TYPE: UTF-8
UNDECODED CONTENT::
Tr├à┬╛išt├ä┬¢
15
118 01 Praha 1 – Malá Strana
Czech Republic
DECODED CONTENT::
Tr┼╛išt─¢ 15
118 01 Praha 1 –
Malá Strana
Czech Republic
ENCODE::DECODED CONTENT::
Tr┼╛išt─¢ 15
118 01 Praha 1 –
Malá Strana
Czech Republic
DOUBLE-DECODED CONTENT::Tr┼╛išt─¢ 15
118 01 Praha 1 – Malá StranaCzech Republic
AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR
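And here is what goes to the file (note that it differs slightly from the terminal but is still not correct). OK, WOW - this displays correctly on Stack Overflow, but not in Bluefish, LibreOffice, Excel, Word or anything else on my computer. So the data is simply encoded incorrectly. I really don't know what is going on.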
CONTENT TYPE: UTF-8
UNDECODED CONTENT::
TržištÄ
15
118 01 Praha 1 – Malá Strana
Czech Republic
DECODED CONTENT::
Tržiště 15
118 01 Praha 1 –
Malá Strana
Czech Republic
ENCODE::DECODED CONTENT::
Tržiště 15
118 01 Praha 1 – Malá
Strana
Czech Republic
DOUBLE-DECODED CONTENT::Tržiště 15
118 01 Praha 1 – Malá StranaCzech Republic
AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR
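One possible reading of the two transcripts above, offered as an assumption rather than something the post confirms: the terminal may be showing UTF-8 bytes through a Windows console set to code page 437, and the UNDECODED lines may have been encoded twice because raw bytes were printed through the `:utf8` layer. The sketch below reproduces both patterns for the single character "ž" using only the core Encode module. The helper names `as_seen_on_cp437` and `double_encoded` are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: the characters a cp437 console would show for a
# string that was written to it as UTF-8 bytes (assumes code page 437).
sub as_seen_on_cp437 {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    return decode('cp437', $bytes);
}

# Hypothetical helper: model what happens when undecoded UTF-8 bytes are
# printed through a :utf8 layer. Each byte is treated as one Latin-1
# character, and the layer then encodes those characters a second time.
sub double_encoded {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    return decode('ISO-8859-1', $bytes);
}

my $z_caron = "\x{17E}";    # the "ž" in "Tržiště"

binmode STDOUT, ':encoding(UTF-8)';
print "decoded, on cp437 console:   ", as_seen_on_cp437($z_caron), "\n";
print "undecoded, on cp437 console: ",
    as_seen_on_cp437(double_encoded($z_caron)), "\n";
```

If that assumption holds, the first line matches the "┼╛" in the DECODED output on the terminal and the second matches the "├à┬╛" in the UNDECODED output, which would suggest the decoded strings are fine inside Perl and only the display differs.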
Any pointers on how to do this would be really appreciated.
Thanks,
Ian/Montecristo
Solution
use strictures;
use Web::Query 'wq';
use autodie qw(:all);

open my $output,'>:encoding(UTF-8)','/tmp/embassy-prague.txt';
print {$output} wq('http://prague.usembassy.gov/contact.html')->find('p')->first->html; # or perhaps ->text
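The snippet above opens the output file with an `:encoding(UTF-8)` layer. As a minimal sketch of the same idea with core modules only (decode once on input, encode once on output), the example below uses a hard-coded byte string as a stand-in for the HTTP body, so it runs without network access. The sample markup and the temporary file are assumptions made for illustration, not the real page content.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Stand-in for the raw HTTP body: UTF-8 bytes that are not yet decoded.
# (Assumed sample markup, not copied from the embassy page.)
my $raw_bytes = "<p><b>Address:</b> Tr\xC5\xBEi\xC5\xA1t\xC4\x9B 15</p>";

# Decode exactly once, turning bytes into Perl characters.
my $html = decode('UTF-8', $raw_bytes);

# Match on characters, not on bytes.
my ($street) = $html =~ m{<p><b>Address:</b>\s*(.*?)</p>};

# Encode exactly once on the way out, via the PerlIO layer.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh, ':encoding(UTF-8)';
print {$fh} $street;
close $fh or die "close failed: $!";

# Read the file back as raw bytes to see what was actually written.
open my $in, '<:raw', $path or die "open failed: $!";
my $written = do { local $/; <$in> };
close $in;

printf "%d characters, %d bytes on disk\n", length($street), length($written);
```

This prints "10 characters, 13 bytes on disk": the three accented letters each take two bytes in UTF-8, and the bytes written match the bytes that came in. With a live response, `$ua_response->decoded_content` would take the place of the manual `decode` call.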