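It took me a long time to work out how to get the correct output, and I would like to know whether my solution is the right way to do it, because it seems overly complicated for Perl compared with other languages. In particular:
> Why does Perl produce a valid big-endian UTF-16 file with a correct BOM when I use encoding(UTF-16), while there is no BOM if I use either UTF-16LE or UTF-16BE without the additional package File::BOM?
> Why does the out-of-the-box CRLF handling seem buggy (it outputs 0D 0A 00 instead of 0D 00 0A 00) unless I go through some hassle with the filters? I suspect this might be a real bug, which surprises me for a language with so many users...
Here are my commented attempts; the one I found to be correct is the last statement.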
    use strict;
    use warnings;
    use utf8;
    use File::BOM;
    use feature 'say';

    my $UTF;
    my $data = "Hello,héhé,中文.\nsecond line : my 2€";
    # 中文 = zhong wen = chinese

    # UTF16 BE + BOM but incorrect CRLF: "0D 0A 00" instead of "0D 00 0A 00"
    open $UTF, ">:encoding(UTF-16)", "utf-16-std-be.txt" or die $!;
    say $UTF $data;
    close $UTF;

    # same as UTF-16BE (no BOM, incorrect CRLF)
    open $UTF, ">:encoding(ucs2)", "utf-ucs2.txt" or die $!;
    say $UTF $data;
    close $UTF;

    # UTF16 BE, no BOM, incorrect CRLF
    open $UTF, ">:encoding(UTF-16BE)", "utf-16-be-nobom.txt" or die $!;
    say $UTF $data;
    close $UTF;

    # UTF16 LE
    open $UTF, ">:encoding(UTF-16LE)", "utf-16-le-nobom-wrongcrlf.txt" or die $!;
    say $UTF $data;
    close $UTF;

    # UTF16 LE, BOM OK but still incorrect CRLF
    open $UTF, ">:encoding(UTF-16LE):via(File::BOM)", "utf-16-le-bom-wrongcrlf.txt" or die $!;
    say $UTF $data;
    close $UTF;

    # UTF16 LE non raw incorrect
    # (crlf by default on windows) -> 0A => 0D 0A
    open $UTF,"utf-16-le-bom-wrongcrlf2.txt" or die $!;
    print $UTF $data, "\x0a"; # 0A is magically expanded to 0D 0A but wrong
    close $UTF;

    # UTF16 LE + BOM + LF
    # raw -> 0A => 0A
    # could be correct on UNIX but I need CRLF
    open $UTF, ">:raw:encoding(UTF-16LE):via(File::BOM)", "utf-16-le-bom-wrongcrlf3.txt" or die $!;
    say $UTF $data;
    close $UTF;

    # manual BOM, but CRLF OK
    open $UTF, ">:raw:encoding(UTF-16LE):crlf", "utf-16-le-bommanual-crlfok.txt" or die $!;
    print $UTF "\x{FEFF}";
    say $UTF $data;
    close $UTF;

    # auto BOM, CRLF OK ?
    # incorrect, says utf8 "\xA9" does not map to Unicode at c:/perl/Dwimperl-5.14/perl/lib/Encode.pm line 176.
    # But I cannot see where the A9 comes from ??!
    #~ open $UTF, ">:raw:encoding(UTF-16LE):via(File::BOM):crlf", "utf-16-le-autobom-crlfok1.txt" or die $!;
    #~ print $UTF $data;
    #~ say $UTF $data;
    #~ close $UTF;

    # WTF? \n becomes 0D 00 0D 0A 00
    open $UTF, ">:encoding(UTF-16LE):crlf:via(File::BOM)", "utf-16-le-autobom-crlf2.txt" or die $!;
    say $UTF $data;
    close $UTF;

    # CORRECT WAY?? : Automatic BOM, CRLF is OK
    open $UTF, ">:raw:encoding(UTF-16LE):crlf:via(File::BOM)", "utf-16-le-autobom-crlfok3.txt" or die $!;
    say $UTF $data;
    close $UTF;
Solution
manual BOM, but CRLF OK
Yes, the following is indeed correct:
:raw:encoding(UTF-16LE):crlf + manual BOM
> :raw "clears" the existing :crlf and :encoding layers.
> :encoding converts between bytes and code points.
> :crlf converts between CRLF and LF.
So,
                            Read
    ===================================================>

                                   Code                 Code
    +------+   bytes   +------+   Points   +-------+   Points   +------+
    | File |-----------| :enc |------------| :crlf |------------| Code |
    +------+           +------+    CRLF    +-------+     LF     +------+

    <===================================================
                            Write
You want the CRLF⇔LF conversion to be performed on code points (not on bytes), which is what this setup does.
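To make the effect of layer order concrete, here is a minimal sketch that writes the same short text through two layer stacks and dumps the resulting bytes as hex. The helper name `hex_after_write` and the use of a temporary file are assumptions made for illustration; they are not part of the original post.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text to a temporary file through the given
# layer string and return the file content as a lowercase hex string.
sub hex_after_write {
    my ($layers, $text) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;

    open(my $out, ">$layers", $path) or die "Cannot write $path: $!";
    print {$out} $text;
    close($out) or die "Cannot close $path: $!";

    # Read the file back without any translation.
    open(my $in, '<:raw', $path) or die "Cannot read $path: $!";
    local $/;
    my $bytes = <$in>;
    close $in;
    return unpack('H*', $bytes);
}

# :crlf sits above :encoding, so LF becomes CR LF before it is encoded.
my $good = hex_after_write(':raw:encoding(UTF-16LE):crlf', "\x{FEFF}a\n");

# :crlf sits below :encoding, so the single byte 0A is expanded after encoding.
my $bad = hex_after_write(':raw:crlf:encoding(UTF-16LE)', "a\n");

print "CRLF on code points: $good\n";
print "CRLF on bytes:       $bad\n";
```

The second stack should reproduce the 0D 0A 00 sequence from the question on any platform, because the line-ending translation then runs on already encoded bytes.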
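CORRECT WAY?? : Automatic BOM, CRLF is OK
While :raw:encoding(UTF-16LE):crlf:via(File::BOM) might work for a write handle, it doesn't look right to me (I would have expected :raw:via(File::BOM,UTF-16LE):crlf), and it fails miserably for a read handle (at least for me with Perl 5.16.3).
I just had a look, and the code behind :via(File::BOM) does some very questionable things. I wouldn't use it.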
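For the read direction, the same layer stack can be used without File::BOM by stripping the BOM by hand. The following is only a sketch under that assumption; `read_utf16le_lines` is a hypothetical helper name, and the sample file is generated on the spot so the example is self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a UTF-16LE file with CRLF line endings and
# return its lines, chomped, with a leading BOM (U+FEFF) removed if present.
sub read_utf16le_lines {
    my ($path) = @_;
    open(my $in, '<:raw:encoding(UTF-16LE):crlf', $path)
        or die "Cannot read $path: $!";
    my @lines = <$in>;
    close $in;
    chomp @lines;
    $lines[0] =~ s/\A\x{FEFF}// if @lines;
    return @lines;
}

# Build a small sample file byte by byte: BOM, "a", CR LF, "b", CR LF.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;    # write the sample bytes untouched
print {$tmp} "\xFF\xFE" . "a\0\r\0\n\0" . "b\0\r\0\n\0";
close $tmp;

my @lines = read_utf16le_lines($path);
print scalar(@lines), " line(s) read\n";
```

Removing the BOM after decoding keeps the layer list identical for reading and writing, which matches the diagram above.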
why Perl is making a valid UTF-16 big-endian with correct BOM with encoding(UTF-16) while there is no BOM if I use either UTF-16LE or UTF-16BE without using an additional package File::BOM
Because you probably don't need a BOM.
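The difference can be seen without any file handling by calling `Encode::encode` directly. This sketch relies on the behaviour documented for Encode::Unicode: the plain `UTF-16` encoding does not name a byte order, so the encoder prepends a BOM to tell readers which one it chose, whereas `UTF-16BE` and `UTF-16LE` already name the byte order and get no BOM. The helper name `encoded_hex` is an assumption for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the encoded form of $text as lowercase hex.
sub encoded_hex {
    my ($encoding, $text) = @_;
    return unpack('H*', encode($encoding, $text));
}

# Print the bytes produced for the single character "a" by each encoding.
for my $encoding (qw(UTF-16 UTF-16BE UTF-16LE)) {
    printf "%-8s %s\n", $encoding, encoded_hex($encoding, 'a');
}
```

If a BOM is wanted together with an explicit byte order, printing "\x{FEFF}" first, as in the accepted variant above, is the straightforward way to get it.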
why out-of-the-box the CRLF handling seems buggy
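Adding a layer appends it to the end of the list. If you want a layer somewhere else in the list (as is the case here), you need to rebuild the list.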
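The appending behaviour can be observed with `PerlIO::get_layers`. The sketch below is an illustration only: `layer_names` and `demo_layer_orders` are hypothetical helpers, and the exact layer names printed depend on the platform and Perl build.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: list the layers on $fh, leaving out the "utf8"
# entries that PerlIO::get_layers reports for layers with the UTF-8 flag.
sub layer_names {
    my ($fh) = @_;
    return grep { $_ ne 'utf8' } PerlIO::get_layers($fh);
}

# Return two array references: the layers after appending, and the layers
# after rebuilding the list with :raw.
sub demo_layer_orders {
    my ($fh, $path) = tempfile(UNLINK => 1);

    # Each binmode call pushes its layer on top of what is already there,
    # so :crlf ends up below :encoding and works on bytes.
    binmode($fh, ':crlf');
    binmode($fh, ':encoding(UTF-16LE)');
    my @appended = layer_names($fh);

    # Starting with :raw clears the text layers, so the list can be rebuilt
    # with :encoding below :crlf.
    binmode($fh, ':raw:encoding(UTF-16LE):crlf');
    my @rebuilt = layer_names($fh);

    close $fh;
    return (\@appended, \@rebuilt);
}

my ($appended, $rebuilt) = demo_layer_orders();
print "appended: @$appended\n";
print "rebuilt:  @$rebuilt\n";
```

This is why the working open call in the question starts its layer string with :raw instead of just adding :encoding to the defaults.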
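It has been suggested on Perl's development list that byte layers (e.g. :unix) and text layers (e.g. :crlf) should be distinguished, and that adding a byte or encoding layer should dig down and place it at the appropriate position in the list. But no one has acted on this yet.
Besides simplifying code like the above, it would also allow UTF-16* [1] encoding layers to be added to STDIN/STDOUT/STDERR (or other existing handles). I believe that is currently not possible.
> [1] Technically, any encoding in which CR != 13 or LF != 10 has this problem, so EBCDIC would be affected too.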