For example, match "Nation" in "Îñţérñåţîöñåļîžåţîöñ" without any additional modules. Is this possible in newer Perl versions (5.14, 5.15, etc.)?
I found an answer! Thanks to tchrist
The right solution matches with the UCA (thanks to http://stackoverflow.com/users/471272/tchrist).
    # Find start/end offsets of every occurrence of the matched substring
    # (non-overlapping)
    use 5.014;
    use strict;
    use warnings;
    use utf8;
    use Unicode::Collate;

    binmode STDOUT, ':encoding(UTF-8)';

    my $str  = "Îñţérñåţîöñåļîžåţîöñ" x 2;
    my $look = "Nation";

    # level => 1 compares base letters only, so accents and case are ignored;
    # match() requires normalization to be switched off
    my $Collator = Unicode::Collate->new(
        normalization => undef,
        level         => 1,
    );

    my @match = $Collator->match($str, $look);
    if (@match) {
        my $found = $match[0];
        my $f_len = length($found);
        say "match result: $found (length is $f_len)";

        my $offset = 0;
        while ((my $start = index($str, $found, $offset)) != -1) {
            my $end = $start + $f_len;
            say sprintf("found at: %s,%s", $start, $end);
            $offset = $end;    # resume right after this match ($end + 1 would skip an adjacent one)
        }
    }
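One limitation of the snippet above: the core `index` only finds later occurrences that are spelled exactly like the first match. A minimal sketch of an alternative, assuming the collator's own `index` method is used for every occurrence (in list context it returns the position and the length of the match). The helper name `find_all` is made up for illustration and is not part of Unicode::Collate.

```perl
use 5.014;
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: return [start, end] pairs for every non-overlapping
# match of $look in $str, compared at the collator's strength.
sub find_all {
    my ($collator, $str, $look) = @_;
    my @spans;
    my $offset = 0;
    # In list context index() returns (position, length), or an empty list
    # when nothing matches from $offset onwards.
    while (my ($pos, $len) = $collator->index($str, $look, $offset)) {
        push @spans, [$pos, $pos + $len];
        $offset = $pos + ($len || 1);    # always move forward
    }
    return @spans;
}

my $str  = "Îñţérñåţîöñåļîžåţîöñ" x 2;
my $look = "Nation";

# The searching methods croak unless normalization is switched off.
my $Collator = Unicode::Collate->new(
    normalization => undef,
    level         => 1,
);

for my $span (find_all($Collator, $str, $look)) {
    my ($start, $end) = @$span;
    say "found at: $start,$end (", substr($str, $start, $end - $start), ")";
}
```

Because each occurrence is located by the collator itself, a plain "nation" and an accented "ñåţîöñ" in the same string should both be reported.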
A wrong (but working) solution, from http://www.perlmonks.org/?node_id=485681
Magic piece of code is:
    $str = Unicode::Normalize::NFD($str);
    $str =~ s/\pM//g;
code example:
    use 5.014;
    use utf8;
    use Unicode::Normalize;

    binmode STDOUT, ':encoding(UTF-8)';

    my $str  = "Îñţérñåţîöñåļîžåţîöñ";
    my $look = "Nation";

    say "before: $str\n";
    $str = NFD($str);
    # M is short alias for \p{Mark} (http://perldoc.perl.org/perluniprops.html)
    $str =~ s/\pM//g;    # remove "marks"
    say "after: $str";
    say "is_match: ", $str =~ /$look/i || 0;
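One likely reason this approach counts as "wrong": stripping marks only helps for letters that have a canonical decomposition. A small sketch, assuming a made-up helper `strip_marks` that wraps the two magic lines; letters such as "ø" (U+00F8) have no decomposition, so they pass through unchanged and will not match a plain "o".

```perl
use 5.014;
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: decompose, then drop every combining mark.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\pM//g;
    return $decomposed;
}

for my $word ("Îñţérñåţîöñåļîžåţîöñ", "søster") {
    my $plain = strip_marks($word);
    say "$word -> $plain";
}

# "ø" is a letter of its own, not "o" plus a combining mark,
# so this pattern does not match after stripping.
say "søster matches /soster/: ", (strip_marks("søster") =~ /soster/ ? 1 : 0);
```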
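What do you mean by "without additional modules"?
Here is a solution using Unicode::Normalize (see the Perl documentation).
I removed the "ţ" and "ļ" from your string, because my Eclipse refused to save the script with them.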
    use strict;
    use warnings;
    use UTF8;    # NB: the pragma is spelled "utf8" (lowercase); this spelling does not enable it
    use Unicode::Normalize;

    my $str = "Îñtérñåtîöñålîžåtîöñ";

    for ( $str ) {    # the variable we work on
        ## convert to Unicode first
        ## if your data comes in Latin-1, then uncomment
        ## (this also needs "use Encode;" at the top):
        #$_ = Encode::decode( 'iso-8859-1', $_ );
        $_ = NFD( $_ );      ## decompose
        s/\pM//g;            ## strip combining characters
        s/[^\0-\x80]//g;     ## clear everything else
    }

    if ($str =~ /nation/) {
        print $str . "\n";
    }
The output is
Internationaliation
The "ž" is removed from the string; it does not seem to be treated as a composed character.
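For reference, a quick check of how "ž" behaves once it really is a single decoded character (U+017E). This is only a sketch: the helper `codepoints` is made up for illustration, and the character is written as an escape so the result does not depend on how the script file is saved. If the letter disappears in a real script, the source encoding is worth checking first.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: list the code points of a string as U+XXXX.
sub codepoints {
    my ($text) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $text;
}

my $z_caron = "\x{17E}";    # LATIN SMALL LETTER Z WITH CARON

print "composed:   ", codepoints($z_caron), "\n";
print "decomposed: ", codepoints(NFD($z_caron)), "\n";

# After stripping marks only the base letter should remain.
(my $stripped = NFD($z_caron)) =~ s/\pM//g;
print "stripped:   ", codepoints($stripped), "\n";
```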
The code in the for loop is taken from "How to remove diacritic marks from characters".
Another interesting read is Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
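Update:
As @tchrist pointed out, there is an algorithm for this called the UCA (Unicode Collation Algorithm). @nordicdyno has already provided an implementation in his question.
The algorithm is described here: Unicode Technical Standard #10, Unicode Collation Algorithm
The Perl module is described here: perldoc.perl.org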
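To make the role of the `level` parameter a bit more concrete, here is a minimal sketch comparing two pairs of strings at collation levels 1 to 3 with the module's `eq` method. It assumes the default collation table that ships with Unicode::Collate; the variable names are just for illustration.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Level 1: base letters only. Level 2: also accents. Level 3: also case.
my $primary   = Unicode::Collate->new(level => 1);
my $secondary = Unicode::Collate->new(level => 2);
my $tertiary  = Unicode::Collate->new(level => 3);

for my $pair (["nation", "ñåţîöñ"], ["nation", "NATION"]) {
    my ($left, $right) = @$pair;
    printf "%s vs %s: level1=%d level2=%d level3=%d\n", $left, $right,
        map { $_->eq($left, $right) ? 1 : 0 } $primary, $secondary, $tertiary;
}
```

Level 1 is what makes "Nation" match the accented text in the question, since both accent and case differences are ignored at that strength.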