我有一个巨大的文件aab.txt,其内容是aaa … aab.
令我惊讶的是
perl -ne '/a*bb/' < aab.txt
比运行(匹配失败)更快
perl -ne '/a*b/' < aab.txt
(匹配成功).为什么????两者都应首先吞噬所有的a,然后第二个立即成功,而第一个将不得不一遍又一遍地回溯,失败.
Perl正则表达式被优化为尽可能早地失败,而不是尽可能快地成功.在浏览大型日志文件时,这很有意义.
有一个优化,首先查找字符串的常量部分,在这种情况下,是“浮动”b或bb.这可以相当有效地检查,而不必跟踪回溯状态.没有找到bb,比赛就在那里中止.
b不是这样.找到浮动子字符串,并从那里构造匹配.这是正则表达式匹配的调试输出(程序是“aaab”=〜/ a * b /):
Compiling REx "a*b" synthetic stclass "ANYOF_SYNTHETIC[ab][]". Final program: 1: STAR (4) 2: EXACT <a> (0) 4: EXACT <b> (6) 6: END (0) floating "b" at 0..2147483647 (checking floating) stclass ANYOF_SYNTHETIC[ab][] minlen 1 Guessing start of match in sv for REx "a*b" against "aaab" Found floating substr "b" at offset 3... start_shift: 0 check_at: 3 s: 0 endpos: 4 checked_upto: 0 Does not contradict STCLASS... Guessed: match at offset 0 Matching REx "a*b" against "aaab" Matching stclass ANYOF_SYNTHETIC[ab][] against "aaab" (4 bytes) 0 <> <aaab> | 1:STAR(4) EXACT <a> can match 3 times out of 2147483647... 3 <aaa> <b> | 4: EXACT <b>(6) 4 <aaab> <> | 6: END(0) Match successful! Freeing REx: "a*b"
您可以使用re pragma的debug选项获得此类输出.
严格来说,查找b或bb是不必要的,但它允许匹配更早失败.