Perl vs Javascript正则表达式

为什么以下正则表达式(通过捕获组)在 Javascript中捕获字符串’abc’,但在PCRE中不捕获(尽管它仍然匹配)？

(.*)*

解决方法

这就是PCRE中捕获组为空的原因：

>初始状态

(.*)*     abc
 ^        ^

>首先,(.*)部分与abc匹配,输入位置前进到结尾.捕获组此时包含abc.

(.*)*     abc
    ^        ^

>现在,输入位置在c字符之后,其余输入是空字符串. Kleene星开始第二次尝试匹配(.*)组：

(.*)*     abc
 ^           ^

>(.*)组匹配abc后的空字符串.由于匹配,先前捕获的字符串将被覆盖.
>由于输入位置没有前进,*结束迭代,匹配成功.

JS和PCRE之间的行为差异是由于指定了正则表达式引擎的方式. PCRE的行为与Perl一致：

PCRE：

$pcretest
PCRE version 8.39 2016-06-14

  re> /(.*)*/
data> abc
 0: abc
 1:

Perl的：

$perl -e '"abc" =~ /(.*)*/; print "<$&> <$1>\n";'
<abc> <>

设compare this with .NET,它具有相同的行为,但支持多次捕获：

当捕获组第二次匹配时,.NET会将捕获的值添加到捕获堆栈. Perl和PCRE将简单地覆盖它.

至于JavaScript：

这里是ECMA-262§21.2.2.5.1运行时语义：RepeatMatcher抽象操作：

The abstract operation RepeatMatcher takes eight parameters,a Matcher m@H_502_48@,an integer min@H_502_48@,an integer (or ∞) max@H_502_48@,a Boolean greedy@H_502_48@,a State x@H_502_48@,a Continuation c@H_502_48@,an integer parenIndex@H_502_48@,and an integer parenCount@H_502_48@,and performs the following steps:
If max@H_502_48@ is zero,return c(x)@H_502_48@. Create an internal Continuation closure d@H_502_48@ that takes one State argument y@H_502_48@ and performs the following steps when evaluated: a. If min@H_502_48@ is zero and y@H_502_48@‘s endIndex@H_502_48@ is equal to x@H_502_48@‘s endIndex@H_502_48@,return failure@H_502_48@. b. If min@H_502_48@ is zero,let min2@H_502_48@ be zero; otherwise let min2@H_502_48@ be min‑1@H_502_48@. c. If max@H_502_48@ is ∞,let max2@H_502_48@ be ∞; otherwise let max2@H_502_48@ be max‑1@H_502_48@. d. Call RepeatMatcher(m,min2,max2,greedy,y,c,parenIndex,parenCount)@H_502_48@ and return its result. Let cap@H_502_48@ be a fresh copy of x@H_502_48@‘s captures List. For every integer k@H_502_48@ that satisfies parenIndex < k@H_502_48@ and k ≤ parenIndex+parenCount@H_502_48@,set cap[k]@H_502_48@ to undefined@H_502_48@. Let e@H_502_48@ be x@H_502_48@‘s endIndex. Let xr@H_502_48@ be the State (e,cap)@H_502_48@. If min@H_502_48@ is not zero,return m(xr,d)@H_502_48@. If greedy@H_502_48@ is false@H_502_48@,then a. Call c(x)@H_502_48@ and let z@H_502_48@ be its result. b. If z@H_502_48@ is not failure@H_502_48@,return z@H_502_48@. c. Call m(xr,d)@H_502_48@ and return its result. Call m(xr,d)@H_502_48@ and let z@H_502_48@ be its result. If z@H_502_48@ is not failure@H_502_48@,return z@H_502_48@. Call c(x)@H_502_48@ and return its result.

 
 这基本上是对量词进行评估时应该发生什么的定义. RepeatMatcher是处理内部操作m的匹配的操作. 
 你还需要了解一个国家是什么(§21.2.2.1,强调我的)： 
  
  
  A State is an ordered pair (endIndex@H_502_48@,captures@H_502_48@) where endIndex@H_502_48@ is an integer and captures is a List of NcapturingParens@H_502_48@ values. States are used to represent partial match states in the regular expression matching algorithms. The endIndex@H_502_48@ is one plus the index of the last input character matched so far by the pattern,while captures@H_502_48@ holds the results of capturing parentheses. The n@H_502_48@th element of captures@H_502_48@ is either a List that represents the value obtained by the n@H_502_48@th set of capturing parentheses or undefined if the n@H_502_48@th set of capturing parentheses hasn’t been reached yet. Due to backtracking,many
 States may be in use at any time during the matching process.
 
 
 
 对于您的示例,RepeatMatcher参数是： 
 > m：负责处理子模式的匹配器(.*)
> min：0(Kleene星形量词的最小匹配数)
> max：∞(Kleene星形量词的最大匹配数)
>贪婪：真实(使用的Kleene星形量词是贪婪的)
> x：(0,[undefined])(参见上面的状态定义)
> c：延续,此时它将是：在折叠父规则后,一直将其State参数作为成功的MatchResult返回的Continuation
> parenIndex：0(根据§21.2.2.5,这是在左边的整个正则表达式中左捕获括号的数量)
生产)
> parenCount：1(相同的spec段落,这是该生产的Atom扩展中左侧捕获括号的数量 – 我不会在此处粘贴完整的规范,但这基本上意味着m定义了一个捕获组) 
 规范中的正则表达式匹配算法是根据continuation-passing style定义的.基本上,这意味着c操作意味着接下来应该发生什么. 
 让我们展开这个算法. 
 第一次迭代 
 在第一次传递时,x1状态为(0,[undefined]). 
 > max不为零
>我们在此时创建了延续闭包d1,它将在第二遍中使用,所以我们稍后会再回到这一点.
>制作捕获列表的副本cap1
>将cap1中的捕捉重置为未定义,这是第一遍中的无操作
>设e1 = 0
>设xr1 =(e1,cap1)
> min为零,跳过此步骤
>贪婪是真的,跳过这一步
>设z1 = m(xr,d1) – 这会计运算符模式(.*) 
 现在让我们退一步 – m将匹配(.*)对抗abc,然后调用我们的d1闭包,所以让我们展开那个. 
 用状态y1 =(3,[“abc”])评估d1： 
 > min为0,但y1的endIndex为3,x1的endIndex为0,所以不要返回失败
>设min 2 = min = 0,因为min = 0
>设max2 = max =∞,因为max =∞
>调用RepeatMatcher(m,parenCount)并返回其结果.那就是：RepeatMatcher(m,∞,false,y1,1) 
 第二次迭代 
 所以,现在我们要进行第二次迭代,x2 = y1 =(3,[“abc”]). 
 > max不为零
>此时我们创建了延续闭包d2
>制作捕获列表[“abc”]的副本cap2
>将cap2中的捕获重置为undefined,我们得到cap2 = [undefined]
>设e2 = 3
>设xr2 =(e2,cap2)
> min为零,跳过这一步
>设z2 = m(xr2,d2) – 这会评估子模式(.*) 
 这次m将匹配abc之后的空字符串,并用那个调用我们的d2闭包.让我们来评估d2的作用.它的参数是y2 =(3,[“”]) 
 min仍为0,但y2的endIndex为3,x2的endIndex也为3(请记住,此时x是前一次迭代的y状态),闭包只会返回失败.
> z2失败,跳过此步骤
>在此迭代中返回c(x2),即c((3,[“abc”])). 
 c只是返回一个有效的MatchResult,因为我们在模式的末尾.这意味着d1返回此结果,并且第一次迭代返回从步骤10传递它. 
 基本上,正如您所看到的,导致JS行为与PCRE不同的规范行如下： 
  
  
  a. If min@H_502_48@ is zero and y@H_502_48@‘s endIndex@H_502_48@ is equal to x@H_502_48@‘s endIndex@H_502_48@,return failure@H_502_48@.
 
 
 
 结合使用时： 
  
  
   
   Call c(x)@H_502_48@ and return its result.
 
  
 
 
 
 如果迭代失败,则返回先前捕获的值.

Perl vs Javascript正则表达式

解决方法

猜你在找的Perl相关文章