regex – 为什么正则表达式有一个指数运行时间？

可以写一个正则表达式，在某些情况下需要指数运行时间。这样的例子是(aa | aa)*。如果有一个奇数的输入，因为它需要指数运行时间。

很容易测试这个。如果输入只包含as并且长度为51，则Regex需要几秒钟来计算(在我的机器上)。相反，如果输入长度为52，其计算时间不显着(我使用JavaRE的内置Regex解析器测试)。

我写了一个正则表达式解析器找到这种行为的原因，但我没有找到它。我的解析器可以基于正则表达式构建一个AST或NFA。之后，它可以翻译NFA到DFA.要做到这一点，它使用powerset construction algorithm。

当我解析上述Rgex时，解析器创建一个具有7个状态的NFA – 转换后，DFA中只剩下3个状态。 DFA表示更合理的Regex(aa)*，它可以非常快速地解析。

因此，我不明白为什么有解析器可以这么慢。这是什么原因？他们不是将NFA翻译成DFA吗？如果是，为什么不呢？他们计算速度这么慢的技术原因是什么？

Russ Cox has a very detailed article about why this is and the history of regexes( part 2， part 3)。

Regular expression matching can be simple and fast,using finite automata-based techniques that have been known for decades. In contrast,Perl,PCRE,Python,Ruby,Java,and many other languages have regular expression implementations based on recursive backtracking that are simple but can be excruciatingly slow. With the exception of backreferences,the features provided by the slow backtracking implementations can be provided by the automata-based implementations at dramatically faster,more consistent speeds.

大多数情况下，归结为“常规”表达式中的非正则特征的泛滥，例如反向引用，以及大多数程序员的(继续)无知，对于不包含这样的特征的正则表达式(其中许多是更好的替代) 。

While writing the text editor sam in the early 1980s,Rob Pike wrote a new regular expression implementation,which Dave Presotto extracted into a library that appeared in the Eighth Edition. Pike’s implementation incorporated submatch tracking into an efficient NFA simulation but,like the rest of the Eighth Edition source,was not widely distributed. Pike himself did not realize that his technique was anything new. Henry Spencer reimplemented the Eighth Edition library interface from scratch,but using backtracking,and released his implementation into the public domain. It became very widely used,eventually serving as the basis for the slow regular expression implementations mentioned earlier: Perl,and so on. (In his defense,Spencer knew the routines could be slow,and he didn’t know that a more efficient algorithm existed. He even warned in the documentation,“Many users have found the speed perfectly adequate,although replacing the insides of egrep with this code would be a mistake.”) Pike’s regular expression implementation,extended to support Unicode,was made freely available with sam in late 1992,but the particularly efficient regular expression search algorithm went unnoticed.

regex – 为什么正则表达式有一个指数运行时间？

猜你在找的正则表达式相关文章