Java正则表达式运行速度很慢

我试图在 Java中使用 Daring Fireball Regular Expression for matching URLs,并且我发现一个URL,导致评估永远.我修改了原来的正则表达式以使用Java语法.

private final static String pattern = 
"\\b" + 
"(" +                            // Capture 1: entire matched URL
  "(?:" +
    "[a-z][\\w-]+:" +                // URL protocol and colon
    "(?:" +
      "/{1,3}" +                        // 1-3 slashes
      "|" +                             //   or
      "[a-z0-9%]" +                     // Single letter or digit or '%'
                                        // (Trying not to match e.g. "URI::Escape")
    ")" +
    "|" +                            //   or
    "www\\d{0,3}[.]" +               // "www.","www1.","www2." … "www999."
    "|" +                            //   or
    "[a-z0-9.\\-]+[.][a-z]{2,4}/" +  // looks like domain name followed by a slash
  ")" +
  "(?:" +                           // One or more:
    "[^\\s()<>]+" +                      // Run of non-space,non-()<>
    "|" +                               //   or
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens,up to 2 levels
  ")+" +
  "(?:" +                           // End with:
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens,up to 2 levels
    "|" +                                   //   or
    "[^\\s`!\\-()\\[\\]{};:'\".,<>?«»“”‘’]" +        // not a space or one of these punct chars (updated to add a 'dash'
  ")" +
")";

// @see http://daringfireball.net/2010/07/improved_regex_for_matching_urls
private static final Pattern DARING_FIREBALL_PATTERN = Pattern.compile(pattern,Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

如果我尝试运行以下操作,它将永远存在.我把它缩小到了平衡括号的匹配(我想).如果你更改了括号内的文本,它的工作正常,但是大约15个字符,它开始慢下来.

final Matcher matcher = pattern.matcher("https://goo.gl/a(something_really_long_in_balanced_parens)");
boolean found = matcher.find();

有没有办法来改善这个正则表达式,以至于不会永远存在的线条？我在JUnit测试类中有大约100个不同的URL,我需要继续工作.

解决方法

问题在这里：

"(?:" +                           // One or more:
"[^\\s()<>]+" +                      // Run of non-space,non-()<>
"|" +                               //   or
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens,up to 2 levels
")+"

你在这里是嵌套量词.这与任何回溯算法的破坏 – 例如,考虑到正则表达式/ ^(a)$/匹配字符串

aaaaaaaaaab

作为第一次尝试,内部量词将匹配所有的.然后正则表达式失败,所以它退出了一个.然后外部量词组尝试再次匹配,吞下最后一个,然后正则表达式再次失败.我们基本上获得指数的行为,因为量词尝试各种各样的分裂方式,而不实际上取得任何进展.

解决方案是占有量词(我们通过定位到量词的末尾来表示) – 我们设置内部量词,以便一旦匹配,他们就不会让它走 – 他们会坚持到匹配失败或较早的量词后退,并且必须从字符串中的其他位置重新开始.如果我们使用/ ^(a)$/作为我们的正则表达式,我们将在上面的不匹配的字符串上立即失败,而不是指数地尝试匹配它.

尝试使这些内部量词占有,看看它是否有帮助.

Java正则表达式运行速度很慢

解决方法

猜你在找的Java相关文章