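In my Python application, I need to write a regular expression that matches a C for or while loop that has been terminated with a semicolon (;). For example, it should match this: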
for (int i = 0; i < 10; i++);
…but not this:
for (int i = 0; i < 10; i++)
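This looks trivial at first glance, until you realise that the text between the opening and closing parentheses may itself contain other parentheses, for example: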
for (int i = funcA(); i < funcB(); i++);
I'm using Python's re module. Right now my regular expression looks like this (I've left my comments in so you can understand it more easily):
# match any line that begins with a "for" or "while" statement:
^\s*(for|while)\s*
\(  # match the initial opening parenthesis
    # Now make a named group 'balanced' which matches a balanced substring.
    (?P<balanced>
        # A balanced substring is either something that is not a parenthesis:
        [^()]
        | # …or a parenthesised string:
        \( # A parenthesised string begins with an opening parenthesis
            (?P=balanced)* # …followed by a sequence of balanced substrings
        \) # …and ends with a closing parenthesis
    )*  # Look for a sequence of balanced substrings
\) # Finally, the outer closing parenthesis.
# must end with a semi-colon to match:
\s*;\s*
This works perfectly for all of the cases above, but it breaks as soon as the third part of the for loop contains a function call, like this:
for (int i = 0; i < 10; doSomethingTo(i));
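I think it breaks because as soon as you put some text between the opening and closing parentheses, the "balanced" group matches that contained text, and so the (?P=balanced) part no longer works: it won't match, because the text inside the nested parentheses is different from what the group captured.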
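To check that suspicion in isolation, here is a minimal sketch of how a named backreference behaves in the re module. The group name `word` and the sample strings are made up for illustration and are not part of the regex above.

```python
import re

# (?P=word) is a backreference: it matches the exact text that the group
# 'word' captured, not "anything the group's pattern could match".
pattern = re.compile(r"(?P<word>[a-z]+)-(?P=word)$")

same = pattern.match("abc-abc")       # the captured text "abc" appears again
different = pattern.match("abc-xyz")  # "xyz" fits [a-z]+ but is not "abc"

print(same is not None)
print(different is not None)
```

If the second string fails to match, that is consistent with the explanation above: the reference repeats captured text rather than re-applying the group's pattern recursively.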
In my Python code I'm using the VERBOSE and MULTILINE flags, and creating the regular expression like so:
import re

REGEX_STR = r"""# match any line that begins with a "for" or "while" statement:
^\s*(for|while)\s*
\(  # match the initial opening parenthesis
    # Now make a named group 'balanced' which matches
    # a balanced substring.
    (?P<balanced>
        # A balanced substring is either something that is not a parenthesis:
        [^()]
        | # …or a parenthesised string:
        \( # A parenthesised string begins with an opening parenthesis
            (?P=balanced)* # …followed by a sequence of balanced substrings
        \) # …and ends with a closing parenthesis
    )*  # Look for a sequence of balanced substrings
\) # Finally, the outer closing parenthesis.
# must end with a semi-colon to match:
\s*;\s*"""

REGEX_OBJ = re.compile(REGEX_STR, re.MULTILINE | re.VERBOSE)
Can anyone suggest an improvement to this regular expression? It's getting too complicated for me to get my head around.
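For comparison, one possible direction is to let the regex find only the loop keyword and opening parenthesis, then balance the parentheses with a plain depth counter. The sketch below is an assumption-laden illustration, not a drop-in replacement: the names `LOOP_START`, `TRAILING_SEMICOLON` and `is_semicolon_terminated` are hypothetical, it checks one statement at a time, and it ignores parentheses inside string literals or comments.

```python
import re

# Hypothetical names for illustration; only the `re` calls are library API.
LOOP_START = re.compile(r"\s*(?:for|while)\s*\(")
TRAILING_SEMICOLON = re.compile(r"\s*;")


def is_semicolon_terminated(statement):
    """Return True if a for/while header is directly followed by ';'."""
    m = LOOP_START.match(statement)
    if m is None:
        return False

    depth = 1        # the regex already consumed the opening "("
    pos = m.end()
    # Walk forward, tracking nesting depth, until the header closes.
    while pos < len(statement) and depth > 0:
        ch = statement[pos]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        pos += 1

    # depth == 0 means the header's parentheses balanced out just before pos.
    return depth == 0 and TRAILING_SEMICOLON.match(statement, pos) is not None


for line in (
    "for (int i = 0; i < 10; i++);",
    "for (int i = 0; i < 10; i++)",
    "for (int i = funcA(); i < funcB(); i++);",
    "for (int i = 0; i < 10; doSomethingTo(i));",
):
    print(is_semicolon_terminated(line), line)
```

Counting depth by hand sidesteps the need for a recursive pattern, at the cost of giving up a single-expression solution.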