I'm trying to create a regular expression to capture in-text citations.
Here are a few example sentences containing in-text citations:
… and the reported results in (Nivre et al.,2007) were not representative …
… two systems used a Markov chain approach (Sagae and Tsujii 2007).
Nivre (2007) showed that …
… for attaching and labeling dependencies (Chen et al.,2007; Dredze et al.,2007).
Currently, my regex is
\(\D*\d\d\d\d\)
which matches examples 1-3 but not example 4. How can I modify it to capture example 4 as well?
Thanks!
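To see why the pattern misses example 4, here is a minimal sketch in Python (chosen only for illustration, since the question does not name a language). The pattern requires `)` immediately after the first four digits, but in example 4 a `;` follows them. One possible minimal change, shown as `extended`, is to allow repeated `;`-separated groups before the closing parenthesis. The names `examples`, `original`, and `extended` are assumptions made for this sketch.

```python
import re

# The four example sentences from the question.
examples = [
    "... and the reported results in (Nivre et al.,2007) were not representative ...",
    "... two systems used a Markov chain approach (Sagae and Tsujii 2007).",
    "Nivre (2007) showed that ...",
    "... for attaching and labeling dependencies (Chen et al.,2007; Dredze et al.,2007).",
]

# The pattern from the question: "(" + non-digits + four digits + ")".
original = re.compile(r"\(\D*\d\d\d\d\)")

# Assumed minimal extension: allow further ";"-separated
# "non-digits + four digits" groups before the closing ")".
extended = re.compile(r"\(\D*\d{4}(?:;\D*\d{4})*\)")

for text in examples:
    print(bool(original.search(text)), extended.findall(text))
```

Note that `\D*` is very permissive: it also runs across unrelated text and other parentheses, so this version can over-match on real prose. A stricter pattern that constrains the author list and the year is usually the safer choice.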
I recently used something like this for that purpose:
#!/usr/bin/env perl
use 5.010;
use utf8;
use strict;
use autodie;
use warnings qw< FATAL all >;
use open qw< :std IO :utf8 >;

my $citation_rx = qr{
    \( (?:
        \s*

        # optional author list
        (?:
            # has to start capitalized
            \p{Uppercase_Letter}
            # then have a lower case letter, or maybe an apostrophe
            (?= [\p{Lowercase_Letter}\p{Quotation_Mark}] )
            # before a run of letters and admissible punctuation
            [\p{Alphabetic}\p{Dash_Punctuation}\p{Quotation_Mark}\s,.] +
        ) ?  # hook if and only if you want the authors to be optional!!

        # a reasonable year
        \b (18|19|20) \d\d
        # citation series suffix, up to a six-parter
        [a-f] ? \b

        # trailing semicolon to separate multiple citations
        ; ?
        \s*
    ) +
    \)
}x;

while (<DATA>) {
    while (/$citation_rx/gp) {
        say ${^MATCH};
    }
}

__END__
... and the reported results in (Nivré et al.,2007) were not representative ...
... two systems used a Markov chain approach (Sagae and Tsujii 2007).
Nivre (2007) showed that ...
... for attaching and labelling dependencies (Chen et al.,2007; Dredze et al.,2007).
When run, it produces:
(Nivré et al.,2007)
(Sagae and Tsujii 2007)
(2007)
(Chen et al.,2007; Dredze et al.,2007)
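For readers who are not working in Perl, the same idea can be sketched with Python's standard `re` module. This is an approximation, not an exact port: `re` has no `\p{...}` Unicode properties, so the sketch assumes an ASCII capital `[A-Z]` for the first letter, `[a-z']` for the lookahead, and `[^\W\d_]` as a stand-in for "any letter". The names `CITATION_RX`, `find_citations`, and `lines` are hypothetical.

```python
import re

# Approximate port of the Perl pattern above (see stated assumptions).
CITATION_RX = re.compile(
    r"""
    \(
    (?:
        \s*
        (?:                              # optional author list
            [A-Z]                        # starts with a capital (ASCII only here)
            (?= [a-z'] )                 # then a lower-case letter or apostrophe
            (?: [^\W\d_] | [-'\s,.] )+   # letters and admissible punctuation
        )?
        \b (?:18|19|20) \d\d             # a plausible year
        [a-f]? \b                        # optional series suffix, e.g. 2007a
        ;? \s*                           # optional separator between citations
    )+
    \)
    """,
    re.VERBOSE,
)


def find_citations(text):
    """Return every parenthesised citation found in text."""
    return [m.group(0) for m in CITATION_RX.finditer(text)]


lines = [
    "... and the reported results in (Nivré et al.,2007) were not representative ...",
    "... two systems used a Markov chain approach (Sagae and Tsujii 2007).",
    "Nivre (2007) showed that ...",
    "... for attaching and labelling dependencies (Chen et al.,2007; Dredze et al.,2007).",
]

for line in lines:
    for citation in find_citations(line):
        print(citation)
```

If full Unicode property support matters (for example, author names beginning with a non-ASCII capital), the ASCII classes in this sketch would need to be widened.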