I am designing a regular expression to split out all the actual words in a given text:
Example input:
"John's mom went there,but he wasn't there. So she said: 'Where are you'"
Expected output:
["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"]
I came up with this regex:
"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"
After splitting in Python, the result contains None items and spaces.
How do I get rid of the None items? And why are the spaces not matched?
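A minimal sketch of where the None items and the spaces come from, assuming Python's standard re module: re.split also returns the text of every capturing group in the pattern, so each separator is echoed back in the result, and any group that did not take part in a match shows up as None. The variable names are illustrative only.

```python
import re

s = "John's mom went there,but he wasn't there. So she said: 'Where are you'"

# Original pattern: four capturing groups, so re.split returns their text too.
capturing = r"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"
parts = re.split(capturing, s)
print(parts[:6])  # the first word, then group values for the first separator

# Same alternatives with non-capturing groups (?:...): separators are dropped.
non_capturing = r"(?:[^a-zA-Z]+'|'[^a-zA-Z]+)|[^a-zA-Z']+"
words = [w for w in re.split(non_capturing, s) if w]  # drop empty strings, if any
print(words)
```

With the non-capturing version the None items and the stray spaces disappear, although the closing quote after you is still attached to the last word, because no alternative matches a lone apostrophe at the end of the string.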
Edit:
Splitting on spaces gives items like: ["there."]
Splitting on non-letters gives items like: ["John", "s"]
Splitting on non-letters except ' gives items like: ["'Where", "you'"]
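The three attempts above can be reproduced with a short sketch (the variable names are made up for illustration; the character classes are the obvious ones for each description, not necessarily the exact patterns that were tried):

```python
import re

s = "John's mom went there,but he wasn't there. So she said: 'Where are you'"

on_spaces = s.split()                           # punctuation stays attached
on_non_letters = re.split(r"[^a-zA-Z]+", s)     # apostrophes break words apart
keep_apostrophes = re.split(r"[^a-zA-Z']+", s)  # quote marks stay attached

print(on_spaces)
print(on_non_letters)
print(keep_apostrophes)
```

None of the three gives the expected list, which is why a plain split on a single character class is not enough here.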
You can use string functions instead of a regex:
    to_be_removed = ".,:!"  # all characters to be removed
    s = "John's mom went there,but he wasn't there. So she said: 'Where are you!!'"
    for c in to_be_removed:
        s = s.replace(c, '')
    s.split()
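However, in your example you do not want to remove the apostrophe in John's, but you do want to remove it in you!!'. String operations fail at that point, and you need a finely tuned regex.
Edit: a simple regex can probably solve your problem: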
(\w[\w']*)
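It captures every run that starts with a word character and keeps capturing while the next character is an apostrophe or another word character. Note that \w matches digits and the underscore as well as letters.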
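For illustration, a short sketch of this first pattern with re.findall. The second sample string is made up here to show that \w is wider than letters, and the letters-only variant is an assumption about what should count as a word, not part of the answer.

```python
import re

first = re.compile(r"\w[\w']*")

s = "John's mom went there,but he wasn't there. So she said: 'Where are you'"
words = first.findall(s)
print(words)  # the last item keeps the closing quote: "you'"

# \w also matches digits and underscores (hypothetical sample text).
print(first.findall("it's 2nd_try"))

# Restricting the class to ASCII letters changes what counts as a word.
letters_only = re.compile(r"[A-Za-z][A-Za-z']*")
print(letters_only.findall("it's 2nd_try"))
```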
(\w[\w']*\w)
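This second regex is for a very specific situation. The first regex can capture words like you' with the trailing quote attached. The second one avoids that and captures an apostrophe only when it is inside the word, not at the beginning or the end. The cost is that the second regex cannot capture the apostrophe in Moss' mom. You have to decide whether you want to capture a trailing apostrophe in names that end with s and mark possession.
Example: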
    import re

    rgx = re.compile(r"([\w][\w']*\w)")
    s = "John's mom went there,but he wasn't there. So she said: 'Where are you!!'"
    rgx.findall(s)
    # ["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']
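To make the trade-off with trailing apostrophes concrete, here is a small side-by-side sketch. The sample sentence and the variable names are hypothetical.

```python
import re

with_trailing = re.compile(r"\w[\w']*")   # first pattern: may end in an apostrophe
inner_only = re.compile(r"\w[\w']*\w")    # second pattern: apostrophes only inside

text = "Moss' mom said: 'Where are you'"  # hypothetical sample

print(with_trailing.findall(text))  # keeps Moss' but also you'
print(inner_only.findall(text))     # drops both trailing apostrophes
```

Neither pattern can tell a possessive apostrophe from a closing quote, so the choice depends on which of the two cases matters more in your text.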
Update 2: I found a bug in my regex! It cannot capture a single letter followed by an apostrophe, like A'. Here is the fixed regex:
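(\w[\w']*\w|\w)

    import re

    rgx = re.compile(r"(\w[\w']*\w|\w)")
    s = "John's mom went there,but he wasn't there. So she said: 'Where are you!!' 'A a'"
    rgx.findall(s)
    # ["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']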
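A brief sketch of why the extra alternative matters: the two-ended pattern needs at least two word characters, so it skips every one-letter word, not only those next to an apostrophe. The sample text and names below are made up for illustration.

```python
import re

second = re.compile(r"\w[\w']*\w")     # needs at least two word characters
fixed = re.compile(r"\w[\w']*\w|\w")   # falls back to a single word character

text = "'A a' I am"  # hypothetical sample

print(second.findall(text))  # one-letter words are missing
print(fixed.findall(text))   # one-letter words are captured
```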