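How can I extend the code below so that it finds every instance where my substring and the parent string differ by 2 or fewer mismatches?
Substring: SSQP
String to match against: SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ
Here is an example that allows only one possible mismatch: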
    >>> import re
    >>> s = 'SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'
    >>> re.findall(r'(?=(SSQP|[A-Z]SQP|S[A-Z]QP|SS[A-Z]P|SSQ[A-Z]))', s)
    ['SSQQ', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP']
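Solution
You don't have to use re here; you can use the itertools module instead and save a lot of memory.
You can first extract every substring of length 4, compare each one with your substring, and keep those that differ from it in at most 2 positions: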
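    # Python 2 (izip and xrange)
    from itertools import izip, islice, tee

    def sub_findre(s, substring, diffnumber):
        sublen = len(substring)
        # One lazy iterator of (substring char, window char) pairs per start index
        zip_gen = (izip(substring, islice(s, i, i + sublen)) for i in xrange(len(s)))
        for z in zip_gen:
            # tee() gives two copies: one to count matches, one to rebuild the window
            l, z = tee(z)
            if sum(1 for i, j in l if i == j) >= sublen - diffnumber:
                new = izip(*z)
                next(new)                 # skip the substring's own characters
                yield ''.join(next(new))  # characters of the matching window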
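Demo:
    s = 'SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'
    substring = 'SSQP'
    print list(sub_findre(s, substring, 2))
    ['SSPQ', 'SPQQ', 'QQQP', 'SSSS', 'SSSQ', 'SSQQ', 'SQQQ', 'SSQP', 'PSQS', 'SSQP', 'SSQP', 'SQPP', 'SSSS', 'SSSQ', 'SSQP', 'PSQS', 'SSQP', 'SSSS', 'SSSQ', 'SSQP', 'PSQS', 'SSQP', 'SSSS', 'SSSQ', 'SSQP', 'PSQ']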
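For readers on Python 3, where izip and xrange no longer exist, the same idea can be sketched with plain slicing and the built-in zip. This is a minimal sketch and not the answer's own code: the name find_near_matches and the parameter max_mismatches are made up for illustration. It also differs in one respect: it only looks at complete windows, so the truncated 'PSQ' at the very end of the output above is not produced.

```python
def find_near_matches(s, substring, max_mismatches):
    """Yield each full-length window of s that differs from substring
    in at most max_mismatches positions (hypothetical helper)."""
    sublen = len(substring)
    # Stop early enough that every window has exactly sublen characters.
    for i in range(len(s) - sublen + 1):
        window = s[i:i + sublen]
        # Count the positions where the two strings disagree.
        mismatches = sum(a != b for a, b in zip(substring, window))
        if mismatches <= max_mismatches:
            yield window


s = 'SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'
matches = list(find_near_matches(s, 'SSQP', 2))
print(matches)
```

Counting mismatches directly (rather than counting matches and comparing against sublen - diffnumber) keeps the condition close to the wording of the question, "2 mismatches or fewer".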
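If you want to return the indices instead, you need to put the index into izip; you can use itertools.repeat() to repeat the index once for each character of the substring: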
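    # Python 2 (izip and xrange)
    from itertools import izip, islice, tee, repeat

    def sub_findre(s, substring, diffnumber):
        sublen = len(substring)
        # Each triple is (substring char, window char, start index of the window)
        zip_gen = (izip(substring, islice(s, i, i + sublen), repeat(i, sublen)) for i in xrange(len(s)))
        for z in zip_gen:
            l, z = tee(z)
            if sum(1 for i, j, _ in l if i == j) >= sublen - diffnumber:
                new = izip(*z)
                next(new)           # skip the substring's characters
                next(new)           # skip the window's characters
                yield next(new)[0]  # the repeated start index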
Demo:
    print list(sub_findre(s, substring, 2))
    [0, 1, 4, 8, 9, 10, 11, 15, 20, 23, 27, 28, 32, 33, 34, 39, 42, 46, 47, 48, 53, 56, 60, 61, 62, 67]
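If it helps to see how far each hit is from the substring, the index version can be sketched in Python 3 so that it reports the mismatch count alongside the start index. Again this is only an illustrative sketch: near_match_positions and max_mismatches are hypothetical names, not part of the answer. It checks complete windows only, so index 67 (the truncated 'PSQ' window) does not appear in its result.

```python
def near_match_positions(s, substring, max_mismatches):
    """Yield (start_index, mismatch_count) for each full-length window of s
    within max_mismatches of substring (hypothetical helper)."""
    sublen = len(substring)
    for i in range(len(s) - sublen + 1):
        mismatches = 0
        for a, b in zip(substring, s[i:i + sublen]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many differences, give up on this window
        else:
            # The loop finished without break: the window is close enough.
            yield i, mismatches


s = 'SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'
positions = list(near_match_positions(s, 'SSQP', 2))
exact = [i for i, count in positions if count == 0]
print(positions)
print(exact)
```

Breaking out of the inner loop as soon as the limit is exceeded avoids comparing the rest of a window that has already failed, which may matter for longer substrings.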