I have a big string, say "aaaaaaaaaaabbbbbbbbbcccccccccccdddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd", and I want to count the number of times each of several small strings is found in the big string (overlapping matches are fine). I only care about speed. KMP seemed good, but it looked like Rabin-Karp deals with multiple patterns yet is slow.
Solution
The problem with most string searching algorithms is that they take at least O(k) time to return k matches, so if you have a string of, say, a million "a"s, and a million queries for the small string "a", then it will take around a million million iterations to count all the matches!
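To make that cost concrete, here is a hypothetical brute-force baseline (not code from this answer): a scan that restarts one position after each hit, so every query pays for every match it reports.

```python
def count_overlapping(haystack, needle):
    """Count overlapping occurrences by scanning with str.find.

    Each of the k matches costs at least one step to report, so q queries
    that each have ~n matches cost on the order of q*n steps in total.
    """
    count = 0
    start = 0
    while True:
        idx = haystack.find(needle, start)
        if idx == -1:
            return count
        count += 1
        start = idx + 1  # advance only one position so overlaps are counted

print(count_overlapping("a" * 10, "a"))  # 10
print(count_overlapping("aaaa", "aa"))   # 3 (overlapping matches)
```

With a million "a"s and a million queries for "a", this loop body runs about a million times per query, which is exactly the blow-up described above.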
An alternative linear-time approach would be:
> Construct a suffix tree of the big string: O(n), where n is len(big string)
> Precompute the number of suffixes below each node in the suffix tree: O(n)
> For each small string, find the node in the suffix tree that has the small string as a suffix: O(m), where m is len(small string)
> Add to the total count the number of suffixes below that node. (Each suffix corresponds to a different match of the small string in the big string.)
This will take time O(n+p), where n is the length of the big string and p is the total length of all the small strings.
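The key observation in the last step can be checked in miniature with a brute-force helper (an illustrative toy, not the suffix tree itself): every occurrence of a needle corresponds to exactly one suffix of the big string that starts with the needle, so counting those suffixes counts the overlapping matches.

```python
def count_by_suffixes(big, needle):
    """Count overlapping matches by counting suffixes that start with needle.

    The suffix big[i:] starts with needle exactly when a match begins at
    position i.  This check is O(n*m) per needle; the suffix tree reaches
    the same numbers in O(m) per needle.
    """
    return sum(1 for i in range(len(big)) if big.startswith(needle, i))

big = "aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd"
for needle in ["a", "ab", "abc"]:
    print(needle, "has", count_by_suffixes(big, needle), "matches")
```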
Example code
As requested, here is some small(ish) example code in Python using this approach:
class SuffixTree:
    def __init__(self):
        """Returns an empty suffix tree"""
        self.T = ''
        self.E = {}
        self.nodes = [-1]  # 0th node is empty string

    def add(self, s):
        """Adds the input string to the suffix tree.

        This inserts all substrings into the tree.
        End the string with a unique character if you want a leaf-node for
        every suffix.

        Produces an edge graph keyed by (node, character) that gives
        (first, last, end).  This means that the edge has characters from
        T[first:last+1] and goes to node end."""
        origin, first, last = 0, len(self.T), len(self.T) - 1
        self.T += s
        nc = len(self.nodes)
        self.nodes += [-1] * (2 * len(s))
        T = self.T
        E = self.E
        nodes = self.nodes

        Lm1 = len(T) - 1
        for last_char_index in range(first, len(T)):
            c = T[last_char_index]
            last_parent_node = -1
            while 1:
                parent_node = origin
                if first > last:
                    if (origin, c) in E:
                        break
                else:
                    key = origin, T[first]
                    edge_first, edge_last, edge_end = E[key]
                    span = last - first
                    A = edge_first + span
                    m = T[A + 1]
                    if m == c:
                        break
                    E[key] = (edge_first, A, nc)
                    nodes[nc] = origin
                    E[nc, m] = (A + 1, edge_last, edge_end)
                    parent_node = nc
                    nc += 1
                E[parent_node, c] = (last_char_index, Lm1, nc)
                nc += 1
                if last_parent_node > 0:
                    nodes[last_parent_node] = parent_node
                last_parent_node = parent_node
                if origin == 0:
                    first += 1
                else:
                    origin = nodes[origin]

                if first <= last:
                    edge_first, edge_last, edge_end = E[origin, T[first]]
                    span = edge_last - edge_first
                    while span <= last - first:
                        first += span + 1
                        origin = edge_end
                        if first <= last:
                            edge_first, edge_last, edge_end = E[origin, T[first]]
                            span = edge_last - edge_first
            if last_parent_node > 0:
                nodes[last_parent_node] = parent_node
            last += 1
            if first <= last:
                edge_first, edge_last, edge_end = E[origin, T[first]]
                span = edge_last - edge_first
        return self

    def make_choices(self):
        """Construct a sorted list for each node of the possible continuing
        characters"""
        # Contains the set of choices for each node
        choices = [list() for n in range(len(self.nodes))]
        for (origin, c), edge in self.E.items():
            choices[origin].append(c)
        # should not have any repeats by construction
        choices = [sorted(s) for s in choices]
        self.choices = choices
        return choices

    def count_suffixes(self, term):
        """Recurses through the tree finding how many suffixes are based at
        each node.  Strings are assumed to use term as the terminating
        character."""
        C = self.suffix_counts = [0] * len(self.nodes)
        choices = self.make_choices()

        def f(node=0):
            t = 0
            X = choices[node]
            if len(X) == 0:
                t += 1  # this node is a leaf node
            else:
                for c in X:
                    if c == term:
                        t += 1
                        continue
                    first, last, end = self.E[node, c]
                    t += f(end)
            C[node] = t
            return t

        return f()

    def count_matches(self, needle):
        """Return the count of matches for this needle in the suffix tree"""
        i = 0
        node = 0
        E = self.E
        T = self.T
        while i < len(needle):
            c = needle[i]
            key = node, c
            if key not in E:
                return 0
            first, last, node = E[key]
            while i < len(needle) and first <= last:
                if needle[i] != T[first]:
                    return 0
                i += 1
                first += 1
        return self.suffix_counts[node]


big = "aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd"
small_strings = ["a", "ab", "abc"]
s = SuffixTree()
term = chr(0)
s.add(big + term)
s.count_suffixes(term)
for needle in small_strings:
    x = s.count_matches(needle)
    print(needle, 'has', x, 'matches')
It prints:
a has 11 matches
ab has 1 matches
abc has 0 matches
However, in practice I would suggest you simply use a pre-existing Aho-Corasick implementation, as I would expect this to be faster in your particular case.
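For reference, the Aho-Corasick idea can be sketched in pure Python (a toy illustration under my own naming, not the answer's code and not a tuned implementation): build a trie of the needles, add failure links with a BFS, then scan the haystack once, counting every needle at every position, overlaps included.

```python
from collections import deque

def aho_corasick_counts(haystack, needles):
    """Count overlapping occurrences of every needle in one scan."""
    # Build the trie: goto[node] maps a character to the child node,
    # out[node] lists the indices of needles ending at that node.
    goto = [{}]
    out = [[]]
    fail = [0]
    for idx, word in enumerate(needles):
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                out.append([])
                fail.append(0)
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(idx)

    # BFS from the root to set failure links (longest proper suffix that is
    # also a trie node) and merge output lists down the failure chain.
    q = deque(goto[0].values())
    while q:
        node = q.popleft()
        for ch, nxt in goto[node].items():
            q.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]

    # Single pass over the haystack, following failure links on mismatch.
    counts = [0] * len(needles)
    node = 0
    for ch in haystack:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for idx in out[node]:
            counts[idx] += 1
    return counts

big = "aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd"
print(aho_corasick_counts(big, ["a", "ab", "abc"]))  # [11, 1, 0]
```

This matches the suffix-tree output above while touching each haystack character once, which is why a well-optimized Aho-Corasick library is usually the pragmatic choice for many-pattern counting.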