c – 哪种字符串查找算法适用于此？

我有一个大字串说“aaaaaaaaaaabbbbbbbbbcccccccccccdddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd我想计算(重叠是好的)在大字符串中找到小字符串的次数.我只关心速度. KMP似乎很好,但看起来Rabin-Karp处理了多个但速度很慢.

解决方法

大多数字符串搜索算法的问题是它们将至少花费时间O(k)来返回k个匹配,所以如果你有一个字符串说100万“a”s,以及100万个小字符串查询“a”,那么它将花费大约100万次,数百万次迭代来计算所有比赛！

另一种线性时间方法是：

>构造一个大字符串的后缀树：O(n)其中n是len(大字符串)
>预先计算后缀树中每个节点下面的后缀数：O(n)
>对于每个小字符串,在后缀树中找到具有小字符串作为后缀的节点：O(m)其中m是len(小字符串)
>将总节点数添加到该节点下方的后缀数. (每个后缀对应于大字符串中小字符串的不同匹配)

这将花费时间O(n p),其中n是大字符串的长度,p是所有小字符串的总长度.

示例代码

根据要求,这里有一些Python中使用这种方法的小(ish)示例代码：

from collections import defaultdict

class SuffixTree:
    def __init__(self):
        """Returns an empty suffix tree"""
        self.T=''
        self.E={}
        self.nodes=[-1] # 0th node is empty string

    def add(self,s):
        """Adds the input string to the suffix tree.

        This inserts all substrings into the tree.
        End the string with a unique character if you want a leaf-node for every suffix.

        Produces an edge graph keyed by (node,character) that gives (first,last,end)
        This means that the edge has characters from T[first:last+1] and goes to node end."""
        origin,first,last = 0,len(self.T),len(self.T)-1
        self.T+=s
        nc = len(self.nodes)
        self.nodes += [-1]*(2*len(s))
        T=self.T
        E=self.E
        nodes=self.nodes

        Lm1=len(T)-1
        for last_char_index in xrange(first,len(T)):
            c=T[last_char_index]
            last_parent_node = -1                    
            while 1:
                parent_node = origin
                if first>last:
                    if (origin,c) in E:
                        break             
                else:
                    key = origin,T[first]
                    edge_first,edge_last,edge_end = E[key]
                    span = last - first
                    A = edge_first+span
                    m = T[A+1]
                    if m==c:
                        break
                    E[key] = (edge_first,A,nc)
                    nodes[nc] = origin
                    E[nc,m] = (A+1,edge_end)
                    parent_node = nc
                    nc+=1  
                E[parent_node,c] = (last_char_index,Lm1,nc)
                nc+=1  
                if last_parent_node>0:
                    nodes[last_parent_node] = parent_node
                last_parent_node = parent_node
                if origin==0:
                    first+=1
                else:
                    origin = nodes[origin]

                if first <= last:
                    edge_first,edge_end=E[origin,T[first]]
                    span = edge_last-edge_first
                    while span <= last - first:
                        first+=span+1
                        origin = edge_end
                        if first <= last:
                            edge_first,edge_end = E[origin,T[first]]
                            span = edge_last - edge_first

            if last_parent_node>0:
                nodes[last_parent_node] = parent_node
            last+=1
            if first <= last:
                    edge_first,T[first]]
                            span = edge_last - edge_first
        return self


    def make_choices(self):
        """Construct a sorted list for each node of the possible continuing characters"""
        choices = [list() for n in xrange(len(self.nodes))] # Contains set of choices for each node
        for (origin,c),edge in self.E.items():
            choices[origin].append(c)
        choices=[sorted(s) for s in choices] # should not have any repeats by construction
        self.choices=choices
        return choices


    def count_suffixes(self,term):
        """Recurses through the tree finding how many suffixes are based at each node.
        Strings assumed to use term as the terminating character"""
        C = self.suffix_counts = [0]*len(self.nodes)
        choices = self.make_choices()
        def f(node=0):
            t=0
            X=choices[node]
            if len(X)==0:
                t+=1 # this node is a leaf node
            else:
                for c in X:
                    if c==term:
                        t+=1
                        continue
                    first,end = self.E[node,c]
                    t+=f(end)
            C[node]=t
            return t
        return f()

    def count_matches(self,needle):
        """Return the count of matches for this needle in the suffix tree"""
        i=0
        node=0
        E=self.E
        T=self.T
        while i<len(needle):
            c=needle[i]
            key=node,c
            if key not in E:
                return 0
            first,node = E[key]
            while i<len(needle) and first<=last:
                if needle[i]!=T[first]:
                    return 0
                i+=1
                first+=1
        return self.suffix_counts[node]


big="aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd"
small_strings=["a","ab","abc"]
s=SuffixTree()
term=chr(0)
s.add(big+term)
s.count_suffixes(term)
for needle in small_strings:
    x=s.count_matches(needle)
    print needle,'has',x,'matches'

它打印：

a has 11 matches 
ab has 1 matches 
abc has 0 matches

但是,在实践中,我建议您只使用预先存在的Aho-Corasick实现,因为我希望在您的特定情况下这会更快.

c – 哪种字符串查找算法适用于此？

解决方法

猜你在找的C&C++相关文章