Algorithm: a function that returns the affinity between texts?

Consider that I have a

string1 = "hello hi goodmorning evening [...]"

and I have some secondary keywords

compare1 = "hello evening"
compare2 = "hello hi"

I need a function that returns the affinity between the text and the keywords. Example:

function(string1,compare1);  // returns: 4
function(string1,compare2);  // returns: 5 (more relevant)

Note that 5 and 4 are just examples.

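You could say "write a function that counts occurrences", but that would not work for this example: both keyword sets have 2 occurrences, yet compare1 is less relevant because "hello evening" is not found exactly in string1 (the words hello and evening are farther apart than hello and hi).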
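The tie described above is easy to reproduce. The following is a minimal Python sketch, not part of the original question; the helper name `count_keyword_hits` is hypothetical, and the `[...]` elision in string1 is kept as a literal token.

```python
def count_keyword_hits(text, keywords):
    """Count how many of the keywords occur as whole words in the text."""
    words = set(text.split())
    return sum(1 for keyword in keywords.split() if keyword in words)


string1 = "hello hi goodmorning evening [...]"

hits1 = count_keyword_hits(string1, "hello evening")
hits2 = count_keyword_hits(string1, "hello hi")

# Both keyword sets get the same count, so counting alone cannot rank them.
print(hits1, hits2)  # 2 2
```

Because the count ignores where the words sit in the text, it cannot tell adjacent keywords from scattered ones.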

Are there any known algorithms to do this?

ADD1：

Algorithms like edit distance will not work in this case, because string1 is a complete text (around 300-400 words) while the compare strings are at most 4-5 words long.

Solution

Dynamic programming algorithm

It seems that what you are looking for is very similar to what the Smith–Waterman algorithm does.

From Wikipedia:

The algorithm was first proposed by Temple F. Smith and Michael S. Waterman in 1981. Like the Needleman–Wunsch algorithm, of which it is a variation, Smith–Waterman is a dynamic programming algorithm. As such, it has the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix and the gap-scoring scheme).
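As a rough illustration of the idea, here is a minimal Python sketch of the Smith–Waterman scoring recurrence, applied to words rather than characters so it fits the question's example. The function name `smith_waterman_score`, the scoring values (match +2, mismatch -1, linear gap penalty -1) and the word-level tokenization are all assumptions chosen for the example, not a reference implementation.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the DP table,
    so it returns the score but not the alignment itself.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Word-level comparison using the strings from the question.
string1 = "hello hi goodmorning evening [...]".split()
score1 = smith_waterman_score(string1, "hello evening".split())
score2 = smith_waterman_score(string1, "hello hi".split())
print(score1, score2)  # 2 4
```

With these assumed scores, "hello hi" ranks above "hello evening": the two skipped words between hello and evening cost as much in gap penalties as the second match gains.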

Let's look at a practical example so you can evaluate its usefulness.

Suppose we have a text:

text = "We the people of the United States, in order to form a more 
perfect union, establish justice, insure domestic tranquility, provide for the common defense, 

promote the general welfare, 

and secure the blessings of liberty to ourselves and our posterity, do ordain and establish this Constitution for the United States of 
America.";

I have isolated the segment we are going to match, just to make it easier to read.

We will compare its affinity (or similarity) with a list of strings:

list = {
   "the general welfare", "my personal welfare", "general utopian welfare", "the general", "promote welfare", "stackoverflow rulez"
   };

I already have the algorithm implemented, so I will just calculate the similarity and normalize the results:

sw = SmithWatermanSimilarity[ text,#] & /@ list;
swN = (sw - Min[sw])/(Max[sw] - Min[sw])
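The normalization step rescales the raw scores to the range [0, 1]. Here is a minimal Python sketch of the same min-max formula. The function name `min_max_normalize` and the sample scores are hypothetical (they are not the output of the run above), and the guard for the case where all scores are equal is an addition that the one-liner does not have.

```python
def min_max_normalize(scores):
    """Rescale scores linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: the formula would divide by zero, so return zeros.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


raw_scores = [4, 2, 6]  # hypothetical raw similarity scores
normalized = min_max_normalize(raw_scores)
print(normalized)  # [0.5, 0.0, 1.0]
```

Min-max scaling keeps the ranking of the candidates intact, so it only changes how the scores are displayed, not which string comes out on top.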

Then we plot the results:

[Figure: plot of the normalized similarity swN for each string in list]

I think this is very similar to your expected result.

HTH!

Some implementations (w/ source code)

> Smith-Waterman CUDA Source Code (GSW)
> The S-W algorithm explained (presentation)
> An interactive step-by-step demo applet
> Java Source code
> Python source code
