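Is there any library that lets me measure the "identicality" of two strings? More generally, is there any algorithm for this that I could implement (in JavaScript)?
Take the following string as an example: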
Abnormal Elasticity of Single-Crystal Magnesiosiderite across the Spin
Transition in Earth’s Lower Mantle
Now consider the following, slightly altered string. Note the parts that differ:
bnormal Elasticity of Single Crystal Magnesio-Siderite across the Spin-Transition in Earths Lower Mantle.
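JavaScript's native equality operators won't tell you much about the relationship between these strings. In this particular case you could match the strings with a regular expression, but in general that only works when you know which differences to expect. If the input strings are arbitrary, the generality of this approach quickly breaks down.
Approach... I can imagine writing an algorithm that splits the input string into an arbitrary number N of substrings, then matches the target string against all of those substrings and uses the number of matches as a measure of identicality. But this feels like an unattractive approach, and I don't even want to think about how the big O would depend on N.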
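One common way to make the substring idea concrete is to compare overlapping fixed-length substrings (n-grams) and score the overlap with the Dice coefficient. The sketch below assumes that reading of the approach; the names `ngrams` and `ngramSimilarity` and the default of n = 2 are illustrative choices, not part of the original question. It runs in time linear in the two string lengths.

```javascript
// Split a string into its overlapping substrings of length n (n-grams).
function ngrams(str, n) {
  var grams = [];
  for (var i = 0; i + n <= str.length; i++) {
    grams.push(str.slice(i, i + n));
  }
  return grams;
}

// Dice coefficient over n-grams: 2 * shared / (count in a + count in b).
// Returns 0 when no n-gram is shared and 1 when both strings have the
// same multiset of n-grams.
function ngramSimilarity(a, b, n) {
  n = n || 2; // assumption: bigrams unless the caller says otherwise
  var gramsA = ngrams(a, n);
  var gramsB = ngrams(b, n);
  if (gramsA.length === 0 || gramsB.length === 0) {
    // Strings shorter than n have no n-grams; fall back to equality.
    return a === b ? 1 : 0;
  }
  // Count how often each n-gram occurs in a.
  var counts = Object.create(null);
  gramsA.forEach(function (g) {
    counts[g] = (counts[g] || 0) + 1;
  });
  // Each n-gram of b may consume one matching n-gram of a.
  var shared = 0;
  gramsB.forEach(function (g) {
    if (counts[g] > 0) {
      counts[g]--;
      shared++;
    }
  });
  return (2 * shared) / (gramsA.length + gramsB.length);
}
```

With bigrams, both "Abxy" vs "bAxy" and "Abxy" vs "aBxy" share only "xy" and score 1/3, so this measure does not distinguish a swapped pair from a change of case.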
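It seems to me there are a lot of free parameters in such an algorithm. For example, whether a difference in case should contribute as much as, more than, or less than a difference in character order seems to be an arbitrary choice for the designer, i.e.: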
identicality("Abxy","bAxy")
versus
identicality("Abxy","aBxy")
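Defining the requirements more specifically...
The first example is a scenario where I could use this. I'm loading a bunch of strings (titles of academic papers) and checking whether I already have them in my database. However, the sources may contain typos, differences in conventions, errors and so on, which makes matching hard. In this particular scenario there is probably an easier way to match titles: since you can more or less anticipate what might go wrong, you could write down some beast of a regular expression.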
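For the title-matching scenario, a small normalization step can remove the predictable differences (case, hyphens, apostrophes, trailing punctuation) before any comparison. This is only a sketch: `normalizeTitle` is a hypothetical helper, and it assumes titles are essentially ASCII, since every character outside a-z and 0-9 is dropped.

```javascript
// Reduce a title to a comparison key: lower-case it, then remove every
// run of characters other than a-z and 0-9 (spaces, hyphens, quotes, dots).
// Assumption: non-ASCII letters are not significant for matching.
function normalizeTitle(title) {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, "");
}
```

Applied to the two example titles above, the keys differ only by the missing leading "a" of the second title, which leaves a single edit for a fuzzy comparison to absorb.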
Solution
For Hirschberg("Abxy", "bAxy") the results are:
It was 2 edit operations: keep: 3 insert: 1 delete: 1
And for Hirschberg("Abxy", "aBxy") the results are:
It was 2 edit operations: keep: 2 replace: 2
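To show where counts like these come from, here is a minimal sketch that fills the standard edit-distance table and walks it backwards to tally the operations. It is not Hirschberg's algorithm, which finds an alignment in linear space; it is a plain quadratic-space version with unit costs, and `editOperations` is a hypothetical name. When several alignments are equally cheap, the breakdown depends on tie-breaking, so it may report "replace: 2" where another implementation reports "insert: 1 delete: 1"; only the total is fixed.

```javascript
// Count the keep/replace/insert/delete operations of one optimal
// alignment that turns s into t (unit cost for every edit).
function editOperations(s, t) {
  var m = s.length;
  var n = t.length;
  var i, j;

  // d[i][j] = edit distance between the first i chars of s and first j of t.
  var d = [];
  for (i = 0; i <= m; i++) {
    d[i] = [i];
  }
  for (j = 1; j <= n; j++) {
    d[0][j] = j;
  }
  for (i = 1; i <= m; i++) {
    for (j = 1; j <= n; j++) {
      var cost = s.charAt(i - 1) === t.charAt(j - 1) ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1,         // deletion
                         d[i][j - 1] + 1,         // insertion
                         d[i - 1][j - 1] + cost); // keep or replace
    }
  }

  // Walk back from d[m][n] to d[0][0], preferring keep, then replace,
  // then delete, then insert whenever several moves are equally cheap.
  var ops = { keep: 0, replace: 0, insert: 0, delete: 0, distance: d[m][n] };
  i = m;
  j = n;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && s.charAt(i - 1) === t.charAt(j - 1) &&
        d[i][j] === d[i - 1][j - 1]) {
      ops.keep++; i--; j--;
    } else if (i > 0 && j > 0 && d[i][j] === d[i - 1][j - 1] + 1) {
      ops.replace++; i--; j--;
    } else if (i > 0 && d[i][j] === d[i - 1][j] + 1) {
      ops.delete++; i--;
    } else {
      ops.insert++; j--;
    }
  }
  return ops;
}
```

Both example pairs have a total of 2 edits under this scheme, which matches the totals reported above.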
You can see a JavaScript implementation on this page.
"Optimal" string alignment distance
function optimalStringAlignmentDistance(s, t) {
    // Determine the "optimal" string-alignment distance between s and t
    if (!s || !t) {
        return 99; // sentinel for missing or empty input
    }
    var m = s.length;
    var n = t.length;

    /* For all i and j, d[i][j] holds the string-alignment distance
     * between the first i characters of s and the first j characters of t.
     * Note that the array has (m+1)x(n+1) values.
     */
    var d = new Array();
    for (var i = 0; i <= m; i++) {
        d[i] = new Array();
        d[i][0] = i;
    }
    for (var j = 0; j <= n; j++) {
        d[0][j] = j;
    }

    // Determine substring distances
    var cost = 0;
    for (var j = 1; j <= n; j++) {
        for (var i = 1; i <= m; i++) {
            // Subtract one to start at strings' index zero instead of index one
            cost = (s.charAt(i-1) == t.charAt(j-1)) ? 0 : 1;
            d[i][j] = Math.min(d[i][j-1] + 1,                 // insertion
                               Math.min(d[i-1][j] + 1,        // deletion
                                        d[i-1][j-1] + cost)); // substitution

            if (i > 1 && j > 1 && s.charAt(i-1) == t.charAt(j-2) && s.charAt(i-2) == t.charAt(j-1)) {
                d[i][j] = Math.min(d[i][j], d[i-2][j-2] + cost); // transposition
            }
        }
    }

    // Return the strings' distance
    return d[m][n];
}

alert(optimalStringAlignmentDistance("Abxy", "bAxy"))
alert(optimalStringAlignmentDistance("Abxy", "aBxy"))
Damerau-Levenshtein distance
function damerauLevenshteinDistance(s, t) {
    // Determine the Damerau-Levenshtein distance between s and t
    if (!s || !t) {
        return 99; // sentinel for missing or empty input
    }
    var m = s.length;
    var n = t.length;
    var charDictionary = new Object();

    /* For all i and j, d[i][j] holds the Damerau-Levenshtein distance
     * between the first i characters of s and the first j characters of t.
     * Note that the array has (m+1)x(n+1) values.
     */
    var d = new Array();
    for (var i = 0; i <= m; i++) {
        d[i] = new Array();
        d[i][0] = i;
    }
    for (var j = 0; j <= n; j++) {
        d[0][j] = j;
    }

    // Populate a dictionary with the alphabet of the two strings
    for (var i = 0; i < m; i++) {
        charDictionary[s.charAt(i)] = 0;
    }
    for (var j = 0; j < n; j++) {
        charDictionary[t.charAt(j)] = 0;
    }

    // Determine substring distances
    for (var i = 1; i <= m; i++) {
        var db = 0; // last column in this row where the characters matched
        for (var j = 1; j <= n; j++) {
            var i1 = charDictionary[t.charAt(j-1)]; // last row of s containing t's current character
            var j1 = db;
            var cost = 0;

            // Subtract one to start at strings' index zero instead of index one
            if (s.charAt(i-1) == t.charAt(j-1)) {
                db = j;
            } else {
                cost = 1;
            }
            d[i][j] = Math.min(d[i][j-1] + 1,                 // insertion
                               Math.min(d[i-1][j] + 1,        // deletion
                                        d[i-1][j-1] + cost)); // substitution
            if (i1 > 0 && j1 > 0) {
                d[i][j] = Math.min(d[i][j], d[i1-1][j1-1] + (i-i1-1) + (j-j1-1) + 1); // transposition
            }
        }
        charDictionary[s.charAt(i-1)] = i;
    }

    // Return the strings' distance
    return d[m][n];
}

alert(damerauLevenshteinDistance("Abxy", "aBxy"))
alert(damerauLevenshteinDistance("Abxy", "bAxy"))
Optimal String Alignment has better performance:
Optimal String Alignment Distance: 0.20-0.30 ms
Damerau-Levenshtein Distance: 0.40-0.50 ms
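Since the question asks for a measure of "identicality" rather than a raw edit count, a distance can be rescaled by the longer string's length to give a score between 0 and 1. The sketch below is an assumption about how one might do that, not part of the original answer; `similarity` is a hypothetical helper, and `distanceFn` stands for either of the two distance functions above.

```javascript
// Turn an edit distance into a score in [0, 1]: 1 means identical,
// 0 means every position of the longer string had to be edited.
// distanceFn is any function (s, t) -> number whose result never
// exceeds the length of the longer string.
function similarity(s, t, distanceFn) {
  var longest = Math.max(s.length, t.length);
  if (longest === 0) {
    return 1; // two empty strings are identical
  }
  return 1 - distanceFn(s, t) / longest;
}

// Example (assumes optimalStringAlignmentDistance from above is in scope):
// similarity("Abxy", "bAxy", optimalStringAlignmentDistance)
```

Note that the two functions above return 99 when either input is empty, so empty strings should be handled before calling `similarity` with them.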