Mar*_*gón 103 java string-comparison
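I want to compare several strings with each other and find the ones that are most similar. I would like to know whether there is any library, method, or best practice that would tell me which strings are more similar to other strings. For example:
This comparison would report that the first pair is more similar than the second.
I think I need some method such as: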
double similarityIndex(String s1, String s2)
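Does something like this exist somewhere?
EDIT: Why am I doing this? I am writing a script that compares the output of an MS Project file with the output of a legacy system that handles tasks. Because the legacy system has a very limited field width, descriptions are abbreviated when values are added. I want a semi-automatic way to find which entries in MS Project are similar to the entries on the system, so that I can get the generated keys. It has drawbacks, since the result still has to be checked manually, but it would save a lot of work.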
acd*_*ior 153
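A common way of calculating the similarity between two strings as a 0%-100% value, used in many libraries, is to measure how much (in %) you would have to change the longer string to turn it into the shorter one: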
/**
 * Calculates the similarity (a number within 0 and 1) between two strings.
 */
public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { // longer should always have greater length
        longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below
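As a worked instance of the formula used in the code above, take the pair "kitten" and "sitting", whose Levenshtein distance is 3 and whose longer string has 7 characters. The symbol d is notation introduced here for the edit distance; it does not appear in the answer itself.

```latex
\[
\mathrm{similarity}(s_1, s_2)
  = \frac{\max(|s_1|, |s_2|) - d(s_1, s_2)}{\max(|s_1|, |s_2|)},
\qquad
\mathrm{similarity}(\text{kitten}, \text{sitting})
  = \frac{7 - 3}{7} \approx 0.571
\]
```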
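editDistance(): The editDistance() function above is expected to compute the edit distance between the two strings. There are several implementations of this step, and each may suit a particular scenario better. The most common is the Levenshtein distance algorithm, which is used in the example below (for very large strings, other algorithms may perform better).
Here are two options for computing the edit distance:
- Use the Levenshtein distance implementation in Apache Commons Text: apply(CharSequence left, CharSequence right)
- Implement it yourself, as in the example implementation in the working code below.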
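For reference, the Levenshtein distance is usually defined by the recurrence below, where D(i, j) is the distance between the first i characters of a and the first j characters of b. The names a, b, and D are notation introduced here for illustration, not identifiers from the answer.

```latex
\[
D(i, 0) = i, \qquad D(0, j) = j,
\]
\[
D(i, j) = \min
\begin{cases}
D(i-1, j) + 1 & \text{(deletion)} \\
D(i, j-1) + 1 & \text{(insertion)} \\
D(i-1, j-1) + [a_i \neq b_j] & \text{(match or substitution)}
\end{cases}
\]
```

The distance between the full strings is D(|a|, |b|). The implementation below computes the same values while keeping only a single row of the table in memory.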
public class StringSimilarity {

    /**
     * Calculates the similarity (a number within 0 and 1) between two strings.
     */
    public static double similarity(String s1, String s2) {
        String longer = s1, shorter = s2;
        if (s1.length() < s2.length()) { // longer should always have greater length
            longer = s2; shorter = s1;
        }
        int longerLength = longer.length();
        if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
        /* // If you have Apache Commons Text, you can use it to calculate the edit distance:
        LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
        return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
        return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
    }

    // Example implementation of the Levenshtein Edit Distance
    // See http://rosettacode.org/wiki/Levenshtein_distance#Java
    // Note: both inputs are lower-cased first, so the comparison is case-insensitive.
    public static int editDistance(String s1, String s2) {
        s1 = s1.toLowerCase();
        s2 = s2.toLowerCase();

        // costs[j] holds the distance between the current prefix of s1 and the first j chars of s2
        int[] costs = new int[s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) {
            int lastValue = i;
            for (int j = 0; j <= s2.length(); j++) {
                if (i == 0)
                    costs[j] = j;
                else {
                    if (j > 0) {
                        int newValue = costs[j - 1];
                        if (s1.charAt(i - 1) != s2.charAt(j - 1))
                            newValue = Math.min(Math.min(newValue, lastValue),
                                    costs[j]) + 1;
                        costs[j - 1] = lastValue;
                        lastValue = newValue;
                    }
                }
            }
            if (i > 0)
                costs[s2.length()] = lastValue;
        }
        return costs[s2.length()];
    }

    public static void printSimilarity(String s, String t) {
        System.out.println(String.format(
            "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
    }

    public static void main(String[] args) {
        printSimilarity("", "");
        printSimilarity("1234567890", "1");
        printSimilarity("1234567890", "123");
        printSimilarity("1234567890", "1234567");
        printSimilarity("1234567890", "1234567890");
        printSimilarity("1234567890", "1234567980");
        printSimilarity("47/2010", "472010");
        printSimilarity("47/2010", "472011");
        printSimilarity("47/2010", "AB.CDEF");
        printSimilarity("47/2010", "4B.CDEFG");
        printSimilarity("47/2010", "AB.CDEFG");
        printSimilarity("The quick fox jumped", "The fox jumped");
        printSimilarity("The quick fox jumped", "The fox");
        printSimilarity("kitten", "sitting");
    }
}
Output:
1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"
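To connect this back to the question (picking the most similar string out of several), here is a minimal sketch that ranks candidates with the same similarity formula. The class name MostSimilar and the helper mostSimilar are hypothetical names introduced for illustration; the edit distance is a plain two-row Levenshtein variant, lower-casing its inputs as the answer's code does.

```java
import java.util.Arrays;
import java.util.List;

class MostSimilar {

    // Levenshtein distance with two rolling rows; inputs are lower-cased,
    // mirroring the case-insensitive behaviour of the answer's editDistance().
    public static int editDistance(String s1, String s2) {
        String a = s1.toLowerCase();
        String b = s2.toLowerCase();
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertOrDelete = Math.min(prev[j], curr[j - 1]) + 1;
                curr[j] = Math.min(substitution, insertOrDelete);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    // Same formula as the answer: 1.0 means identical, 0.0 means nothing in common.
    public static double similarity(String s1, String s2) {
        int longerLength = Math.max(s1.length(), s2.length());
        if (longerLength == 0) {
            return 1.0; // both strings are empty
        }
        return (longerLength - editDistance(s1, s2)) / (double) longerLength;
    }

    // Hypothetical helper: returns the candidate with the highest similarity
    // to target, or null if the list is empty. Ties keep the earliest candidate.
    public static String mostSimilar(String target, List<String> candidates) {
        String best = null;
        double bestScore = -1.0;
        for (String candidate : candidates) {
            double score = similarity(target, candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("The fox", "The fox jumped", "AB.CDEF");
        // prints "The fox jumped"
        System.out.println(mostSimilar("The quick fox jumped", candidates));
    }
}
```

For the use case in the question, the result would still need a manual check, and a minimum score threshold (not shown) may be worth adding so that poor matches are not returned at all.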
dfa*_*dfa 80
Yes, there are many well-documented algorithms, such as:
Alternatively, you can check this
Check out these projects:
小智 14
I translated the Levenshtein distance algorithm into JavaScript:
String.prototype.LevenshteinDistance = function (s2) {
    var array = new Array(this.length + 1);
    for (var i = 0; i < this.length + 1; i++)
        array[i] = new Array(s2.length + 1);

    for (var i = 0; i < this.length + 1; i++)
        array[i][0] = i;
    for (var j = 0; j < s2.length + 1; j++)
        array[0][j] = j;

    for (var i = 1; i < this.length + 1; i++) {
        for (var j = 1; j < s2.length + 1; j++) {
            if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
            else {
                array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
                array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
            }
        }
    }
    return array[this.length][s2.length];
};
Flo*_*ser 11
You can use the Levenshtein distance to calculate the difference between two strings. http://en.wikipedia.org/wiki/Levenshtein_distance
小智 10
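There are indeed many string similarity measures:
You can find explanations and Java implementations of them here: https://github.com/tdebatty/java-string-similarity
You can use the Apache Commons Java library to do this. Take a look at these two functions in it: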
- getLevenshteinDistance
- getFuzzyDistance
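Thanks to the first answerer. I think computeEditDistance(s1, s2) is calculated twice there. Because that call is expensive, I decided to improve the performance of the code. So: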
public class LevenshteinDistance {

    public static int computeEditDistance(String s1, String s2) {
        s1 = s1.toLowerCase();
        s2 = s2.toLowerCase();

        int[] costs = new int[s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) {
            int lastValue = i;
            for (int j = 0; j <= s2.length(); j++) {
                if (i == 0) {
                    costs[j] = j;
                } else {
                    if (j > 0) {
                        int newValue = costs[j - 1];
                        if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
                            newValue = Math.min(Math.min(newValue, lastValue),
                                    costs[j]) + 1;
                        }
                        costs[j - 1] = lastValue;
                        lastValue = newValue;
                    }
                }
            }
            if (i > 0) {
                costs[s2.length()] = lastValue;
            }
        }
        return costs[s2.length()];
    }

    public static void printDistance(String s1, String s2) {
        double similarityOfStrings = 0.0;
        int editDistance = 0;
        if (s1.length() < s2.length()) { // s1 should always be bigger
            String swap = s1;
            s1 = s2;
            s2 = swap;
        }
        int bigLen = s1.length();
        editDistance = computeEditDistance(s1, s2);
        if (bigLen == 0) {
            similarityOfStrings = 1.0; /* both strings are zero length */
        } else {
            similarityOfStrings = (bigLen - editDistance) / (double) bigLen;
        }
        //////////////////////////
        //System.out.println(s1 + "-->" + s2 + ": " +
        //      editDistance + " (" + similarityOfStrings + ")");
        System.out.println(editDistance + " (" + similarityOfStrings + ")");
    }

    public static void main(String[] args) {
        printDistance("", "");
        printDistance("1234567890", "1");
        printDistance("1234567890", "12");
        printDistance("1234567890", "123");
        printDistance("1234567890", "1234");
        printDistance("1234567890", "12345");
        printDistance("1234567890", "123456");
        printDistance("1234567890", "1234567");
        printDistance("1234567890", "12345678");
        printDistance("1234567890", "123456789");
        printDistance("1234567890", "1234567890");
        printDistance("1234567890", "1234567980");

        printDistance("47/2010", "472010");
        printDistance("47/2010", "472011");

        printDistance("47/2010", "AB.CDEF");
        printDistance("47/2010", "4B.CDEFG");
        printDistance("47/2010", "AB.CDEFG");

        printDistance("The quick fox jumped", "The fox jumped");
        printDistance("The quick fox jumped", "The fox");
        printDistance("The quick fox jumped",
                "The quick fox jumped off the balcany");
        printDistance("kitten", "sitting");
        printDistance("rosettacode", "raisethysword");
        printDistance(new StringBuilder("rosettacode").reverse().toString(),
                new StringBuilder("raisethysword").reverse().toString());
        for (int i = 1; i < args.length; i += 2) {
            printDistance(args[i - 1], args[i]);
        }
    }
}