Java中的相似性字符串比较

Mar*_*gón 103 java string-comparison

我想比较几个字符串,找到最相似的字符串.我想知道是否有任何库,方法或最佳实践将返回我哪些字符串更类似于其他字符串.例如:

  • "快狐跳了起来" - >"狐狸跳了"
  • "快速狐狸跳了" - >"狐狸"

这种比较将返回第一个比第二个更相似.

我想我需要一些方法,例如:

double similarityIndex(String s1, String s2)
Run Code Online (Sandbox Code Playgroud)

某处有这样的事吗?

编辑:我为什么这样做?我正在编写一个脚本,将MS Project文件的输出与处理任务的某些遗留系统的输出进行比较.由于遗留系统的字段宽度非常有限,因此在添加值时,将缩写描述.我想要一些半自动的方式来查找MS Project中哪些条目与系统上的条目类似,这样我就可以获得生成的密钥.它有缺点,因为它必须仍然手动检查,但它会节省大量的工作

acd*_*ior 153

在许多库中使用的以0%-100%方式计算两个字符串之间相似性的常用方法是测量您需要多少(以%为单位)更改较长字符串以将其变为较短字符串:

/**
 * Calculates the similarity (a number within 0 and 1) between two strings.
 */
public static double similarity(String s1, String s2) {
  String longer = s1, shorter = s2;
  if (s1.length() < s2.length()) { // longer should always have greater length
    longer = s2; shorter = s1;
  }
  int longerLength = longer.length();
  if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
  return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below
Run Code Online (Sandbox Code Playgroud)


计算editDistance():

editDistance()函数上述预计以计算编辑距离的两个字符串之间.此步骤有多种实现方式,每种方法都可以更好地适应特定方案.最常见的是Levenshtein距离算法,我们将在下面的示例中使用它(对于非常大的字符串,其他算法可能表现更好).

以下是计算编辑距离的两个选项:


工作范例:

在这里查看在线演示.

public class StringSimilarity {

  /**
   * Calculates the similarity (a number within 0 and 1) between two strings.
   */
  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { // longer should always have greater length
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    /* // If you have Apache Commons Text, you can use it to calculate the edit distance:
    LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
    return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

  }

  // Example implementation of the Levenshtein Edit Distance
  // See http://rosettacode.org/wiki/Levenshtein_distance#Java
  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }

  public static void printSimilarity(String s, String t) {
    System.out.println(String.format(
      "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
  }

  public static void main(String[] args) {
    printSimilarity("", "");
    printSimilarity("1234567890", "1");
    printSimilarity("1234567890", "123");
    printSimilarity("1234567890", "1234567");
    printSimilarity("1234567890", "1234567890");
    printSimilarity("1234567890", "1234567980");
    printSimilarity("47/2010", "472010");
    printSimilarity("47/2010", "472011");
    printSimilarity("47/2010", "AB.CDEF");
    printSimilarity("47/2010", "4B.CDEFG");
    printSimilarity("47/2010", "AB.CDEFG");
    printSimilarity("The quick fox jumped", "The fox jumped");
    printSimilarity("The quick fox jumped", "The fox");
    printSimilarity("kitten", "sitting");
  }

}
Run Code Online (Sandbox Code Playgroud)

输出:

1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"
Run Code Online (Sandbox Code Playgroud)

  • Levenshtein距离法可以在`org.apache.commons.lang3.StringUtils`中找到. (10认同)

dfa*_*dfa 80

是的,有许多记录良好的算法,如:

  • 余弦相似度
  • Jaccard的相似性
  • 骰子的系数
  • 匹配相似度
  • 重叠相似性
  • 等等

或者你可以检查一下

检查这些项目:

  • +1 simmetrics网站似乎不再活跃.但是,我在sourceforge上找到了代码:http://sourceforge.net/projects/simmetrics/感谢指针. (18认同)
  • "你可以检查这个"链接被破坏了. (7认同)
  • 这就是 Michael Merchant 在上面发布正确链接的原因。 (2认同)
  • sourceforge上simmetrics的jar有点过时,https://github.com/mpkorstanje/simmetrics是带有maven工件的更新的github页面 (2认同)

小智 14

我将Levenshtein距离算法翻译成了JavaScript:

String.prototype.LevenshteinDistance = function (s2) {
    var array = new Array(this.length + 1);
    for (var i = 0; i < this.length + 1; i++)
        array[i] = new Array(s2.length + 1);

    for (var i = 0; i < this.length + 1; i++)
        array[i][0] = i;
    for (var j = 0; j < s2.length + 1; j++)
        array[0][j] = j;

    for (var i = 1; i < this.length + 1; i++) {
        for (var j = 1; j < s2.length + 1; j++) {
            if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
            else {
                array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
                array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
            }
        }
    }
    return array[this.length][s2.length];
};
Run Code Online (Sandbox Code Playgroud)


Flo*_*ser 11

您可以使用Levenshtein距离来计算两个字符串之间的差异. http://en.wikipedia.org/wiki/Levenshtein_distance

  • Levenshtein很适合一些字符串,但不会扩展到大量字符串之间的比较. (2认同)

小智 10

确实存在很多字符串相似性度量:

  • Levenshtein编辑距离;
  • Damerau-Levenshtein距离;
  • Jaro-Winkler相似;
  • 最长公共子序列编辑距离;
  • Q-Gram(Ukkonen);
  • n-Gram距离(Kondrak);
  • Jaccard指数;
  • 索伦森 - 骰子系数;
  • 余弦相似度;
  • ...

你可以在这里找到解释和java实现:https: //github.com/tdebatty/java-string-similarity


noe*_*cus 6

你可以使用apache commons java库来实现这一点.看看其中的这两个函数:
- getLevenshteinDistance
- getFuzzyDistance

  • 截至2017年10月,已弃用链接方法.使用类[LevenshteinDistance](https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/similarity/LevenshteinDistance.html)和[FuzzyScore](https: //commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/similarity/FuzzyScore.html)来自[commons text library](https://commons.apache. org/proper/commons-text /)代替 (2认同)

Moh*_*asi 5

感谢第一个回答者,我认为computeEditDistance(s1, s2)有2次计算。由于花费了大量时间,决定提高代码的性能。所以:

public class LevenshteinDistance {

public static int computeEditDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
        int lastValue = i;
        for (int j = 0; j <= s2.length(); j++) {
            if (i == 0) {
                costs[j] = j;
            } else {
                if (j > 0) {
                    int newValue = costs[j - 1];
                    if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
                        newValue = Math.min(Math.min(newValue, lastValue),
                                costs[j]) + 1;
                    }
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
        }
        if (i > 0) {
            costs[s2.length()] = lastValue;
        }
    }
    return costs[s2.length()];
}

public static void printDistance(String s1, String s2) {
    double similarityOfStrings = 0.0;
    int editDistance = 0;
    if (s1.length() < s2.length()) { // s1 should always be bigger
        String swap = s1;
        s1 = s2;
        s2 = swap;
    }
    int bigLen = s1.length();
    editDistance = computeEditDistance(s1, s2);
    if (bigLen == 0) {
        similarityOfStrings = 1.0; /* both strings are zero length */
    } else {
        similarityOfStrings = (bigLen - editDistance) / (double) bigLen;
    }
    //////////////////////////
    //System.out.println(s1 + "-->" + s2 + ": " +
      //      editDistance + " (" + similarityOfStrings + ")");
    System.out.println(editDistance + " (" + similarityOfStrings + ")");
}

public static void main(String[] args) {
    printDistance("", "");
    printDistance("1234567890", "1");
    printDistance("1234567890", "12");
    printDistance("1234567890", "123");
    printDistance("1234567890", "1234");
    printDistance("1234567890", "12345");
    printDistance("1234567890", "123456");
    printDistance("1234567890", "1234567");
    printDistance("1234567890", "12345678");
    printDistance("1234567890", "123456789");
    printDistance("1234567890", "1234567890");
    printDistance("1234567890", "1234567980");

    printDistance("47/2010", "472010");
    printDistance("47/2010", "472011");

    printDistance("47/2010", "AB.CDEF");
    printDistance("47/2010", "4B.CDEFG");
    printDistance("47/2010", "AB.CDEFG");

    printDistance("The quick fox jumped", "The fox jumped");
    printDistance("The quick fox jumped", "The fox");
    printDistance("The quick fox jumped",
            "The quick fox jumped off the balcany");
    printDistance("kitten", "sitting");
    printDistance("rosettacode", "raisethysword");
    printDistance(new StringBuilder("rosettacode").reverse().toString(),
            new StringBuilder("raisethysword").reverse().toString());
    for (int i = 1; i < args.length; i += 2) {
        printDistance(args[i - 1], args[i]);
    }


 }
}
Run Code Online (Sandbox Code Playgroud)