How to find the similarity percentage between two multiline strings?

Sta*_*ses 7 java algorithm levenshtein-distance

I have two multiline strings. I am using the following code to determine the similarity between the two. It uses the Levenshtein distance algorithm.

  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { 
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }

    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

  }

  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }

But the code above does not work as expected.

For example, say we have the following two strings, s1 and s2:

S1 -> How do we optimize the performance? . What should we do to compare both strings to find the percentage of similarity between both?

S2 -> How do we optimize tje performance? What should we do to compare both strings to find the percentage of similarity between both?

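I then pass the strings above to the similarity method, but it does not find the exact percentage of difference. How can I optimize the algorithm?

Update: the following is my main method.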

public static boolean authQuestion(String question) throws SQLException{


        boolean isQuestionAvailable = false;
        Connection dbCon = null;
        try {
            dbCon = MyResource.getConnection();
            String query = "SELECT * FROM WORDBANK where WORD ~*  ?;";
            PreparedStatement checkStmt = dbCon.prepareStatement(query);
            checkStmt.setString(1, question);
            ResultSet rs = checkStmt.executeQuery();
            while (rs.next()) {
                double re=similarity( rs.getString("question"), question);
                if(re  > 0.6){
                    isQuestionAvailable = true;
                }else {
                    isQuestionAvailable = false;
                }
            }
        } catch (URISyntaxException e1) {
            e1.printStackTrace();
        } catch (SQLException sqle) {
            sqle.printStackTrace();
        } catch (Exception e) {
            if (dbCon != null)
                dbCon.close();
        } finally {
            if (dbCon != null)
                dbCon.close();
        }

        return isQuestionAvailable;
    }

Dan*_*iel 5

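Here is an approach I can suggest...

You are using the edit distance, which gives you the number of characters in S1 that need to be changed/added/removed to turn it into S2.

So, for example: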

S1 = "abc"
S2 = "cde"

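The edit distance is 3, and the strings are 100% different (given that you are comparing them character by character).

So you can get an approximate percentage of difference if you do this: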

S1 = "abc"
S2 = "cde"
edit = edit_distance(S1, S2)
percentage = min(edit/S1.length(), edit/S2.length())
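If the formula above is written in Java, the divisions need to happen in floating point, because dividing one int by another truncates (3 / 7 would give 0). Below is a minimal, self-contained sketch of the idea. The class and method names (DifferenceDemo, levenshtein, percentDifference) are made up for illustration, and the distance function is a plain two-row Levenshtein implementation rather than the code from the question.

```java
final class DifferenceDemo {

    // Two-row dynamic-programming Levenshtein distance (case-sensitive).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j chars of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i chars of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            // The row just filled becomes the "previous" row for the next iteration.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Fraction of difference between 0.0 (identical) and 1.0 (100% different).
    static double percentDifference(String a, String b) {
        if (a.isEmpty() || b.isEmpty()) {
            // Assumption of this sketch: two empty strings are identical,
            // and an empty string is 100% different from a non-empty one.
            return (a.isEmpty() && b.isEmpty()) ? 0.0 : 1.0;
        }
        int edit = levenshtein(a, b);
        // Cast to double so the divisions are not truncated to integers.
        return Math.min((double) edit / a.length(), (double) edit / b.length());
    }

    public static void main(String[] args) {
        System.out.println(percentDifference("abc", "cde"));        // 1.0, i.e. 100% different
        System.out.println(percentDifference("kitten", "sitting")); // 3 edits over 7 characters
    }
}
```

Note that this sketch compares characters case-sensitively, whereas the editDistance method in the question lowercases both strings first.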

The min is a workaround for cases where the strings are very different, for example:

S1 = "abc"
S2 = "defghijklmno"

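Here the edit distance is greater than the length of S1, so dividing by that length would give a percentage above 100%; dividing by the larger length is therefore probably the better choice.

Hope that helps.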
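To make the arithmetic concrete: "abc" and "defghijklmno" share no characters, so the edit distance is 12 (3 substitutions plus 9 insertions). The sketch below takes that distance as a given instead of computing it, and the class and method names are hypothetical. It shows that dividing by the shorter length overshoots 100%, and that taking the min of the two ratios gives the same result as dividing by the longer length.

```java
final class NormalizationDemo {

    // Dividing by the shorter length can exceed 1.0 (more than 100%).
    static double byShorter(int edit, int len1, int len2) {
        return (double) edit / Math.min(len1, len2);
    }

    // Dividing by the longer length keeps the result between 0.0 and 1.0.
    static double byLonger(int edit, int len1, int len2) {
        return (double) edit / Math.max(len1, len2);
    }

    // The min of the two ratios, as in the formula above.
    static double minOfRatios(int edit, int len1, int len2) {
        return Math.min((double) edit / len1, (double) edit / len2);
    }

    public static void main(String[] args) {
        int len1 = "abc".length();          // 3
        int len2 = "defghijklmno".length(); // 12
        int edit = 12;                      // assumed edit distance, see the text above

        System.out.println(byShorter(edit, len1, len2));   // 4.0, i.e. 400%
        System.out.println(byLonger(edit, len1, len2));    // 1.0, i.e. 100%
        System.out.println(minOfRatios(edit, len1, len2)); // 1.0, same as byLonger
    }
}
```

The similarity method in the question already normalizes this way: it returns (longerLength - editDistance) / longerLength, which is 1 minus the difference computed by dividing by the longer length.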