查找字符串列表中最长公共前缀/后缀的出现次数？

Question

查找字符串列表中最长公共前缀/后缀的出现次数？

Lev*_*tts 9 java string algorithm hashmap

给定一个字符串列表：

ArrayList<String> strList = new ArrayList<String>();
strList.add("Mary had a little lamb named Willy");
strList.add("Mary had a little ham");
strList.add("Old McDonald had a farm named Willy");
strList.add("Willy had a little dog named ham");
strList.add("(abc)");
strList.add("(xyz)");
strList.add("Visit Target Store");
strList.add("Visit Walmart Store");

Run Code Online (Sandbox Code Playgroud)

HashMap<String, Integer> prefixMap这应该产生 a和形式的输出suffixMap：

字首：

Mary had a -> 2
Mary had a little -> 2
( -> 2
Visit -> 2

Run Code Online (Sandbox Code Playgroud)

后缀：

named Willy -> 2
ham -> 2
) -> 2
Store -> 2

Run Code Online (Sandbox Code Playgroud)

到目前为止，我可以使用以下代码生成列表中所有项目中都存在的前缀：

public static final int INDEX_NOT_FOUND = -1;

public static String getAllCommonPrefixesInList(final String... strs) {
    if (strs == null || strs.length == 0) {
        return EMPTY_STRING;
    }
    
    
    final int smallestIndexOfDiff = getIndexOfDifference(strs);
    if (smallestIndexOfDiff == INDEX_NOT_FOUND) {
        
        // All Strings are identical
        if (strs[0] == null) {
            return EMPTY_STRING;
        }
        return strs[0];
    } else if (smallestIndexOfDiff == 0) {
        
        
        // No common initial characters found, return empty String
        return EMPTY_STRING;
    } else {
        
        // Common initial character sequence found, return sequence
        return strs[0].substring(0, smallestIndexOfDiff);
    }
}






public static int getIndexOfDifference(final CharSequence... charSequence) {
    if (charSequence == null || charSequence.length <= 1) {
        return INDEX_NOT_FOUND;
    }
    boolean isAnyStringNull = false;
    boolean areAllStringsNull = true;
    
    
    final int arrayLen = charSequence.length;
    int shortestStrLen = Integer.MAX_VALUE;
    int longestStrLen = 0;

    // Find the min and max string lengths - avoids having to check that we are not exceeding the length of the string each time through the bottom loop.
    for (int i = 0; i < arrayLen; i++) {
        if (charSequence[i] == null) {
            isAnyStringNull = true;
            shortestStrLen = 0;
        } else {
            areAllStringsNull = false;
            shortestStrLen = Math.min(charSequence[i].length(), shortestStrLen);
            longestStrLen = Math.max(charSequence[i].length(), longestStrLen);
        }
    }

    // Deals with lists containing all nulls or all empty strings
    
    if (areAllStringsNull || longestStrLen == 0 && !isAnyStringNull) {
        return INDEX_NOT_FOUND;
    }

    // Handle lists containing some nulls or some empty strings
    if (shortestStrLen == 0) {
        return 0;
    }

    // Find the position with the first difference across all strings
    int firstDiff = -1;
    for (int stringPos = 0; stringPos < shortestStrLen; stringPos++) {
        final char comparisonChar = charSequence[0].charAt(stringPos);
        for (int arrayPos = 1; arrayPos < arrayLen; arrayPos++) {
            if (charSequence[arrayPos].charAt(stringPos) != comparisonChar) {
                firstDiff = stringPos;
                break;
            }
        }
        if (firstDiff != -1) {
            break;
        }
    }

    if (firstDiff == -1 && shortestStrLen != longestStrLen) {
        
        // We compared all of the characters up to the length of the
        // shortest string and didn't find a match, but the string lengths
        // vary, so return the length of the shortest string.
        return shortestStrLen;
    }
    return firstDiff;
}

Run Code Online (Sandbox Code Playgroud)

但是，我的目标是将至少出现的任何前缀/后缀2+包含到结果映射中。

如何实现这一点Java？

Answer 1

Ale*_*nko 4

根据我对这个问题的理解，最适合解决它的数据结构是非循环脱节图。

一般情况下，一个图将由几个不相连的簇组成。每个簇都有一个树状结构，在边缘情况下它将形成一个链表。

基本上，解决这个问题的最简单的方法是根据每一行创建一堆链表，然后迭代它们。缺点是：节点重复（内存消耗更大）、时间复杂度更高（需要更多操作）并且更容易出错，因为需要更多的手动操作。

图表的描述

因此，我将坚持使用图表作为该问题的数据结构，并尝试使事情尽可能简单。

让我们考虑以下输入：

"Mary had a little lamb named Willy"
"Mary had a little ham"
"A B C"

Run Code Online (Sandbox Code Playgroud)

该图的图形表示如下所示；

前两行将构成一个由链表（头部）和树（尾部）组成的簇。第二个簇将由一个链表表示，它的顶点不与其他字符串形成的顶点连接。

这不是构造顶点的唯一方法，头部可以生成N 树，并且可以在中间的某个位置观察到链表。

主要要点是，为了解决这个问题，我们需要通过所有分支跟踪顶点链，直到顶点重叠。在图的这些部分中，两条或多条线中常见的每个前缀字符串和后缀字符串将由单个顶点（节点）表示。

为了维护映射到特定顶点的字符串数量，每个顶点应该有一个变量（int groupCount 在下面的代码中），该变量在创建顶点时分配一个默认值，并在每次映射新字符串时1递增。到这个顶点。

每个顶点都包含一个映射，其中保存对其邻居的引用。当添加新的邻居顶点Vertex时，基于给定字符串创建的新顶点或现有顶点的计数会增加。

为了符合此任务，图应维护对所有头顶点和尾顶点的引用。为了简单起见，在这个解决方案中，图实际上由两个图（ suffix -graph和前缀图）。因此，named 中的实现类。MultiGraph

为了用顶点填充后缀图和前缀图，方法以正常或相反的顺序addCluster()迭代给定行的字符串Iterator，具体取决于正在填充的图。

深度优先搜索

填充图表后的下一步是生成频率为2及以上的字符串映射。

为此，正在使用经典的深度优先搜索算法。

为了实现 DFS，需要一个用作堆栈ArrayDeque的可变容器（正用于此目的）。从正面/反面地图中取出的第一个元素将被放置在堆栈的顶部，并且保存该元素名称的实例StringBuilder将被放置在地图中。

然后，要恢复具有特定count 的字符串，将从堆栈顶部弹出顶点，并将其具有 count 的邻居依次放置在堆栈顶部。当前前缀的副本以及附加的分隔符和邻居的名称将被映射到neighbour-vertex。> 1

如果计数发生变化，则表明当前前缀表示至少两行之间的最长公共字符串。在这种情况下，前缀和计数将添加到结果映射中。

执行

以下实现由两个类组成，它们是狭隘的和独立的。该类MultiGraph专门充当数据结构，维护两个图。羽化代码，就像分割字符串的行一样，被提取到一个单独的类中GraphManager。

图形

"Mary had a little lamb named Willy" "Mary had a little ham" "A B C"
Run Code Online (Sandbox Code Playgroud)
下面的类处理处理输入数据的任务：它分割行，负责丢弃可能出现的空字符串，并将输入打包到 a 中Deque以方便地适应两个方向的迭代。

它还实例化图表并管理其工作。GraphManager负责为图形提供分隔符，以便在创建结果映射时恢复字符串的初始形状。这样，您可以在空白处分割给定的行，通过空字符串逐个字符地处理行，或者通过标点符号来分割，而无需更改这两个类中的任何一行。

图形管理器

public class MultiGraph { private final Map<String, Vertex> heads = new HashMap<>(); private final Map<String, Vertex> tails = new HashMap<>(); public void addCluster(Deque<String> names) { addCluster(heads, names.iterator()); addCluster(tails, names.descendingIterator()); } private void addCluster(Map<String, Vertex> clusters, Iterator<String> names) { String rootName = names.next(); if (clusters.containsKey(rootName)) { clusters.get(rootName).incrementGroupCount(); } else { clusters.put(rootName, new Vertex(rootName)); } Vertex current = clusters.get(rootName); while (names.hasNext()) { current = current.addNext(names.next()); } } public Map<String, Integer> generatePrefixMap(String delimiter) { Map<String, Integer> countByPrefix = new HashMap<>(); for (Vertex next: heads.values()) { if (next.getGroupCount() == 1) { continue; } performDFS(heads, countByPrefix, delimiter, next); } return countByPrefix; } public Map<String, Integer> generateSuffixMap(String delimiter) { Map<String, Integer> countBySuffix = new HashMap<>(); for (Vertex next: tails.values()) { if (next.getGroupCount() == 1) { continue; } performDFS(tails, countBySuffix, delimiter, next); } return countBySuffix; } // implementation of the Depth First Search algorithm public void performDFS(Map<String, Vertex> clusters, Map<String, Integer> countByPrefix, String delimiter, Vertex next) { StringBuilder prefix = null; Vertex current = next; int count = next.getGroupCount(); Deque<Vertex> stack = new ArrayDeque<>(); // create as stack Map<Vertex, StringBuilder> prefixByVert = new HashMap<>(); stack.push(next); // place the first element on the stack prefixByVert.put(current, new StringBuilder(current.getName())); while (!stack.isEmpty()) { current = stack.pop(); if (current.getGroupCount() < count) { // the number of strings mapped to the current Vertex has been changed countByPrefix.put(prefix.toString(), count); // saving the result count = current.getGroupCount(); } prefix = prefixByVert.get(current); for (Vertex neighbour: current.getNextVertByVal().values()) { if (next.getGroupCount() == 1) { continue; } stack.push(neighbour); prefixByVert.put(neighbour, new StringBuilder(prefix) .append(delimiter) .append(neighbour.getName())); } } if (prefix != null && count > 1) { countByPrefix.putIfAbsent(prefix.toString(), count); } } private static class Vertex { private final String name; private int groupCount = 1; private final Map<String, Vertex> nextVertByVal = new HashMap<>(); public Vertex(String name) { this.name = name; } public Vertex addNext(String value) { if (nextVertByVal.containsKey(value)) { nextVertByVal.get(value).incrementGroupCount(); } else { nextVertByVal.put(value, new Vertex(value)); } return nextVertByVal.get(value); } public void incrementGroupCount() { this.groupCount++; } public String getName() { return name; } public int getGroupCount() { return groupCount; } public Map<String, Vertex> getNextVertByVal() { return nextVertByVal; } } }
Run Code Online (Sandbox Code Playgroud)
主要的（）

public static void main(String[] args) { List<String> lines = List.of( "Mary had a little lamb named Willy", "Mary had a little ham", "Old McDonald had a farm named Willy", "Willy had a little dog named ham", "( abc )", "( xyz )", "Visit Target Store", "Visit Walmart Store"); GraphManager gm = GraphManager.getInstance(lines, " "); System.out.println("Prefixes:"); for (Map.Entry<String, Integer> entry: gm.getPrefixMap().entrySet()) { System.out.println(entry.getValue() + " " + entry.getKey()); } System.out.println("\nSuffixes:"); for (Map.Entry<String, Integer> entry: gm.getSuffixMap().entrySet()) { System.out.println(entry.getValue() + " " + entry.getKey()); } }
Run Code Online (Sandbox Code Playgroud)
输出

Prefixes: 2 Mary had a little 2 Visit 2 ( Suffixes: 2 ham 2 ) 2 Store 2 Willy named
Run Code Online (Sandbox Code Playgroud)

Abhinav 在技术上是正确的，尽管这似乎只是对原始帖子中模糊定义的分歧。OP的示例主要显示由整个单词组成的前缀，而他们的代码生成所有前缀，包括以单词中间结尾的前缀。为了方便起见，trie 数据结构通常包含空字符串作为公共根。根据定义，空字符串是所有字符串的前缀。另外，稍作更正，“簇”一词在图论中没有非常常见的含义：您描述的词通常称为“组件” (2认同)

这个答案很好地说明了，所采取方法的决定非常详细地清楚地阐明，并且所提供的实现也得到了很好的解释。我确实同意 kcsquared 的观点，即 OP 遗漏了许多细节。我也同意空集是所有集合的子集。然而，这只是正确的，因为它是这样在数学上定义的。在我看来，断言空字符串是所有字符串的有效前缀可能有点牵强。 (2认同)

归档时间：	3 年，5 月前
查看次数：	2083 次
最近记录：	3 年，5 月前