Java 8: most efficient way to return duplicates from a list (without removing them)?

Question

Jon*_*lin 5 java optimization list duplicates java-stream

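I have an ArrayList of strings, and I want to find and return every value that occurs more than once in the list. Most questions ask for the opposite (removing duplicates, as with distinct()), so example code has been hard to find.

I was able to come up with this: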

public synchronized List<String> listMatching(List<String> allStrings) {

    long startTime = System.currentTimeMillis();

    List<String> duplicates = allStrings.stream().filter(string -> Collections.frequency(allStrings, string) > 1)
            .collect(Collectors.toList());

    long stopTime = System.currentTimeMillis();
    long elapsedTime = stopTime - startTime;
    LOG.info("Time for Collections.frequency(): "+ elapsedTime);

    return duplicates;
}

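But this uses Collections.frequency, which loops over the entire list for every item and counts each occurrence. On my current list of about 4,000 strings it takes about 150 ms to run. That is a bit slow for me, and it will only get worse as the list grows. I took the frequency method and rewrote it to return as soon as it sees a second occurrence: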

protected boolean moreThanOne(Collection<?> c, Object o) {
    boolean found = false;
    if (o != null) {
        for (Object e : c) {
            if (o.equals(e)) {
                if (found) {
                    // second match: stop scanning early
                    return true;
                }
                found = true;
            }
        }
    }
    // reaching this point means fewer than two matches were seen
    return false;
}

And changed my method to use it:

public synchronized List<String> listMatching(List<String> allStrings) {

    long startTime = System.currentTimeMillis();

    List<String> duplicates = allStrings.stream().filter(string -> moreThanOne(allStrings, string))
            .collect(Collectors.toList());

    long stopTime = System.currentTimeMillis();
    long elapsedTime = stopTime - startTime;
    LOG.info("Time for moreThanOne(): " + elapsedTime);

    return duplicates;
}

This seems to work as expected, but it did not improve the speed as much as I had hoped, clocking in at about 120 ms. That is probably because it still has to walk the list for every item, but I am not sure how to avoid that and still get the job done.

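I know this may look like premature optimization, but my list could easily reach 1 million+ entries, and this method is a critical part of my application that affects the timing of everything else.

Is there any way to optimize this code further? Perhaps with some kind of fancy predicate? A completely different approach?

EDIT: Thanks to all your suggestions, I was able to come up with something much faster: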

public synchronized Set<String> listMatching(List<String> allStrings) {
    Set<String> allItems = new HashSet<>();
    Set<String> duplicates = allStrings.stream()
            .filter(string -> !allItems.add(string))
            .collect(Collectors.toSet());
    return duplicates;
}

Run under the same conditions, this gets through my list in under 5 ms. All the HashMap suggestions would have been great if I needed the counts. I am not sure why the Collections.frequency() method does not use this technique.
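
For the case where the counts are needed as well, a map-based version might look like the sketch below. It is not code from this thread: the class name DuplicateCounts and the method duplicateCounts are made up for illustration, and it assumes the list contains no null elements, since Collectors.groupingBy rejects null keys.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class DuplicateCounts {

    // Hypothetical helper (not from the original post): counts every string in a
    // single pass, then drops the entries that occur only once.
    // Assumption: allStrings contains no null elements.
    public static Map<String, Long> duplicateCounts(List<String> allStrings) {
        Map<String, Long> counts = allStrings.stream()
                .collect(Collectors.groupingBy(
                        Function.identity(),
                        LinkedHashMap::new,      // keeps first-seen order of the keys
                        Collectors.counting()));
        counts.values().removeIf(count -> count < 2);
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "a", "c", "b", "a");
        System.out.println(duplicateCounts(input)); // prints {a=3, b=2}
    }
}
```

Like the HashSet version, this touches each element a constant number of times on average, so the running time should grow roughly linearly with the list size rather than quadratically.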

Answer 1

dsp*_*ano 5

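A simple way to find duplicates is to iterate over the list and add each item to a separate temporary set with the add() method. If the item is already in the set, add() returns false.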

public synchronized Set<String> listMatching(List<String> allStrings) {
   Set<String> tempSet = new HashSet<>();     // every value seen so far
   Set<String> duplicates = new HashSet<>();  // values seen at least twice

   allStrings.forEach( item -> {
       // add() returns false when the item is already in tempSet
       if (!tempSet.add(item)) duplicates.add(item);
   });

   return duplicates;
}
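
Note that the duplicates set built above holds each repeated value only once. If every occurrence should be returned in its original order, as the Collections.frequency version in the question does, a second pass over the list can filter by membership in that set. This is a sketch under that assumption; AllOccurrences and allOccurrencesOfDuplicates are hypothetical names, not part of the answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AllOccurrences {

    // Hypothetical variant: returns every occurrence of each repeated value,
    // in the order the values appear in the input list.
    public static List<String> allOccurrencesOfDuplicates(List<String> allStrings) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new HashSet<>();

        // First pass: collect the values that appear more than once.
        for (String item : allStrings) {
            if (!seen.add(item)) {
                duplicates.add(item);
            }
        }

        // Second pass: keep every element whose value is a known duplicate.
        List<String> result = new ArrayList<>();
        for (String item : allStrings) {
            if (duplicates.contains(item)) {
                result.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "a", "c", "b", "a");
        System.out.println(allOccurrencesOfDuplicates(input)); // prints [a, b, a, b, a]
    }
}
```

Both passes rely on hash lookups, so the total work stays proportional to the list size on average.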

Archived: 8 years, 5 months ago
Views: 7,604
Last recorded: 8 years, 5 months ago