Scala file hashing

Question

I wrote the following program as a quick experiment in removing duplicate files by using their MD5 hashes:

import java.nio.file.{Files, Paths}
import java.security.MessageDigest

object Test {

  def main(args: Array[String]): Unit = {

    // Time reading the whole file into memory
    val startTime = System.currentTimeMillis()
    val byteArray = Files.readAllBytes(Paths.get("/Users/amir/pgns/bigPGN.pgn"))
    val endTime = System.currentTimeMillis()
    println("Read file into " + byteArray.length + " bytes in " + (endTime - startTime) + " ms")

    // Time hashing the in-memory bytes (uses its own start/end timestamps)
    val startTimeHash = System.currentTimeMillis()
    val hash = MessageDigest.getInstance("MD5").digest(byteArray)
    val endTimeHash = System.currentTimeMillis()
    println("Hashed file into " + hash.map("%02x".format(_)).mkString + " in " + (endTimeHash - startTimeHash) + " ms")
  }
}


I noticed that when my PGN file contains about 1.5 GB of text data, reading the file takes about 2.5 seconds and hashing it takes about 2.5 seconds.

My question is: if I have a large number of files, is there a faster way to do this?

Answer 1

Ale*_*lec 5

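Yes, there is: don't read the whole file into memory! Here is something that should in theory be faster, although I don't have any huge files to test it on: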

import java.security.{MessageDigest, DigestInputStream}
import java.io.{File, FileInputStream}

// Compute a hash of a file
// The output of this function should match the output of running "md5 -q <file>"
def computeHash(path: String): String = {
  val buffer = new Array[Byte](8192)
  val md5 = MessageDigest.getInstance("MD5")

  val dis = new DigestInputStream(new FileInputStream(new File(path)), md5)
  try { while (dis.read(buffer) != -1) { } } finally { dis.close() }

  md5.digest.map("%02x".format(_)).mkString
}


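If everything behaves the way I think it does, this avoids holding all of the bytes in memory: as it reads each chunk, it feeds it straight into the hash. Note that you can increase the buffer size to speed things up...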
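To make the buffer-size remark concrete, here is a minimal sketch of the same streaming idea with the buffer size exposed as a parameter. The names `DedupSketch`, `md5Hex`, and `duplicates`, and the 1 MiB default, are assumptions for illustration and not part of the answer above. Grouping files by size before hashing is an extra step I am assuming is acceptable for the deduplication use case; it has not been benchmarked here.

```scala
import java.nio.file.{Files, Path}
import java.security.MessageDigest

object DedupSketch {

  // Stream a file through MD5 in chunks so the whole file is never held in memory.
  // bufferSize is a tunable assumption (default 1 MiB); measure before settling on a value.
  def md5Hex(path: Path, bufferSize: Int = 1 << 20): String = {
    val md5 = MessageDigest.getInstance("MD5")
    val buffer = new Array[Byte](bufferSize)
    val in = Files.newInputStream(path)
    try {
      var n = in.read(buffer)
      while (n != -1) {
        md5.update(buffer, 0, n) // feed only the bytes actually read
        n = in.read(buffer)
      }
    } finally in.close()
    md5.digest().map("%02x".format(_)).mkString
  }

  // Files of different sizes cannot be identical, so only files that share a size
  // are hashed. Returns the groups of paths whose size and MD5 both match.
  def duplicates(paths: Seq[Path]): Seq[Seq[Path]] =
    paths.groupBy(p => Files.size(p)).values
      .filter(_.size > 1)
      .flatMap(sameSize => sameSize.groupBy(p => md5Hex(p)).values.filter(_.size > 1))
      .toSeq
}
```

Calling `DedupSketch.duplicates(paths)` on a list of candidate files returns only the groups that contain more than one file, so files with a unique size are never read at all.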
