How can I calculate the hash of a filesystem directory using Python?

Question

I'm using this code to calculate the hash value of a file:

import hashlib

m = hashlib.md5()
with open("calculator.pdf", 'rb') as fh:
    # Read the file in 8 KiB chunks so large files are not loaded at once
    while True:
        data = fh.read(8192)
        if not data:
            break
        m.update(data)
hash_value = m.hexdigest()

print(hash_value)

When I tried it on a folder named "folder", I got:

IOError: [Errno 13] Permission denied: folder

How can I calculate the hash value of a folder?

Answer 1

Man*_*hit 17

The checksumdir Python package can be used to calculate the checksum/hash of a directory. It is available at https://pypi.python.org/pypi/checksumdir/1.0.5

Usage:

import checksumdir
hash = checksumdir.dirhash("c:\\temp")
print(hash)

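Note: "checksumdir" is untested (even though it claims to be stable). Using it is (arguably) "less reliable" than using the recipe, which at least forces you to read the recipe. (3 upvotes)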

Answer 2

小智 9

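Here is an implementation that uses pathlib.Path instead of relying on os.walk. It sorts the directory contents before iterating, so it should be repeatable across platforms. It also updates the hash with the names of the files/directories, so adding empty files and directories will change the hash.

Version with type annotations (Python 3.6 or above):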

import hashlib
from _hashlib import HASH as Hash
from pathlib import Path
from typing import Union


def md5_update_from_file(filename: Union[str, Path], hash: Hash) -> Hash:
    assert Path(filename).is_file()
    with open(str(filename), "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash.update(chunk)
    return hash


def md5_file(filename: Union[str, Path]) -> str:
    return str(md5_update_from_file(filename, hashlib.md5()).hexdigest())


def md5_update_from_dir(directory: Union[str, Path], hash: Hash) -> Hash:
    assert Path(directory).is_dir()
    for path in sorted(Path(directory).iterdir(), key=lambda p: str(p).lower()):
        hash.update(path.name.encode())
        if path.is_file():
            hash = md5_update_from_file(path, hash)
        elif path.is_dir():
            hash = md5_update_from_dir(path, hash)
    return hash


def md5_dir(directory: Union[str, Path]) -> str:
    return str(md5_update_from_dir(directory, hashlib.md5()).hexdigest())

Without type annotations:

import hashlib
from pathlib import Path


def md5_update_from_file(filename, hash):
    assert Path(filename).is_file()
    with open(str(filename), "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash.update(chunk)
    return hash


def md5_file(filename):
    return md5_update_from_file(filename, hashlib.md5()).hexdigest()


def md5_update_from_dir(directory, hash):
    assert Path(directory).is_dir()
    for path in sorted(Path(directory).iterdir()):
        hash.update(path.name.encode())
        if path.is_file():
            hash = md5_update_from_file(path, hash)
        elif path.is_dir():
            hash = md5_update_from_dir(path, hash)
    return hash


def md5_dir(directory):
    return md5_update_from_dir(directory, hashlib.md5()).hexdigest()

Condensed version, if you only need to hash directories:

import hashlib
from pathlib import Path


def md5_update_from_dir(directory, hash):
    assert Path(directory).is_dir()
    for path in sorted(Path(directory).iterdir(), key=lambda p: str(p).lower()):
        # Include the entry name so empty files and directories affect the hash
        hash.update(path.name.encode())
        if path.is_file():
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(4096), b""):
                    hash.update(chunk)
        elif path.is_dir():
            hash = md5_update_from_dir(path, hash)
    return hash


def md5_dir(directory):
    return md5_update_from_dir(directory, hashlib.md5()).hexdigest()

Usage: md5_hash = md5_dir("/some/directory")
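
The claim that hashing the entry names makes empty files visible can be checked with a small experiment. The sketch below is not the answer's code: `dir_digest` is a hypothetical helper that uses `Path.rglob` instead of recursion and reads each file whole for brevity, which is only reasonable for small files.

```python
import hashlib
import tempfile
from pathlib import Path


def dir_digest(directory, include_names):
    """Hypothetical helper: MD5 over sorted entries, names optional."""
    digest = hashlib.md5()
    for path in sorted(Path(directory).rglob("*")):
        if include_names:
            digest.update(path.name.encode())
        if path.is_file():
            digest.update(path.read_bytes())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_bytes(b"hello")
    names_before = dir_digest(tmp, include_names=True)
    plain_before = dir_digest(tmp, include_names=False)

    # Add an empty file: its contents contribute no bytes to the hash.
    (Path(tmp) / "empty.txt").touch()
    names_after = dir_digest(tmp, include_names=True)
    plain_after = dir_digest(tmp, include_names=False)
```

With names included, the digest changes after the empty file is added; with contents only, it stays the same.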

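Sorting the files is important for repeatability, because neither Path.iterdir nor os.walk guarantees any particular order, and both depend on the underlying operating system implementation. However, sorting only by the case-insensitive path is not enough: the sort leaves entries with equal keys in their incoming order, so if two folders on Linux differ only by case, the result can vary from run to run depending on the order iterdir/os.walk happens to return. The correct solution is to sort by the case-sensitive name first and then by the case-insensitive name. @danmou maybe you can update this? (2 upvotes)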
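
The ordering fix suggested in this comment can be sketched with a tuple key, which has the same effect as the two-pass sort it describes. This is only an illustration: `stable_order` is a hypothetical helper name, not part of the answer, and it compares entry names rather than full paths.

```python
from pathlib import PurePosixPath


def stable_order(paths):
    """Sort case-insensitively, using the exact name as a tie-breaker.

    Hypothetical helper: the tuple key makes the result independent of
    the order in which the entries arrive.
    """
    return sorted(paths, key=lambda p: (p.name.lower(), p.name))


# Two listings of the same entries in different arrival orders.
first = [PurePosixPath("Data"), PurePosixPath("data"), PurePosixPath("b.txt")]
second = [PurePosixPath("data"), PurePosixPath("Data"), PurePosixPath("b.txt")]

# A case-insensitive key alone keeps ties in arrival order (the sort is stable).
by_lower_first = sorted(first, key=lambda p: p.name.lower())
by_lower_second = sorted(second, key=lambda p: p.name.lower())
```

Here `stable_order(first)` and `stable_order(second)` agree, while the two `by_lower_*` lists differ in the position of `Data` and `data`.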

Answer 3

And*_*ndy 8

This recipe provides a nice function to do what you're asking. I've modified it to use an MD5 hash instead of SHA1, as your original question asks.

def GetHashofDirs(directory, verbose=0):
  import hashlib, os
  SHAhash = hashlib.md5()
  if not os.path.exists(directory):
    return -1

  try:
    for root, dirs, files in os.walk(directory):
      for names in files:
        if verbose == 1:
          print('Hashing', names)
        filepath = os.path.join(root, names)
        try:
          f1 = open(filepath, 'rb')
        except OSError:
          # The file could not be opened for some reason; skip it
          continue

        while 1:
          # Read the file in small chunks
          buf = f1.read(4096)
          if not buf: break
          # Feed the hex digest of each chunk (as bytes) into the overall hash
          SHAhash.update(hashlib.md5(buf).hexdigest().encode())
        f1.close()

  except Exception:
    import traceback
    # Print the stack traceback
    traceback.print_exc()
    return -2

  return SHAhash.hexdigest()

You can use it like this:

print(GetHashofDirs('folder_to_hash', 1))

The output looks like this, as it hashes each file:

...
Hashing file1.cache
Hashing text.txt
Hashing library.dll
Hashing vsfile.pdb
Hashing prog.cs
5be45c5a67810b53146eaddcae08a809

The value returned by this function call is the hash. In this case, 5be45c5a67810b53146eaddcae08a809.

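Simply ignoring files that cannot be opened does not sound right to me. Also, [you are not guaranteed](http://stackoverflow.com/a/18282401/2436175) that os.walk will traverse the files in the same order on, for example, different file systems. (2 upvotes)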
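
One way to address both points in this comment is to sort what os.walk yields and to let errors propagate instead of skipping files. The sketch below assumes that behavior is wanted; `hash_tree_sorted` and `_raise` are hypothetical names, not part of the recipe. Unlike the recipe, it also feeds file names into the hash.

```python
import hashlib
import os


def _raise(error):
    # os.walk passes listing errors here; re-raise instead of ignoring them.
    raise error


def hash_tree_sorted(directory):
    """Hash file names and contents under directory in a fixed order.

    Hypothetical sketch: unreadable files or directories raise OSError
    instead of being skipped silently.
    """
    digest = hashlib.md5()
    for root, dirs, files in os.walk(directory, onerror=_raise):
        # In-place sort: top-down os.walk descends in this order.
        dirs.sort()
        for name in sorted(files):
            digest.update(name.encode())
            with open(os.path.join(root, name), "rb") as f:
                for chunk in iter(lambda: f.read(4096), b""):
                    digest.update(chunk)
    return digest.hexdigest()
```

Sorting `dirs` in place works because os.walk, in its default top-down mode, consults that list when deciding where to descend next.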

Answer 4

Bry*_*ell 5

I don't like how the recipe referenced in the answer was written. I have a much simpler version that I'm using:

import hashlib
import os


def hash_directory(path):
    digest = hashlib.sha1()

    for root, dirs, files in os.walk(path):
        for names in files:
            file_path = os.path.join(root, names)

            # Hash the path and add to the digest to account for empty files/directories
            digest.update(hashlib.sha1(file_path[len(path):].encode()).digest())

            # Per @pt12lol - if the goal is uniqueness over repeatability, this is an alternative method using 'hash'
            # digest.update(str(hash(file_path[len(path):])).encode())

            if os.path.isfile(file_path):
                with open(file_path, 'rb') as f_obj:
                    while True:
                        buf = f_obj.read(1024 * 1024)
                        if not buf:
                            break
                        digest.update(buf)

    return digest.hexdigest()

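I found that exceptions were usually thrown whenever something like an alias was encountered (it shows up in os.walk(), but you can't open it directly). The os.path.isfile() check takes care of those cases.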
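
A comparable case can be reproduced on POSIX systems with a dangling symbolic link. This is an assumption made for illustration; the answer does not say which kind of alias was involved, and creating symlinks on Windows may require extra privileges.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Assumption: a dangling symlink stands in for the "alias" case.
    link = os.path.join(tmp, "dangling")
    os.symlink(os.path.join(tmp, "no_such_target"), link)

    # os.walk reports the link among the files of its directory...
    listed = [name for _, _, files in os.walk(tmp) for name in files]
    # ...but isfile() follows the link and finds no regular file.
    link_is_file = os.path.isfile(link)
```

The link appears in `listed`, yet `link_is_file` is False, so the isfile() check skips it before any attempt to open it.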

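If there is an actual file in the directory I'm trying to hash and it can't be opened, skipping that file and continuing is not a good solution, because that changes the resulting hash. It is better to abort the hash attempt altogether. Here, the try statement is wrapped around the call to my hash_directory() function.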

>>> try:
...   print(hash_directory('/tmp'))
... except:
...   print('Failed!')
... 
e2a075b113239c8a25c7e1e43f21e8f2f6762094
>>>

Archived:	11 years, 4 months ago
Views:	14079
Last activity:	6 years ago