How is the git hash calculated?

The*_*hat 21 git hash

I am trying to understand how git calculates the hash of refs.

$ git ls-remote https://github.com/git/git  

....
29932f3915935d773dc8d52c292cadd81c81071d    refs/tags/v2.4.2
9eabf5b536662000f79978c4d1b6e4eff5c8d785    refs/tags/v2.4.2^{}
....

Clone the repo locally, then inspect the refs/tags/v2.4.2^{} ref by its SHA:

$ git cat-file -p 9eabf5b536662000f79978c4d1b6e4eff5c8d785 

tree 655a20f99af32926cbf6d8fab092506ddd70e49c
parent df08eb357dd7f432c3dcbe0ef4b3212a38b4aeff
author Junio C Hamano <gitster@pobox.com> 1432673399 -0700
committer Junio C Hamano <gitster@pobox.com> 1432673399 -0700

Git 2.4.2

Signed-off-by: Junio C Hamano <gitster@pobox.com>

Copy the decompressed content so that we can hash it. (AFAIK git uses the uncompressed version when hashing.)

git cat-file -p 9eabf5b536662000f79978c4d1b6e4eff5c8d785 > fi

Let's SHA-1 the content using git's own hash command:

git hash-object fi
3cf741bbdbcdeed65e5371912742e854a035e665

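Why is the output not [9e]abf5b536662000f79978c4d1b6e4eff5c8d785? I understand the first two characters (9e) to be the length in hex. How should I hash the content of fi so that I get the git ref abf5b536662000f79978c4d1b6e4eff5c8d785?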

Mat*_*aun 37

This video by John Williams gives an overview of what data goes into the calculation of a Git commit hash. Here's a screenshot from the video:

[Screenshot: Git tree]

Reimplementing the commit hash without Git

To get a deeper understanding of this aspect of Git, I reimplemented the steps that produce a Git commit hash in Rust, without using Git. It currently works for getting the hash when committing a single file. The answers here were helpful in achieving this, thanks.

The source code of this answer is available here. Execute it with cargo run.

These are the individual pieces of data we need to compute to arrive at a Git commit hash:

  1. The object ID of the file, which involves hashing the file contents with SHA-1. In Git, hash-object provides this ID.
  2. The object entries that go into the tree object. In Git, you can get an idea of those entries with ls-tree, but their format in the tree object is slightly different: [mode] [file name]\0[object ID]
  3. The hash of the tree object which has the form: tree [size of object entries]\0[object entries]. In Git, get the tree hash with: git cat-file commit HEAD | head -n1
  4. The commit hash by hashing the data you see with cat-file. This includes the tree object hash and commit information like author, time, commit message, and the parent commit hash if it's not the first commit.

Each step depends on the previous one. Let's start with the first.

Get the object ID of the file

The first step is to reimplement Git's hash-object, as in git hash-object your_file.

We create the object hash from our file by concatenating and hashing these data:

  • The string "blob " at the beginning (mind the trailing space), followed by
  • the number of bytes in the file, followed by
  • a null byte, expressed with \0 in printf and Rust, followed by
  • the file content.

In Bash:

file_name="your_file"
# cat streams the file unchanged; "$(cat ...)" would strip trailing newlines and change the hash
(printf 'blob %s\0' "$(wc -c < "$file_name" | tr -d ' ')"; cat "$file_name") | sha1sum

In Rust:

// Get the object ID
fn git_hash_object(file_content: &[u8]) -> Vec<u8> {
    let file_size = file_content.len().to_string();
    let hash_input = [
        "blob ".as_bytes(),
        file_size.as_bytes(),
        b"\0",
        file_content,
    ]
    .concat();
    to_sha1(&hash_input)
}

I'm using crate sha1 version 0.10.5 in to_sha1:

fn to_sha1(hash_me: &[u8]) -> Vec<u8> {
    use sha1::{Digest, Sha1};

    let mut hasher = Sha1::new();
    hasher.update(hash_me);
    hasher.finalize().to_vec()
}
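The blob format described above can be checked against a well-known value. The following is a minimal shell sketch, not part of the original answer: git_blob_hash is a hypothetical helper name, and it assumes sha1sum from GNU coreutils is available (on macOS, shasum would be the usual substitute). Hashing the 12 bytes "hello world" plus newline should give the same ID that git hash-object reports for that content.

```shell
# Hypothetical helper (not part of Git): hash a file the way Git hashes a blob.
git_blob_hash() {
    # `tr -d ' '` drops the padding that some `wc` implementations print.
    size=$(wc -c < "$1" | tr -d ' ')
    # Header "blob <size>" plus a NUL byte, followed by the raw file bytes.
    { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d ' ' -f 1
}

sample=$(mktemp)
printf 'hello world\n' > "$sample"
git_blob_hash "$sample"   # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
rm -f "$sample"
```

Streaming the file with cat instead of storing it in a variable keeps trailing newlines and NUL bytes intact, both of which affect the hash.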

Get the object entry of the file

Object entries are part of Git's tree object. Tree objects represent files and directories.

Object entries for files have this form: [mode] [file name]\0[object ID]

We assume the file is a regular, non-executable file, which translates to mode 100644 in Git. See this for more on modes.

This Rust function takes the result of the previous function git_hash_object as the parameter object_id:

fn object_entry(file_name: &str, object_id: &[u8]) -> Vec<u8> {
    // It's a regular, non-executable file
    let mode = "100644";

    // [mode] [file name]\0[object ID]
    let object_entry = [
        mode.as_bytes(),
        b" ",
        file_name.as_bytes(),
        b"\0",
        object_id,
    ]
    .concat();

    object_entry
}

I tried to write the equivalent of object_entry in Bash, but Bash variables cannot contain null bytes. There are probably ways around that limitation, but I decided for now that if I can't have variables in Bash, the code would get quite difficult to understand. Edits providing a readable Bash equivalent are welcome.
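One possible way around the null-byte limitation is to never store the entry in a variable and write it straight to a pipe or file instead. The sketch below is an illustration under that assumption, not the author's code: hex_to_bytes and object_entry are hypothetical shell helpers, and the object ID is only an example value.

```shell
# Hypothetical helper: emit the raw bytes for a string of hex digit pairs.
hex_to_bytes() {
    hex=$1
    # Refuse odd-length input, which cannot be split into byte pairs.
    [ $(( ${#hex} % 2 )) -eq 0 ] || return 1
    while [ -n "$hex" ]; do
        rest=${hex#??}         # everything after the first two hex digits
        byte=${hex%"$rest"}    # the first two hex digits
        # Convert the pair to octal, then let printf emit that single byte.
        printf "\\$(printf '%03o' "0x$byte")"
        hex=$rest
    done
}

# Hypothetical helper: write "[mode] [file name]\0[object ID]" to stdout.
object_entry() {
    # Mode 100644: a regular, non-executable file.
    printf '100644 %s\0' "$1"
    hex_to_bytes "$2"
}

# The entry is streamed to wc, so no variable ever holds the NUL byte.
object_entry "your_file" "b8c0d74ef5ccd3dab583add7b3f5367efe4bf823" | wc -c   # 37 bytes
```

The 37 bytes are 6 for the mode, 1 for the space, 9 for the file name, 1 for the null byte, and 20 for the binary object ID.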


Get the tree object hash

As mentioned above, tree objects represent files and directories in Git. You can see the hash of your tree object by running, for example, git cat-file commit HEAD | head -n1.

The tree object has this form: tree [size of object entries]\0[object entries]

In our case we only have a single object_entry, calculated in the previous step:

fn tree_object_hash(object_entry: &[u8]) -> String {
    let object_entry_size = object_entry.len().to_string();

    let tree_object = [
        "tree ".as_bytes(),
        object_entry_size.as_bytes(),
        b"\0",
        object_entry,
    ]
    .concat();

    to_hex_str(&to_sha1(&tree_object))
}

Where to_hex_str is defined as:

// Converts bytes to their hexadecimal representation.
fn to_hex_str(bytes: &[u8]) -> String {
    bytes.iter().map(|byte| format!("{byte:02x}")).collect()
}

In a Git repo, you can look at the contents of the tree object with ls-tree. For example, running git ls-tree HEAD will produce lines like these:

100644 blob b8c0d74ef5ccd3dab583add7b3f5367efe4bf823    your_file

While those lines contain the data of an object entry (the mode, the object ID, and the file name), they are in a different order and include a tab character as well as the string "blob" which is input to the object ID, not the object entry. Object entries have this form: [mode] [file name]\0[object ID]
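The tree format can be sanity-checked in the shell as well. This is a rough sketch rather than the author's implementation: hex_to_bytes and tree_hash are hypothetical helpers, sha1sum is assumed to be installed, and the entries are kept in a temporary file because they contain null bytes. A tree with no entries hashes only the header "tree 0" plus a null byte, which gives Git's well-known empty-tree ID.

```shell
# Hypothetical helper: emit the raw bytes for a string of hex digit pairs.
hex_to_bytes() {
    hex=$1
    [ $(( ${#hex} % 2 )) -eq 0 ] || return 1
    while [ -n "$hex" ]; do
        rest=${hex#??}         # everything after the first two hex digits
        byte=${hex%"$rest"}    # the first two hex digits
        printf "\\$(printf '%03o' "0x$byte")"
        hex=$rest
    done
}

# Hypothetical helper: hash a file that holds the concatenated object entries.
tree_hash() {
    size=$(wc -c < "$1" | tr -d ' ')
    # tree [size of object entries]\0[object entries]
    { printf 'tree %s\0' "$size"; cat "$1"; } | sha1sum | cut -d ' ' -f 1
}

entries=$(mktemp)

# mktemp creates an empty file, so this hashes a tree without entries.
tree_hash "$entries"   # 4b825dc642cb6eb9a060e54bf8d69288fbee4904

# One entry: [mode] [file name]\0[object ID as 20 raw bytes]
{
    printf '100644 %s\0' "your_file"
    hex_to_bytes "b8c0d74ef5ccd3dab583add7b3f5367efe4bf823"
} > "$entries"
tree_hash "$entries"

rm -f "$entries"
```

For a tree with several entries, Git also expects the entries in sorted order, which this single-file sketch does not need to handle.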


Get the commit hash

The last step creates the commit hash.

The data we hash using SHA-1 includes:

  • Tree object hash from the previous step.
  • Hash of the parent commit if the commit is not the very first one in the repo.
  • Author name and authoring date.
  • Committer name and committing date.
  • Commit message.

You can see all of that data with git cat-file commit HEAD, for example:

tree a76b2df314b47956268b0c39c88a3b2365fb87eb
parent 9881a96ab93a3493c4f5002f17b4a1ba3308b58b
author Matthias Braun <m.braun@example.com> 1625338354 +0200
committer Matthias Braun <m.braun@example.com> 1625338354 +0200

Second commit (that's the commit message)

You might have guessed that 1625338354 is a timestamp. In this case it's the number of seconds since the Unix epoch. You can convert from the date and time format of git log, such as "Sat Jul 3 20:52:34 2021", to Unix epoch seconds with date:

date --date='Sat Jul 3 20:52:34 2021' +"%s"

The time zone is denoted as +0200 in this example.

Based on the output of cat-file, you can create the Git commit hash using this Bash command (which uses git cat-file, so it's no reimplementation):

cat_file_output=$(git cat-file commit HEAD);
printf "commit $(wc -c <<< "$cat_file_output")\0$cat_file_output\n" | sha1sum

The Bash command illustrates that, similar to the steps before, what we hash is:

  • A leading string, "commit " in this step, followed by
  • the size of a bunch of data. Here it's the output of cat-file which is detailed above. Followed by
  • a null byte, followed by
  • the data itself (output of cat-file) with a line break at the end.

In case you kept score: Creating a Git commit hash involves using SHA-1 at least three times.

Below is the Rust function for creating the Git commit hash. It uses the tree_object_hash produced in the previous step and a struct CommitMetaData which contains the rest of the data you see when calling git cat-file commit HEAD. The function also takes care of whether the commit has a parent commit or not.

fn commit_hash(commit: &CommitMetaData, tree_object_hash: &str) -> Vec<u8> {
    let author_line = format!(
        "{} {}",
        commit.author_name_and_email, commit.author_timestamp_and_timezone
    );
    let committer_line = format!(
        "{} {}",
        commit.committer_name_and_email, commit.committer_timestamp_and_timezone
    );

    // If it's the first commit, which has no parent,
    // the line starting with "parent" is omitted
    let parent_commit_line = match commit.parent_commit_hash {
        Some(parent_commit_hash) => format!("\nparent {parent_commit_hash}"),
        None => "".to_string(),
    };
    let git_cat_file_str = format!(
        "tree {}{}\nauthor {}\ncommitter {}\n\n{}\n",
        tree_object_hash, parent_commit_line, author_line, committer_line, commit.commit_message
    );

    let git_cat_file_len = git_cat_file_str.len().to_string();

    let commit_object = [
        "commit ".as_bytes(),
        git_cat_file_len.as_bytes(),
        b"\0",
        git_cat_file_str.as_bytes(),
    ]
    .concat();

    // Return the Git commit hash
    to_sha1(&commit_object)
}

Here's CommitMetaData:

#[derive(Debug, Copy, Clone)]
pub struct CommitMetaData<'a> {
    pub(crate) author_name_and_email: &'a str,
    pub(crate) author_timestamp_and_timezone: &'a str,
    pub(crate) committer_name_and_email: &'a str,
    pub(crate) committer_timestamp_and_timezone: &'a str,
    pub(crate) commit_message: &'a str,
    // All commits after the first one have a parent commit
    pub(crate) parent_commit_hash: Option<&'a str>,
}

This function creates CommitMetaData where author and committer info are identical, which will be convenient when we run the program later:

pub fn simple_commit<'a>(
    author_name_and_email: &'a str,
    author_timestamp_and_timezone: &'a str,
    commit_message: &'a str,
    parent_commit_hash: Option<&'a str>,
) -> CommitMetaData<'a> {
    CommitMetaData {
        author_name_and_email,
        author_timestamp_and_timezone,
        committer_name_and_email: author_name_and_email,
        committer_timestamp_and_timezone: author_timestamp_and_timezone,
        commit_message,
        parent_commit_hash,
    }
}

Putting it all together

As a summary and reminder, creating a Git commit hash consists of getting:

  1. The object ID of the file, which involves hashing the file contents with SHA-1. In Git, hash-object provides this ID.
  2. The object entries that go into the tree object. In Git, you can get an idea of those entries with ls-tree, but their format in the tree object is slightly different: [mode] [file name]\0[object ID]
  3. The hash of the tree object which has the form: tree [size of object entries]\0[object entries]. In Git, get the tree hash with: git cat-file commit HEAD | head -n1
  4. The commit hash by hashing the data you see with cat-file. This includes the tree object hash and commit information like author, time, commit message, and the parent commit hash if it's not the first commit.

In Rust:

pub fn get_commit_hash(
    file_name: &str,
    file_content: &[u8],
    commit: &CommitMetaData
) -> String {
    let file_object_id = git_hash_object(file_content);
    let object_entry = object_entry(file_name, &file_object_id);
    let tree_object_hash = tree_object_hash(&object_entry);

    let commit_hash = commit_hash(commit, &tree_object_hash);
    to_hex_str(&commit_hash)
}

With the functions above, you can create a file's Git commit hash in Rust, without Git: build a CommitMetaData with simple_commit, passing None as parent_commit_hash for the first commit, and hand it to get_commit_hash together with the file name and the file content.

To create the hash of the second commit, you take the hash of the first commit and put it into the CommitMetaData of the second commit as Some(first_commit_hash).
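As a rough shell counterpart to the Rust functions above, the commit object for a first and a second commit can be sketched as follows. This is an illustration, not the author's code: commit_hash here is a hypothetical shell helper, the tree hash, author, timestamp, and messages are placeholder values, author and committer are assumed to be identical, and sha1sum is assumed to be installed.

```shell
# Hypothetical helper: build the text of a commit object and hash it.
# Arguments: tree hash, parent hash (empty for the first commit),
#            "Name <email>", "timestamp timezone", commit message.
commit_hash() {
    body=$(mktemp)
    {
        printf 'tree %s\n' "$1"
        # The parent line is omitted for the very first commit.
        if [ -n "$2" ]; then
            printf 'parent %s\n' "$2"
        fi
        printf 'author %s %s\n' "$3" "$4"
        printf 'committer %s %s\n' "$3" "$4"
        printf '\n%s\n' "$5"
    } > "$body"
    size=$(wc -c < "$body" | tr -d ' ')
    # commit [size]\0[the text that git cat-file commit would print]
    { printf 'commit %s\0' "$size"; cat "$body"; } | sha1sum | cut -d ' ' -f 1
    rm -f "$body"
}

# Placeholder values for illustration only.
tree="a76b2df314b47956268b0c39c88a3b2365fb87eb"
who="Firstname Lastname <test@example.com>"
when="1625338354 +0200"

first=$(commit_hash "$tree" "" "$who" "$when" "Initial commit")
# The second commit names the first commit's hash as its parent.
second=$(commit_hash "$tree" "$first" "$who" "$when" "Second commit")

echo "$first"
echo "$second"
```

Because the parent hash is part of the hashed text, changing anything in the first commit changes the hash of every commit that follows it.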

Besides the other answers here and their links, these were some useful resources in creating my limited reimplementation:


Von*_*onC 11

As described in "How is git commit sha1 formed", the formula is:

(printf "<type> %s\0" $(git cat-file <type> <ref> | wc -c); git cat-file <type> <ref>)|sha1sum

In the case of the commit 9eabf5b536662000f79978c4d1b6e4eff5c8d785 (which is v2.4.2^{}, and which references a tree):

(printf "commit %s\0" $(git cat-file commit 9eabf5b536662000f79978c4d1b6e4eff5c8d785 | wc -c); git cat-file commit 9eabf5b536662000f79978c4d1b6e4eff5c8d785 )|sha1sum

This will give 9eabf5b536662000f79978c4d1b6e4eff5c8d785.

As would:

(printf "commit %s\0" $(git cat-file commit v2.4.2^{} | wc -c); git cat-file commit v2.4.2^{})|sha1sum

(still 9eabf5b536662000f79978c4d1b6e4eff5c8d785)

Similarly, computing the SHA1 of the tag v2.4.2 would be:

(printf "tag %s\0" $(git cat-file tag v2.4.2 | wc -c); git cat-file tag v2.4.2)|sha1sum

This will give 29932f3915935d773dc8d52c292cadd81c81071d.


Sul*_*lli 5

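There is a bit of confusion here. Git uses different types of objects: blobs, trees and commits. The following command:

git cat-file -t <hash>

tells you the type of object for a given hash. So in your example, the hash 9eabf5b536662000f79978c4d1b6e4eff5c8d785 corresponds to a commit object.

Now, as you figured out yourself, running this:

git cat-file -p 9eabf5b536662000f79978c4d1b6e4eff5c8d785

gives you the content of the object according to its type (in this case, a commit).

But this:

git hash-object fi

...computes the hash of a blob whose content is the output of the previous command (in your example), but it could be anything else (like "hello world!"). Try this:

echo "blob 277\0$(cat fi)" | shasum

The output is the same as that of the previous command. This is basically how Git hashes a blob. So by hashing fi, you are generating a blob object. But as we have seen, 9eabf5b536662000f79978c4d1b6e4eff5c8d785 is a commit, not a blob. So you cannot hash fi as it is and expect to get the same hash.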
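The effect of the object type can be illustrated with a small shell sketch. It is an assumption-laden illustration rather than part of the original answer: hash_as is a hypothetical helper, sha1sum is assumed to be installed, and a throwaway file stands in for fi. The same bytes produce different hashes depending on whether the header says "blob" or "commit".

```shell
# Hypothetical helper: hash a file as a Git object of the given type.
hash_as() {
    size=$(wc -c < "$2" | tr -d ' ')
    # <type> <size>\0<content>
    { printf '%s %s\0' "$1" "$size"; cat "$2"; } | sha1sum | cut -d ' ' -f 1
}

# Stand-in content; in the question this file would be fi.
stand_in=$(mktemp)
printf 'hello world\n' > "$stand_in"

hash_as blob "$stand_in"     # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad, as git hash-object would report
hash_as commit "$stand_in"   # a different hash, because the header differs

rm -f "$stand_in"
```

In the question's setting, hash_as commit fi should reproduce 9eabf5b536662000f79978c4d1b6e4eff5c8d785, assuming fi holds exactly the bytes that git cat-file printed for that commit.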

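The hash of a commit is based on several other pieces of information that make it unique (such as the committer, the author, the date, etc.). The following article tells you exactly what a commit hash is made of:

The anatomy of a git commit

So you can get the same hash by providing exactly the same values for all the data specified in the article.

This might be helpful as well:

Git from the bottom up

  • `echo "blob 277\0$(cat fi)" | shasum` produced a different result from `git hash-object fi` for me, for two reasons. First, I did not realize that 277 refers to the size of `fi`, and the size of my particular `fi` was not 277. Second, the GNU coreutils version of `echo` appends a newline and does not interpret `\0` as a NUL byte (`echo -en` fixes this). The following command produces the same result as `git hash-object fi`: `printf "blob $(wc -c < fi)\0$(cat fi)" | sha1sum`. (2 upvotes)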