Remove all information from a .git directory that can be re-downloaded

Leo*_*ael 5 git

I have a git repository that, when just checked out, takes around 2.3 GiB even in the shallowest configuration, of which 1.9 GiB is inside .git/objects/pack. The working tree files are just about .5 GiB.

Considering I have a remote from which I can re-fetch all the objects if needed, the question is:

  • What can I delete from inside .git (and how), such that everything deleted could then be safely re-fetched from the remote with simple git commands?

Testing a bit, I found out that if I delete everything under .git/objects/pack/, it will be re-downloaded from the remote with a simple git fetch.

There are some complaints like:

error: refs/heads/master does not point to a valid object!
error: refs/remotes/origin/master does not point to a valid object!
error: refs/remotes/origin/HEAD does not point to a valid object!

But then .git/objects/pack gets repopulated and further calls to git fetch don't complain anymore.

Is it safe to nuke .git/objects/pack* like this?

Assumptions:

  • There are no local-only commits in the repo or any form of git manipulation (like adding/removing objects from the stage), just checking out a specific branch in shallow mode.
  • The remote won't be rewriting history for the checked out branches.
  • I have no control whatsoever over the contents of the remote repository itself. It's a dependency of my project, but a fast changing one that is only available as git, and I want instructions for automated use in a continuous integration setting. Tips on how to modify the repository itself to make it take less space aren't going to help.
  • As I mentioned earlier, 1.9 GiB is for a shallow clone of the one branch I'm interested in. It's a lot bigger than that when it's non-shallow, due to its long history (it's an open-source project that is over 10 years old).
  • There are other repositories checked out in the same continuous-integration pipeline and I'd like to apply the same reduction of redundant-with-remote info in all of them.
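
The first assumption above can be checked mechanically before anything is deleted. The following is only a sketch: the function name git_is_disposable is made up for illustration, and it assumes the remote-tracking refs under refs/remotes/ are up to date.

```shell
#!/bin/bash
# Hypothetical helper (not part of git): succeeds only if the repository
# holds nothing that exists solely on this machine.
# Assumes the remote-tracking refs (refs/remotes/*) are up to date.
git_is_disposable() {
    local repo=${1:-.} unpushed
    # Staged-but-uncommitted changes would be lost together with the index.
    git -C "$repo" diff --cached --quiet || return 1
    # Stashes exist only in the local repository.
    if git -C "$repo" rev-parse --verify --quiet refs/stash >/dev/null; then
        return 1
    fi
    # Count commits reachable from HEAD or a local branch,
    # but from no remote-tracking ref.
    unpushed=$( git -C "$repo" rev-list --count HEAD --branches --not --remotes ) || return 1
    [ "$unpushed" -eq 0 ]
}
```

A CI step could call git_is_disposable in each repository and refuse to delete anything if it fails.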

The intent is to reduce as much as possible the amount of space taken by artifacts from a continuous-integration pipeline, while retaining enough information so that those artifacts can be downloaded and restored to working order on a developer workstation with as few (and as ordinary) commands as possible.

Leo*_*ael 0

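  • What can I delete from inside .git (and how), such that everything deleted could then be safely re-fetched from the remote with simple git commands?

How about everything?

If you don't want to worry about the internals of .git and what is or isn't recoverable, you can save just enough information to check everything out again and restore the workspace to a state functionally similar to the one it had when running in the CI pipeline.

In the CI pipeline

Add a file like this somewhere (let's call it degit.sh):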

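#!/bin/bash
set -ex

# Record where the repository came from and what is currently checked out.
GIT_REMOTE=$( git remote get-url origin )
# Prints "HEAD" on a detached checkout, which git clone --branch cannot use.
GIT_BRANCH=$( git rev-parse --abbrev-ref HEAD )
GIT_COMMIT=$( git rev-parse HEAD )

# Write the restore script. Unescaped $VARS are expanded now, in CI;
# escaped \$vars are expanded later, when .gitrestore runs.
# Note that mktemp --tmpdir is a GNU coreutils option.
# TABs, not spaces, indenting the block below (<<- strips leading tabs):
cat <<-EOF > .gitrestore
	set -ex
	test ! -e .git
	tmpclone=\$( mktemp -d --tmpdir=. )
	git clone $GIT_REMOTE -n --branch=$GIT_BRANCH \$tmpclone
	( cd \$tmpclone ; git reset --hard $GIT_COMMIT )
	mv \$tmpclone/.git .
	rm -rf "\$tmpclone"
	rm -f \$0
EOF

# Drop the repository data; .gitrestore can bring it back.
rm -rf .git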

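Then call it at the root of every git repository in the continuous-integration (CI) workspace, so that it generates a .gitrestore file.

The generated file looks like this: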

set -ex
test ! -e .git
tmpclone=$( mktemp -d --tmpdir=. )
git clone git@example.com:example/repo.git -n --branch=example-branch $tmpclone
( cd $tmpclone ; git reset --hard example-commit-hash )
mv $tmpclone/.git .
rm -rf "$tmpclone"
rm -f $0

Note that it self-destructs after running successfully. You don't want to run it twice.
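
Calling degit.sh in every repository of the workspace could be automated along the following lines. This is a sketch only: the function name degit_all is hypothetical, the script path must be absolute (the function changes directory before running it), and paths containing newlines are not handled.

```shell
#!/bin/bash
# Hypothetical wrapper: run a degit-style script at the root of every
# repository found under a workspace directory.
degit_all() {
    local workspace=$1 script=$2 gitdirs
    # Collect the list first, because the script deletes the very
    # directories that find is looking for. -prune keeps find from
    # descending into each .git directory.
    gitdirs=$( find "$workspace" -type d -name .git -prune ) || return 1
    [ -n "$gitdirs" ] || return 0
    printf '%s\n' "$gitdirs" | while IFS= read -r gitdir; do
        # Subshell: the cd must not leak into the next iteration.
        # stdin is redirected so the script cannot swallow the list.
        ( cd "$( dirname "$gitdir" )" && bash "$script" </dev/null ) || exit 1
    done
}
```

For example, `degit_all "$PWD" /path/to/degit.sh` would process every repository below the current directory; the path shown is a placeholder.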

On the developer machine

Now your developers can fetch the CI artifacts and, in each repository, run:

bash .gitrestore

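She will then have a repository that looks very much like the one in the CI pipeline, except with an up-to-date view of the remote, which lets the developer compare what CI had with what she has.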
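
One way to make that comparison is to count how many commits the remote branch has gained since the commit CI built. The helper below is a sketch; its name is made up, and it assumes HEAD is on a branch that tracks a remote branch, which is what git clone --branch sets up.

```shell
#!/bin/bash
# Hypothetical helper: print the number of commits that the upstream
# branch has and the restored checkout (HEAD) does not.
# Assumes HEAD is on a branch with a configured upstream.
commits_behind_upstream() {
    git -C "${1:-.}" rev-list --count 'HEAD..@{upstream}'
}
```

In the same repository, `git status --short` lists files that differ from the CI commit, and `git log --oneline 'HEAD..@{upstream}'` shows the newer commits themselves.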

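Other considerations

This assumes that only the CI machine is space-constrained, and that the developer's machine (and her bandwidth) is not.

If you want to save space/bandwidth on the developer side as well, you can pass --depth=1 to git clone, which clones only the specified branch (i.e., it implies --single-branch) and limits the history to a single commit. Note that a depth-1 clone contains only the current tip of the branch, so the reset to the recorded commit will fail if the branch has moved on since the CI run.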
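
A shallow restore that does not depend on the branch tip could fetch the recorded commit directly. The sketch below makes several assumptions: the function name restore_shallow is hypothetical, and the server must allow clients to request a commit by its hash (for example through uploadpack.allowReachableSHA1InWant), which not every server does. Unlike the clone-based script, it leaves HEAD on whatever branch name git init picks and configures no upstream.

```shell
#!/bin/bash
# Hypothetical alternative to the body of .gitrestore: download only the
# recorded commit, without its history.
# Assumption: the server accepts fetch requests for a bare commit hash.
restore_shallow() {
    local remote=$1 commit=$2
    # Refuse to run over an existing repository.
    test ! -e .git || return 1
    git init -q . || return 1
    git remote add origin "$remote" || return 1
    # Fetch exactly one commit, identified by its full hash.
    git fetch -q --depth=1 origin "$commit" || return 1
    # Point HEAD and the index at that commit; a mixed reset leaves the
    # working tree files (the CI artifacts) untouched.
    git reset -q "$commit"
}
```

It would be run at the root of the unpacked artifacts, with the remote URL and commit hash that degit.sh recorded.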