Remove all information from a .git directory that can be re-downloaded

Question

I have a git repository that, when freshly checked out, takes around 2.3 GiB even in the shallowest configuration, of which 1.9 GiB is inside .git/objects/pack. The working tree files take up only about 0.5 GiB.

Considering I have a remote from which I can re-fetch all the objects if needed, the question is:

What can I delete from inside .git (and how), such that everything I delete can then be safely re-fetched from the remote with simple git commands?

Testing a bit, I found out that if I delete everything under .git/objects/pack/, it will be re-downloaded from the remote with a simple git fetch.

There are some complaints like:

error: refs/heads/master does not point to a valid object!
error: refs/remotes/origin/master does not point to a valid object!
error: refs/remotes/origin/HEAD does not point to a valid object!

But then .git/objects/pack gets repopulated and further calls to git fetch don't complain anymore.

Is it safe to nuke .git/objects/pack* like this?

Assumptions:

There are no local-only commits in the repo or any form of git manipulation (like adding/removing objects from the stage), just checking out a specific branch in shallow mode.
The remote won't be rewriting history for the checked out branches.
I have no control whatsoever over the contents of the remote repository itself. It's a dependency of my project, but a fast changing one that is only available as git, and I want instructions for automated use in a continuous integration setting. Tips on how to modify the repository itself to make it take less space aren't going to help.
As I mentioned earlier, 1.9 GiB is for a shallow clone of the one branch I'm interested in. It's a lot bigger than that when it's non-shallow, due to its long history (an open-source project that is over 10 years old).
There are other repositories checked out in the same continuous-integration pipeline and I'd like to apply the same reduction of redundant-with-remote info in all of them.

The intent is to reduce as much as possible the amount of space taken by artifacts from a continuous-integration pipeline, while retaining enough information that those artifacts can be downloaded and restored to working order on a developer workstation with as few (and as ordinary) commands as possible.

Answer 1

Leo*_*ael 0

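What (and how) can I delete from inside .git everything that I could then re-fetch safely, with simple git commands, from the remote?

How about everything?

If you don't want to worry about the internals of .git and about what is or isn't recoverable, you can save just enough information to check everything out again and restore the workspace to a state functionally similar to the one it had when it ran in the CI pipeline.

In the CI pipeline

Add a file like this somewhere (let's call it degit.sh):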

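#!/bin/bash
set -ex
# Record where this checkout came from before .git is deleted.
GIT_REMOTE=$( git remote get-url origin )
# Note: this prints "HEAD" on a detached checkout, which git clone --branch rejects.
GIT_BRANCH=$( git rev-parse --abbrev-ref HEAD )
GIT_COMMIT=$( git rev-parse HEAD )

# TABs, not spaces, indenting the block below (<<- strips leading tabs only).
# Unescaped variables expand now; \$-escaped ones expand when .gitrestore runs.
cat <<-EOF > .gitrestore
	set -ex
	test ! -e .git
	tmpclone=\$( mktemp -d --tmpdir=. )
	git clone $GIT_REMOTE -n --branch=$GIT_BRANCH \$tmpclone
	( cd \$tmpclone ; git reset --hard $GIT_COMMIT )
	mv \$tmpclone/.git .
	rm -rf "\$tmpclone"
	rm -f \$0
EOF

rm -rf .git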

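Then call it at the root of each git repository in the continuous integration (CI) workspace, so that it generates a .gitrestore file.

The generated file will look something like this: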

set -ex
test ! -e .git
tmpclone=$( mktemp -d --tmpdir=. )
git clone git@example.com:example/repo.git -n --branch=example-branch $tmpclone
( cd $tmpclone ; git reset --hard example-commit-hash )
mv $tmpclone/.git .
rm -rf "$tmpclone"
rm -f $0

Note that it self-destructs after running successfully. You don't want to run it twice.
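
Since the question mentions several repositories in the same pipeline, the call can be scripted. What follows is only a minimal sketch: WORKSPACE and DEGIT are hypothetical variable names (the workspace root and an absolute path to degit.sh), and it only finds checkouts whose .git is a directory, so submodules and linked worktrees would need separate handling.

```shell
#!/bin/bash
# Sketch: run degit.sh at the root of every git checkout below a workspace.
# WORKSPACE and DEGIT are hypothetical names, not part of the original answer.
set -e
WORKSPACE=${WORKSPACE:-$PWD}
DEGIT=${DEGIT:-$PWD/degit.sh}

degit_all() {
    # -prune keeps find from descending into each .git directory it reports.
    # Paths containing newlines are not supported by this simple loop.
    find "$WORKSPACE" -type d -name .git -prune -print |
        while IFS= read -r gitdir; do
            # Run in a subshell so the cd does not leak into the next iteration.
            ( cd "$(dirname "$gitdir")" && bash "$DEGIT" )
        done
}
```

Calling degit_all as the last step of the pipeline, after the build has finished, would leave one .gitrestore per checkout and no .git directories.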

On the development machine

Now your developers can fetch the CI artifacts and, in each repository, run:

bash .gitrestore

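This leaves a repository that looks very similar to the one in the CI pipeline, except that it has an up-to-date view of the remote, which lets the developer compare what CI had with what she has.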
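
The round trip can be rehearsed offline before wiring it into CI. The sketch below is an illustration under assumptions, not the pipeline itself: it builds a throwaway repository to stand in for the remote, inlines the steps that degit.sh and .gitrestore perform, and uses made-up names throughout (sandbox, build.log, the demo identity). It uses a mktemp template instead of --tmpdir so that it also runs with non-GNU mktemp.

```shell
#!/bin/bash
# Sketch: rehearse the degit/restore round trip against a throwaway local remote.
# Every path, file name and identity below is made up for the demonstration.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sandbox=$(mktemp -d)
cd "$sandbox"

# 1. A stand-in "remote" with one commit.
git init -q remote
echo "v1" > remote/file.txt
git -C remote add file.txt
git -C remote commit -q -m "initial"

# 2. The "CI" checkout: record what degit.sh records, then drop .git.
git clone -q "file://$sandbox/remote" ci
cd ci
remote_url=$(git remote get-url origin)
branch=$(git rev-parse --abbrev-ref HEAD)
commit=$(git rev-parse HEAD)
echo "built by CI" > build.log   # an untracked build artifact
rm -rf .git

# 3. Meanwhile the remote moves on by one commit.
echo "v2" > ../remote/file.txt
git -C ../remote commit -q -am "second"

# 4. The "developer" restore: the same steps .gitrestore runs.
tmpclone=$(mktemp -d ./tmpclone.XXXXXX)
git clone -q -n --branch="$branch" "$remote_url" "$tmpclone"
( cd "$tmpclone" && git reset -q --hard "$commit" )
mv "$tmpclone/.git" .
rm -rf "$tmpclone"

# 5. Compare what CI had with what the remote has now.
restored=$(git rev-parse HEAD)
behind=$(git rev-list --count "HEAD..origin/$branch")
status=$(git status --porcelain)
echo "restored=$restored behind=$behind"
echo "$status"
```

In this rehearsal HEAD should come back at the commit CI built, the remote-tracking branch should be one commit ahead of it, and git status should list only build.log as untracked.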

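Other considerations

This assumes that only the CI machine is space-constrained, and that neither the developer's machine nor her bandwidth is.

If you want to save space/bandwidth on the developer side as well, you can pass --depth=1 to git clone, which will clone only the specified branch (i.e., it implies --single-branch) and limit the history to a single commit.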
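
One caveat with --depth=1: the shallow clone contains only the current branch tip, so if the remote has moved on since CI ran, the recorded commit is missing and git reset --hard cannot find it. A hedged sketch of one way to cope, assuming it is acceptable to fall back to the full history in that case; the sandbox remote and all names are hypothetical:

```shell
#!/bin/bash
# Sketch: shallow restore that falls back to full history when the recorded
# commit is no longer the branch tip. Paths and names are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
sandbox=$(mktemp -d)
cd "$sandbox"

# Stand-in remote with two commits; pretend CI built the first one.
git init -q remote
git -C remote commit -q --allow-empty -m "first"
commit=$(git -C remote rev-parse HEAD)
branch=$(git -C remote rev-parse --abbrev-ref HEAD)
git -C remote commit -q --allow-empty -m "second"

# --depth is ignored for plain local paths, hence the file:// URL.
git clone -q -n --depth=1 --branch="$branch" "file://$sandbox/remote" restore
cd restore
depth_before=$(git rev-list --count HEAD)

# The recorded commit is older than the tip, so the shallow clone lacks it.
if ! git cat-file -e "$commit^{commit}" 2>/dev/null; then
    git fetch -q --unshallow origin
fi
git reset -q --hard "$commit"
echo "depth_before=$depth_before restored=$(git rev-parse HEAD)"
```

When the recorded commit is still the tip, the cat-file check succeeds and nothing beyond the single commit is downloaded.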
