How to organize a Python project with pickle files?

Mic*_*ael 6 python git pickle

I come from a Java background and am completely new to Python.

I now have a Python project that consists of a few Python scripts and pickle files, all stored in Git. The pickle files are serialized sklearn models.

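I'm wondering how to organize this project. I don't think we should store the pickle files in Git. We should probably store them somewhere as binary dependencies.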

Does that make sense? What is the common way to store binary dependencies for a Python project?

drd*_*man 6

Git is just fine with binary data. For example, many projects store images in Git repos.

I guess the rule of thumb is to decide whether your binary files are source material, an external dependency, or an intermediate build product. Of course, there are no strict rules, so just decide how you feel about them. Here are my suggestions:

  1. If they're (reproducibly) generated from something, .gitignore the binaries and have scripts that build the necessary data. It could be in the same, or in a separate repo - depending on where it feels best.

  2. The same logic applies if they're obtained from some external source, e.g. an external download. Usually, we don't store dependencies in the repository - we only keep references to them. E.g. we don't keep virtualenvs but only a requirements.txt file. A rough Java-world analogy is not committing .jars but only pom.xml or the dependencies section of build.gradle.

  3. If they can be considered source material, e.g. if you manipulate them using Python as an editor - don't worry about the files' binary nature and just keep them in your repository.

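  4. If they aren't really source material, but their generation process is really complicated or takes a really long time, and the files aren't meant to be updated on a regular basis - I think it wouldn't be terribly wrong to have them in the repo. Of course, leaving a note (README.txt or something) about how the files were produced is a good idea.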
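To illustrate suggestion 1, here is a minimal sketch of a build script that regenerates a pickle file when it is missing, so the binary itself can be listed in .gitignore. The path `models/model.pkl` and the `train_model` function are hypothetical placeholders; in a real project `train_model` would fit the actual sklearn estimator, and the sketch assumes that training is reproducible.

```python
import pickle
from pathlib import Path

# Hypothetical location; this path would be listed in .gitignore.
MODEL_PATH = Path("models/model.pkl")


def train_model():
    """Placeholder for the real training step (e.g. fitting an sklearn estimator)."""
    # Assumption: training gives the same result for the same inputs.
    return {"weights": [0.1, 0.2, 0.3], "bias": 0.5}


def build_model(path=MODEL_PATH):
    """Train the model and serialize it to `path`, creating directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(train_model(), f)
    return path


def load_model(path=MODEL_PATH):
    """Load the pickled model, rebuilding it first if the file is missing."""
    if not path.exists():
        build_model(path)
    with path.open("rb") as f:
        return pickle.load(f)
```

With something like this, a fresh clone contains only the scripts, and the first call to `load_model()` recreates the binary locally.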

Oh, and if the files are large (say, hundreds of megabytes or more), consider taking a look at git-lfs.
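To find out whether any files in a working tree are in that size range, a small scan can help. This is only a sketch: the function name `find_large_files` and the 100 MB default threshold are my own assumptions, not anything defined by git-lfs.

```python
from pathlib import Path


def find_large_files(root, threshold_bytes=100 * 1024 * 1024):
    """Return (path, size) pairs for files at or above the threshold, largest first."""
    root = Path(root)
    hits = []
    for p in root.rglob("*"):
        # Skip Git's own object store.
        if ".git" in p.relative_to(root).parts:
            continue
        if p.is_file():
            size = p.stat().st_size
            if size >= threshold_bytes:
                hits.append((p, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for path, size in find_large_files("."):
        print(f"{size / 1024 / 1024:.1f} MB  {path}")
```

Anything it reports is a candidate for git-lfs or for storage outside the repository.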