Posts by Moh*_*ali

Python Git diff parser

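I want to parse a git diff with Python code, and I am interested in getting the following information from the diff parser:

The content of the deleted/added lines, along with their line numbers.
The file name.
The status of the file: whether it was deleted, renamed, or added.

I am using unidiff 0.5.2 for this purpose, and I wrote the following code: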

    from unidiff import PatchSet
    import git
    import os

    commit_sha1 = 'b4defafcb26ab86843bbe3464a4cf54cdc978696'
    repo_directory_address = '/my/git/repo'
    repository = git.Repo(repo_directory_address)
    commit = repository.commit(commit_sha1)
    diff_index = commit.diff(commit_sha1+'~1', create_patch=True)
    diff_text = reduce(lambda x, y: str(x)+os.linesep+str(y), diff_index).split(os.linesep)
    patch = PatchSet(diff_text)
    print patch[0].is_added_file


I am using GitPython to generate the Git diff. The code above fails with the following error:

    current_file = PatchedFile(source_file, target_file,
    UnboundLocalError: local variable 'source_file' referenced before assignment


I would appreciate any help resolving this error.
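
To make the three requested pieces of information concrete, here is a minimal sketch that pulls them straight out of unified diff text using only the standard library. It does not use unidiff or GitPython, and it is not a fix for the error above. `FileChange`, `parse_git_diff`, and `SAMPLE` are hypothetical names introduced for illustration, and the parser assumes well-formed `git diff` output with the default `a/` and `b/` path prefixes.

```python
import re
from dataclasses import dataclass, field

# Matches hunk headers such as "@@ -1,3 +1,4 @@"; a missing count means 1.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


@dataclass
class FileChange:
    path: str = ""
    old_path: str = ""
    status: str = "modified"  # "modified", "added", "deleted" or "renamed"
    added: list = field(default_factory=list)    # (new line number, text)
    removed: list = field(default_factory=list)  # (old line number, text)


def _strip_prefix(path):
    # Drop git's default "a/" or "b/" prefix if it is present.
    return path[2:] if path[:2] in ("a/", "b/") else path


def parse_git_diff(text):
    files, current = [], None
    old_no = new_no = 0      # next line number on each side
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in text.splitlines():
        if old_left > 0 or new_left > 0:
            # Inside a hunk body: classify by the first character.
            if line.startswith("+"):
                current.added.append((new_no, line[1:]))
                new_no += 1
                new_left -= 1
            elif line.startswith("-"):
                current.removed.append((old_no, line[1:]))
                old_no += 1
                old_left -= 1
            elif not line.startswith("\\"):
                # Context line: present on both sides.
                old_no += 1
                new_no += 1
                old_left -= 1
                new_left -= 1
            continue
        if line.startswith("diff --git "):
            current = FileChange()
            files.append(current)
        elif current is None:
            continue
        elif line.startswith("new file mode"):
            current.status = "added"
        elif line.startswith("deleted file mode"):
            current.status = "deleted"
        elif line.startswith("rename from "):
            current.status = "renamed"
            current.old_path = line[len("rename from "):]
        elif line.startswith("rename to "):
            current.path = line[len("rename to "):]
        elif line.startswith("--- ") and line[4:] != "/dev/null":
            current.old_path = _strip_prefix(line[4:])
        elif line.startswith("+++ ") and line[4:] != "/dev/null":
            current.path = _strip_prefix(line[4:])
        else:
            match = HUNK_HEADER.match(line)
            if match:
                old_no, new_no = int(match.group(1)), int(match.group(3))
                old_left = int(match.group(2) or 1)
                new_left = int(match.group(4) or 1)
    for change in files:
        # A deleted file has no target path; fall back to the old one.
        change.path = change.path or change.old_path
    return files


SAMPLE = """\
diff --git a/hello.py b/hello.py
index 1111111..2222222 100644
--- a/hello.py
+++ b/hello.py
@@ -1,3 +1,3 @@
 import sys
-print "hi"
+print("hi")
 sys.exit(0)
diff --git a/new.txt b/new.txt
new file mode 100644
index 0000000..3333333
--- /dev/null
+++ b/new.txt
@@ -0,0 +1,2 @@
+first
+second
diff --git a/old_name.md b/new_name.md
similarity index 100%
rename from old_name.md
rename to new_name.md
"""

for change in parse_git_diff(SAMPLE):
    print(change.status, change.path, change.added, change.removed)
```

The parser relies on the line counts in each hunk header to tell a removed line that happens to start with `--` apart from a `---` file header, which is why it tracks how many lines remain in the current hunk.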

python git parsing gitpython

Moh*_*ali

2016-09-11

6
votes

1
answer

4107
views

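Writing a large data stream to Parquet with Python

I want to write a large data stream to a Parquet file with Python. The data is too large to hold in memory and write in one go.

I found two Python libraries that can read and write Parquet files (Pyarrow, Fastparquet). Below is my attempt using Pyarrow, but I would be happy to try another library if you know a working solution: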

    import pandas as pd
    import random
    import pyarrow as pa
    import pyarrow.parquet as pq


    def data_generator():
        # This is a simulation for my generator function
        # It is not allowed to change the nature of this function
        options = ['op1', 'op2', 'op3', 'op4']
        while True:
            dd = {'c1': random.randint(1, 10), 'c2': random.choice(options)}
            yield dd


    result_file_address = 'example.parquet'
    index = 0

    try:
        dic_data = next(data_generator())
        df = pd.DataFrame(dic_data, [index])
        table = …

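
One way to approach the memory constraint described above, sketched under assumptions rather than offered as a confirmed answer: read the generator in fixed-size batches and append each batch to a single open `pyarrow.parquet.ParquetWriter`. The helper names `batched` and `write_stream_to_parquet`, the `batch_size` default, and the `c1`/`c2` schema are hypothetical choices made for this illustration; `pa.Table.from_pylist` requires a reasonably recent pyarrow (7.0 or later).

```python
from itertools import islice


def batched(rows, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def write_stream_to_parquet(rows, path, batch_size=10_000):
    """Write an iterable of dicts to a Parquet file, one batch at a time."""
    # Imported here so the batching helper above works without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Assumed schema, matching the c1/c2 columns of the question's generator.
    schema = pa.schema([("c1", pa.int64()), ("c2", pa.string())])
    total = 0
    with pq.ParquetWriter(path, schema) as writer:
        for batch in batched(rows, batch_size):
            table = pa.Table.from_pylist(batch, schema=schema)
            # Each call appends the batch to the file as one or more row groups.
            writer.write_table(table)
            total += len(batch)
    return total
```

Because the generator in the question never terminates, a caller would need to bound it, for example `write_stream_to_parquet(islice(data_generator(), 1_000_000), 'example.parquet')`. Only one batch is held in memory at a time, so peak memory depends on `batch_size` rather than on the length of the stream.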

python streaming bigdata parquet pyarrow

Moh*_*ali

2019-10-31

5
votes

0
answers

1289
views