Modifying a pandas DataFrame with dynamic logic using exec

Question

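Suppose I have a script that reads data from a database into a DataFrame, runs some logic on that DataFrame, and then exports the resulting DataFrame to another database table, as shown below. The problem is that the DataFrame in transform.py is not overwritten after the exec call.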

Note: this is a simplified example to illustrate the problem, not the actual problem I am trying to solve with this approach.

Expected:

Before exec

+---------+---------------+--------------+----------+
| metric  | modified_date | current_date | datediff |
+---------+---------------+--------------+----------+
| metric1 | 2019-03-31    | 2019-05-03   |       33 |
| metric2 | 2019-03-31    | 2019-05-03   |       33 |
| metric3 | 2019-03-31    | 2019-05-03   |       33 |
| metric4 | 2019-03-20    | 2019-05-03   |       44 |
+---------+---------------+--------------+----------+

After exec

+---------+---------------+--------------+----------+
| metric  | modified_date | current_date | datediff |
+---------+---------------+--------------+----------+
| metric4 | 2019-03-20    | 2019-05-03   |       44 |
+---------+---------------+--------------+----------+

Actual:

Before exec

+---------+---------------+--------------+----------+
| metric  | modified_date | current_date | datediff |
+---------+---------------+--------------+----------+
| metric1 | 2019-03-31    | 2019-05-03   |       33 |
| metric2 | 2019-03-31    | 2019-05-03   |       33 |
| metric3 | 2019-03-31    | 2019-05-03   |       33 |
| metric4 | 2019-03-20    | 2019-05-03   |       44 |
+---------+---------------+--------------+----------+

After exec

+---------+---------------+--------------+----------+
| metric  | modified_date | current_date | datediff |
+---------+---------------+--------------+----------+
| metric1 | 2019-03-31    | 2019-05-03   |       33 |
| metric2 | 2019-03-31    | 2019-05-03   |       33 |
| metric3 | 2019-03-31    | 2019-05-03   |       33 |
| metric4 | 2019-03-20    | 2019-05-03   |       44 |
+---------+---------------+--------------+----------+

They are the same!

Transform

def dataframe_transform(logic, source_table, dest_table, database, existing_rows='truncate'):
    ...
    df = table_to_df(table=source_table, database=database)

    try:
        exec(logic)
    except Exception:
        raise

    result = df_to_table(dataframe=df, database=database, table=dest_table, existing_rows=existing_rows)

    return result

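The logic filters the DataFrame to find the records that need updating, kicks off another process, and overwrites the original DataFrame with the new, filtered data.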

Logic file

# This is just an example I made up - please don't focus on solving this.

late_df = pd.DataFrame()

# Check if data is late
late_cutoff = 40
for index, row in df.iterrows():
    if row['datediff'] >= late_cutoff:
        late_df = late_df.append(row, ignore_index=True)

... # Do something else

df = late_df # Save flagged records by updating the original dataframe.

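Why am I doing this? In this case I know the input is safe, and it lets me reuse this code across various scripts and keep the transformation logic separate.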

Answer 1

Har*_*vey 5

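Check your scope. It is impossible to tell from the code provided, but I suspect your exec call is not managing scope (locals, globals) correctly. "In Python 3, exec is a function; its use has no effect on the compiled bytecode of the function where it is used." (from "What's the difference between eval, exec, and compile?")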

See also: https://www.programiz.com/python-programming/methods/built-in/exec
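
To make the scope issue concrete, here is a minimal sketch, not the asker's actual code. `LOGIC`, `run_without_namespace`, and `run_with_namespace` are hypothetical names invented for this illustration, and the database read and write are left out. It assumes the logic string rebinds `df`, as the logic file in the question does.

```python
import pandas as pd

# Hypothetical stand-in for the contents of the logic file.
LOGIC = "df = df[df['datediff'] >= 40]"


def run_without_namespace(df):
    # exec can read the local df, but its assignment is written to a
    # dict copy of the locals, so the function's own df is never rebound.
    exec(LOGIC)
    return df


def run_with_namespace(df):
    # Hand exec an explicit dict to use as its namespace,
    # then read the rebound df back out of that dict.
    namespace = {"pd": pd, "df": df}
    exec(LOGIC, namespace)
    return namespace["df"]


source = pd.DataFrame({
    "metric": ["metric1", "metric2", "metric3", "metric4"],
    "datediff": [33, 33, 33, 44],
})

print(len(run_without_namespace(source)))  # 4 rows: df was not replaced
print(len(run_with_namespace(source)))     # 1 row: only metric4 remains
```

Passing a dict also makes it explicit which names the dynamic code can see, which may be easier to reason about than relying on the enclosing function's scope.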

Personal opinion: eval/exec are evil and should be avoided.
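
One possible way to avoid exec while still keeping the transformation logic separate and reusable is to pass the logic in as a function. This is only a sketch under assumptions: `keep_late_rows` and `apply_transform` are hypothetical names, and the database read and write from the question are omitted.

```python
import pandas as pd


def keep_late_rows(df, late_cutoff=40):
    # Hypothetical transformation: keep rows at or past the cutoff.
    return df[df["datediff"] >= late_cutoff]


def apply_transform(transform, df):
    # The caller supplies the logic as a callable and receives the
    # transformed frame as a return value, so no scope tricks are needed.
    return transform(df)


source = pd.DataFrame({
    "metric": ["metric1", "metric2", "metric3", "metric4"],
    "datediff": [33, 33, 33, 44],
})

result = apply_transform(keep_late_rows, source)
print(result)
```

Each script could then define its own transformation function in a separate module and import it, rather than shipping code as a string.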

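Others have made that last point in the comments. The code example shows that you are still thinking in rows, mixing vectors (df['col']) with scalars (late_cutoff) inside a row-based operation (for x in iterrows). This is a common problem for pandas users, and I have done a lot of refactoring for other people on exactly this kind of issue.

If you can change your code to use pandas the way it was designed, your program will speed up by an order of magnitude: no loops, and no changes to the original data. Read once, create a new DataFrame with the changed data without using iterators, write once. If you must loop, create a set of keys and iterate over that so each step is a vectorized operation: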

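keys = set(df['key_col'])
for key in keys:
    # A boolean mask selects every row for this key in one vectorized step;
    # further conditions (such as a comparison against a limit) can be
    # applied to dfx in the same way.
    dfx = df[df['key_col'] == key]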

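For the example in the question, the iterrows loop can likely be replaced by a single boolean-mask filter. The sketch below rebuilds the sample table from the question by hand; the column names come from that table, and everything else is illustrative.

```python
import pandas as pd

# Sample data copied from the question's "before" table.
df = pd.DataFrame({
    "metric": ["metric1", "metric2", "metric3", "metric4"],
    "modified_date": ["2019-03-31", "2019-03-31", "2019-03-31", "2019-03-20"],
    "current_date": ["2019-05-03"] * 4,
    "datediff": [33, 33, 33, 44],
})

late_cutoff = 40

# The comparison yields a boolean Series; indexing with it keeps only the
# matching rows and returns a new frame, leaving df itself untouched.
late_df = df[df["datediff"] >= late_cutoff].reset_index(drop=True)

print(late_df)
```

This produces a new DataFrame in one step, which matches the read once, transform, write once pattern described above.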

This may also be useful to you (see the "write many" logic for improving write speed): Bulk Insert A Pandas DataFrame Using SQLAlchemy
