"Cannot combine the series or dataframe because it comes from a different dataframe" when using only 1 dataframe

Question

Ego*_*sky 5 python-3.x pandas pyspark databricks delta-lake

I'm using Databricks and created a Delta Lake table for my data. When I try to modify its columns with the pandas API, the following error message pops up for some reason:

ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option.

I use the following code to rewrite the data in the table:

df_new = spark.read.format('delta').load(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/{delta_name}")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from math import *
from pyspark.pandas.config import set_option
import pyspark.pandas as ps
%matplotlib inline

win_len = 5000

# For this be sure you have runtime 1.11 or earlier version

df_new = df_new.pandas_api()
print('Creating Average active power for U1 and V1...')
df_new['p_avg1'] = df_new.Current1.mul(df_new['Voltage1']).rolling(min_periods=1, window=win_len).mean()

print('Creating Average active power for U2 and V2...')
df_new['p_avg2'] = df_new.Current2.mul(df_new['Voltage2']).rolling(min_periods=1, window=win_len).mean()

print('Creating Average active power for U3 and V3...')
df_new['p_avg3'] = df_new.Current3.mul(df_new['Voltage3']).rolling(min_periods=1, window=win_len).mean()

print('Creating Average active power for U4 and V4...')
df_new['p_avg4'] = df_new.Current4.mul(df_new['Voltage4']).rolling(min_periods=1, window=win_len).mean()

print('Converting to Spark dataframe')
df_new = df_new.to_spark()

print('Complete')

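I have used the pandas API before without any problems, and I'm on the latest runtime, 11.2. Only one dataframe was loaded while I was using the cluster.

Thank you in advance.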

Answer 1

Pow*_*ers 7

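The error message itself suggests the fix: "In order to allow this operation, enable 'compute.ops_on_diff_frames' option."

Here is how to enable this option, according to the documentation: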

import pyspark.pandas as ps
ps.set_option('compute.ops_on_diff_frames', True)

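The documentation includes this important warning:

Pandas API on Spark disallows the operations on different DataFrames (or Series) by default to prevent expensive operations. It internally performs a join operation which can be expensive in general.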
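
To make the "join" in that warning more concrete, here is a small analogy in plain pandas (not pandas-on-Spark, and not a description of Spark's internals). It shows that index-aligned arithmetic across two frames gives the same values as an explicit join on the index followed by the arithmetic. The frames `a` and `b` are hypothetical. On a distributed engine, a join like this can involve moving data between workers, which is why it can be costly.

```python
import pandas as pd

# Hypothetical frames whose indexes hold the same labels in a different order.
a = pd.DataFrame({"current": [1.0, 2.0, 3.0]}, index=[0, 1, 2])
b = pd.DataFrame({"voltage": [10.0, 20.0, 30.0]}, index=[2, 1, 0])

# Arithmetic across the two frames aligns rows by index label, not position.
aligned = (a["current"] * b["voltage"]).sort_index()

# The same result written as an explicit join on the index, then a multiply.
joined = a.join(b, how="outer")
via_join = (joined["current"] * joined["voltage"]).sort_index()

print(aligned.tolist())
print(via_join.tolist())
```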
