Ego*_*sky 5 python-3.x pandas pyspark databricks delta-lake
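I am working in Databricks. I created a Delta Lake table for my data. When I then tried to modify a column using the pandas API, the following error message popped up for some reason: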
ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option.
I use the following code to rewrite the data in the table:
df_new = spark.read.format('delta').load(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/{delta_name}")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from math import *
from pyspark.pandas.config import set_option
import pyspark.pandas as ps
%matplotlib inline
win_len = 5000
# For this be sure you have runtime 1.11 or earlier version
df_new = df_new.pandas_api()
print('Creating Average active power for U1 and V1...')
df_new['p_avg1'] = df_new.Current1.mul(df_new['Voltage1']).rolling(min_periods=1, window=win_len).mean()
print('Creating Average active power for U2 and V2...')
df_new['p_avg2'] = df_new.Current2.mul(df_new['Voltage2']).rolling(min_periods=1, window=win_len).mean()
print('Creating Average active power for U3 and V3...')
df_new['p_avg3'] = df_new.Current3.mul(df_new['Voltage3']).rolling(min_periods=1, window=win_len).mean()
print('Creating Average active power for U4 and V4...')
df_new['p_avg4'] = df_new.Current4.mul(df_new['Voltage4']).rolling(min_periods=1, window=win_len).mean()
print('Converting to Spark dataframe')
df_new = df_new.to_spark()
print('Complete')
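I had no problems using the pandas API before. I am on the latest runtime, 11.2, and only one DataFrame was loaded while I was using the cluster.
Thank you in advance.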
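The error message itself suggests the fix: "In order to allow this operation, enable 'compute.ops_on_diff_frames' option."
Here is how to enable this option, according to the documentation: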
import pyspark.pandas as ps
ps.set_option('compute.ops_on_diff_frames', True)
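The documentation includes this important warning:
By default, pandas API on Spark disallows operations on different DataFrames (or Series) to prevent expensive operations. It internally performs a join operation, which can be expensive in general.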
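To see why a join is involved, here is a toy model in plain Python. It is only an illustration, not how Spark implements the operation: dicts stand in for index-to-value Series, and the helper names `combine_same_frame` and `combine_diff_frames` are hypothetical. The assumption is that values from different frames are matched on their index labels with an outer join.

```python
# Toy model only: plain dicts and lists stand in for Series and DataFrames.

def combine_same_frame(rows):
    """Columns of one frame share rows, so the product is computed row by row."""
    return [row["current"] * row["voltage"] for row in rows]


def combine_diff_frames(left, right):
    """Series from different frames must first be matched on index labels."""
    # Outer join on the index: a label missing on either side yields None.
    labels = sorted(set(left) | set(right))
    return {
        label: (left[label] * right[label]
                if label in left and label in right else None)
        for label in labels
    }


# Same frame: no matching step is needed.
rows = [{"current": 1.0, "voltage": 10.0}, {"current": 2.0, "voltage": 20.0}]
same = combine_same_frame(rows)

# Different frames: the index labels only partly overlap.
current = {0: 1.0, 1: 2.0, 2: 3.0}
voltage = {1: 20.0, 2: 30.0, 3: 40.0}
joined = combine_diff_frames(current, voltage)
```

In a distributed setting the matching step means moving rows between partitions, which is the cost the warning refers to.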