ank*_*kit · apache-spark, pyspark
I am saving a DataFrame to a CSV file in PySpark with the following statement:
df_all.repartition(1).write.csv("xyz.csv", header=True, mode='overwrite')
but I am getting the error below:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 218, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 138, in read_udfs
arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 118, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 58, in read_command
command = serializer._read_with_length(file)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 170, in _read_with_length
return self.loads(obj)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 559, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'app'
I am using PySpark 2.3.0. The error appears when I try to write the file. Here is the code that produces it:
import json, jsonschema
from pyspark.sql import functions
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType, FloatType
from datetime import datetime
import os
feb = self.filter_data(self.SRC_DIR + "tl_feb19.csv", 13)
apr = self.filter_data(self.SRC_DIR + "tl_apr19.csv", 15)
df_all = feb.union(apr)
df_all = df_all.dropDuplicates(subset=["PRIMARY_ID"])
create_emi_amount_udf = udf(create_emi_amount, FloatType())
df_all = df_all.withColumn("EMI_Amount", create_emi_amount_udf('Sanction_Amount', 'Loan_Type'))
df_all.write.csv(self.DST_DIR + "merged_amounts.csv", header=True, mode='overwrite')
The error is clear: there is no module named 'app'. Your Python code runs on the driver, while your UDF runs in the executors' Python VMs. When you call the UDF, Spark serializes `create_emi_amount` in order to ship it to the executors.
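A minimal sketch of why the module matters, using the standard-library `pickle` (Spark's own serializers differ in detail, but the idea is the same: a function pickled by reference records its module and name, and the deserializing side must be able to import that module). The function here is a hypothetical stand-in for the one in the question:

```python
import pickle

# Hypothetical stand-in for the UDF from the question.
def create_emi_amount(sanction_amount):
    return sanction_amount * 0.1

# Pickling a top-level function stores its qualified name (module +
# attribute), not its bytecode.  The unpickling side must be able to
# import that module.  If the real function lives in an `app` package
# that is not installed in the executors' Python environment,
# deserialization there fails with "No module named 'app'".
data = pickle.dumps(create_emi_amount)
restored = pickle.loads(data)
```

Here both ends are the same interpreter, so unpickling succeeds; on a cluster, the executors' interpreter plays the role of the unpickling side.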
So somewhere in `create_emi_amount`, you use or import the `app` module. The fix is to give the driver and the executors the same Python environment: in `spark-env.sh`, set `PYSPARK_DRIVER_PYTHON=...` and `PYSPARK_PYTHON=...` to the Python of the virtualenv where your code is installed.
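For example (the virtualenv path below is illustrative, not from the original post):

```shell
# $SPARK_HOME/conf/spark-env.sh -- point both the driver and the
# executors at the same virtualenv interpreter so that any module the
# UDF references (here, `app`) is importable on every node.
export PYSPARK_DRIVER_PYTHON=/opt/venvs/etl/bin/python
export PYSPARK_PYTHON=/opt/venvs/etl/bin/python
```

On a multi-node cluster the virtualenv must exist at that path on every worker as well, not just on the driver machine.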