I am new to PySpark. I have been developing my code against a small test sample, but as soon as I run it on a larger file (3 GB compressed) it fails. The code only does some filtering and joins, yet I keep getting errors about Py4J.
Any help would be useful and appreciated.
from pyspark.sql import SparkSession

ss = (
    SparkSession
    .builder
    .appName("Example")
    .getOrCreate()
)
ss.conf.set("spark.sql.execution.arrow.enabled", "true")
df = ss.read.csv(directory + '/' + filename, header=True, sep=",")
# Some filtering and groupbys...
df.show()
This returns:
Py4JJavaError: An error occurred while calling o88.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID
1, localhost, executor driver): java.lang.OutOfMemoryError: …

I am fitting a LassoCV with 1000 coefficients. Statsmodels does not seem to handle that many coefficients, so I am using scikit-learn. Statsmodels allows .fit_constrained("coef1 + coef2 ... = 1"), which constrains the sum of the coefficients to equal 1. I need to do the same in scikit-learn. I am also keeping the intercept at zero.
from sklearn.linear_model import LassoCV
LassoCVmodel = LassoCV(fit_intercept=False)
LassoCVmodel.fit(x,y)
Any help would be appreciated.
python regression machine-learning scikit-learn sklearn-pandas
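scikit-learn's LassoCV has no equivalent of statsmodels' fit_constrained, so one workaround is to minimize the lasso objective directly under the sum-to-one equality constraint with scipy. A sketch under stated assumptions: the data is synthetic, the regularization strength `alpha` is fixed rather than chosen by cross-validation as LassoCV would do, and there is no intercept term.

```python
# Sketch: lasso with coefficients constrained to sum to 1 and no
# intercept, solved as a generic constrained minimization (assumption:
# a fixed alpha stands in for LassoCV's cross-validated choice).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([0.5, 0.3, 0.2, 0.0, 0.0])  # sums to 1
y = X @ true_coef + 0.01 * rng.normal(size=200)

alpha = 0.01  # illustrative regularization strength

def objective(w):
    # Lasso objective, no intercept: MSE / 2 + alpha * L1 penalty
    resid = y - X @ w
    return resid @ resid / (2 * len(y)) + alpha * np.abs(w).sum()

# Equality constraint: sum of coefficients equals 1
cons = {"type": "eq", "fun": lambda w: w.sum() - 1.0}

# Start from a feasible point (uniform weights summing to 1)
res = minimize(objective, x0=np.full(5, 0.2), constraints=cons, method="SLSQP")
coef = res.x
print(coef, coef.sum())
```

SLSQP handles the nonsmooth L1 term only approximately; for exact sparsity a proximal or quadratic-programming formulation would be more faithful, but this sketch shows the constraint mechanism.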
Please help — I don't know why this error is happening. I have used this code before without any problems, so I hope it isn't something silly. Help is always appreciated.
Versions:
Python 3.6
pandas 0.23.0
xlsxwriter 1.0.4
writer = pd.ExcelWriter('Output.xlsx', engine='xlsxwriter')
workbook = writer.book
worksheet = writer.sheets['Sheet1']
Output:
Traceback (most recent call last):
File "/opt/eclipse/dropins/plugins/org.python.pydev.core_7.2.0.201903251948/pysrc/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<console>", line 1, in <module>
KeyError: 'Sheet1'
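The KeyError occurs because writer.sheets starts out empty: pandas only registers a sheet name once a DataFrame has actually been written to it. A sketch of the usual fix — write a DataFrame (an illustrative empty one here) before looking the sheet up:

```python
# Sketch: writer.sheets is populated lazily, so 'Sheet1' only exists
# after a to_excel() call targets it. Filename is illustrative.
import pandas as pd

writer = pd.ExcelWriter("Output.xlsx", engine="xlsxwriter")
pd.DataFrame().to_excel(writer, sheet_name="Sheet1")  # registers 'Sheet1'
workbook = writer.book
worksheet = writer.sheets["Sheet1"]  # no KeyError now
writer.close()
```

If the error appeared in code that previously worked, the usual cause is that a to_excel() call earlier in the script was removed or moved after the writer.sheets lookup.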
python ×3
pandas ×1
py4j ×1
pyspark ×1
pyspark-sql ×1
regression ×1
scikit-learn ×1
xlsxwriter ×1