Der*_*ski 5 python apache-spark pyspark apache-zeppelin apache-spark-mllib
I'm working in a Zeppelin notebook with Apache Spark, and I frequently get the following error:
org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:249)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:233)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:269)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:94)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:279)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
    at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:328)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
If I try to run the same code again (just ignoring the error), I get this (top line only):
java.net.SocketException: Broken pipe (Write failed)
Then, if I try to run it a third time (or any time after that), I get this error:
java.net.ConnectException: Connection refused (Connection refused)
If I restart the interpreter in Zeppelin, it works again (at first), but eventually I end up hitting the same error.
The error has occurred at various steps in my process (data cleaning, vectorization, etc.), but the point where it comes up most often (by far) is when I'm fitting the model. To give you a better idea of what I'm actually doing and when it typically happens, I'll walk you through my process:
I'm using Apache Spark ML and have done some standard vectorization and weighting (CountVectorizer, IDF), and I'm now building a model on that data.
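For anyone trying to follow the weighting step, the IDF part can be sketched in plain Python. Spark ML's IDF uses the smoothed formula log((m + 1) / (df + 1)), where m is the corpus size and df a term's document frequency; the toy corpus below is purely illustrative:

```python
import math

def spark_idf(num_docs, doc_freq):
    """Smoothed IDF as used by Spark ML's IDF: log((m + 1) / (df + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))

# Toy corpus: compute each term's document frequency
docs = [["spark", "ml"], ["spark", "zeppelin"], ["thrift"]]
doc_freq = {}
for doc in docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

idf = {term: spark_idf(len(docs), df) for term, df in doc_freq.items()}
# Terms in every document get a low weight; rarer terms get a higher one.
```

The TF-IDF columns fed into the assembler below are the per-term counts from CountVectorizer scaled by weights of this form.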
I create my feature vectors with VectorAssembler, convert them to dense vectors, and turn the result back into a DataFrame:
from pyspark.sql import Row
from pyspark.ml.linalg import DenseVector  # pyspark.mllib.linalg on Spark 1.x
from pyspark.ml.feature import VectorAssembler

# Combine the individual feature columns into a single "features" vector column
assembler = VectorAssembler(
    inputCols=["fileSize", "hour", "day", "month", "punct_title", "cap_title",
               "punct_excerpt", "title_tfidf", "ct_tfidf", "excerpt_tfidf",
               "regex_tfidf"],
    outputCol="features")
vector_train = assembler.transform(train_raw).select("Target", "features")
vector_test = assembler.transform(test_raw).select("Target", "features")

# Densify the assembled vectors and rename the columns to label/features
train_final = vector_train.rdd.map(lambda x: Row(label=x[0], features=DenseVector(x[1].toArray())))
test_final = vector_test.rdd.map(lambda x: Row(label=x[0], features=DenseVector(x[1].toArray())))
train_final_df = sqlContext.createDataFrame(train_final)
test_final_df = sqlContext.createDataFrame(test_final)
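One thing worth flagging about the densifying step above: mapping every sparse assembled vector through DenseVector materializes all ~15k slots per row, far more memory than the mostly-zero CountVectorizer/IDF output actually needs, which may contribute to the interpreter process dying. A back-of-the-envelope sketch, where the non-zero count per row is an illustrative assumption:

```python
# Rough per-row memory for a 15k-wide feature vector (illustrative arithmetic)
num_features = 15000   # width of the assembled vector
num_nonzero = 50       # assumed non-zeros per row for TF-IDF-style data

dense_bytes = num_features * 8         # one float64 per slot
sparse_bytes = num_nonzero * (8 + 4)   # float64 value + int32 index per non-zero

blowup = dense_bytes / sparse_bytes    # how much larger the dense row is
```

Keeping the vectors sparse (i.e., skipping the DenseVector conversion) is one way to shrink the working set.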
So the training set going into the model looks something like this (the real dataset has ~15k columns, and I've downsampled to ~5k examples just to try to get it to run):
[Row(features=DenseVector([7016.0, 9.0, 16.0, 2.0, 2.0, 4.0, 5.0, 0.0, 0.0, 0.0, ..., 1.315, 0.0, 0.0, ..., 0.0, 0.0, 0.0]), label=0)]
The next step is fitting the model, which is where the error usually pops up. I've tried both fitting a single model and running CV (with a ParamGrid):
Single model:
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxDepth=8, maxBins=16, maxIter=40)
GBT_model = gbt.fit(train_final_df)
predictions_GBT = GBT_model.transform(test_final_df)
predictions_GBT.cache()
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction")
auroc = evaluator.evaluate(predictions_GBT, {evaluator.metricName: "areaUnderROC"})
aupr = evaluator.evaluate(predictions_GBT, {evaluator.metricName: "areaUnderPR"})
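As a side note while debugging: areaUnderROC is a ranking metric, so wiring the evaluator to the hard 0/1 prediction column (rather than raw scores) collapses the ranking to two levels and gives a coarser estimate. A plain-Python sketch of what the metric computes, namely the probability that a random positive is ranked above a random negative:

```python
def auc_roc(labels, scores):
    """Pairwise-ranking form of areaUnderROC: ties count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With continuous scores a perfect ranking gives 1.0; with hard 0/1 predictions every same-valued pair ties at 0.5, which flattens the metric.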
With CV/ParamGrid:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.classification import GBTClassifier
GBT_model = GBTClassifier()
paramGrid = ParamGridBuilder() \
.addGrid(GBT_model.maxDepth, [2,4]) \
.addGrid(GBT_model.maxBins, [2,4]) \
.addGrid(GBT_model.maxIter, [10,20]) \
.build()
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", metricName="areaUnderPR")
crossval = CrossValidator(estimator=GBT_model, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)
cvModel = crossval.fit(train_final_df)
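It may also help to keep in mind how much work that CV cell schedules: every parameter combination is trained once per fold, plus a final refit on the full training set. Counting it out for the grid above:

```python
# Fits scheduled by the CrossValidator above
grid = {"maxDepth": [2, 4], "maxBins": [2, 4], "maxIter": [10, 20]}
num_folds = 5

num_combos = 1
for values in grid.values():
    num_combos *= len(values)   # 2 * 2 * 2 combinations

cv_fits = num_combos * num_folds   # fits during cross-validation
total_fits = cv_fits + 1           # plus the final refit on all training data
```

So a single crash anywhere in those dozens of GBT fits is enough to take the paragraph (and its thrift connection) down.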
I know it has something to do with the interpreter, but I can't figure out (a) what I'm doing wrong or (b) how to troubleshoot the failure.
UPDATE: I was asked in the SO Apache Spark chat for my versions and memory configurations, so I figured I'd provide an update here.
Versions:
Memory configurations:
After I went in and set these Zeppelin memory configurations, I ran my code again and still got the same error.

I just started working with Spark. Are there other memory configurations I still need to set? Are these memory configurations unreasonable?
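For a rough sense of scale when judging those settings: even one dense copy of the downsampled training matrix is sizeable before any of GBT's own working memory is counted. Illustrative arithmetic based on the row/column counts mentioned above:

```python
# Approximate size of one dense copy of the downsampled training matrix
rows = 5000            # downsampled examples
cols = 15000           # feature columns
bytes_per_double = 8   # float64 per dense slot

matrix_mb = rows * cols * bytes_per_double / 1024 ** 2  # roughly 572 MB
```

Several copies of something this size living in one JVM (driver plus interpreter process) can plausibly exhaust a modest heap.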
Viewed 923 times