Der*_*ski 5 python apache-spark pyspark apache-zeppelin apache-spark-mllib
I'm working in a Zeppelin notebook with Apache Spark, and I frequently get the following error:
org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:249)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:233)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:269)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:94)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:279)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
    at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:328)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
If I try to run the same code again (just ignoring the error), I get this (top line only):
java.net.SocketException: Broken pipe (Write failed)
Then, if I try to run it a third time (or any time after that), I get this error:
java.net.ConnectException: Connection refused (Connection refused)
If I restart the interpreter in Zeppelin, it works again (at first), but eventually I end up hitting the same error.
The error has occurred at various steps in my process (data cleaning, vectorization, etc.), but the point where it comes up most often (by far) is when I'm fitting the model. To give you a better idea of what I'm actually doing and when it typically happens, I'll walk you through my process:
I'm using Apache Spark ML and have done some standard vectorization and weighting (CountVectorizer, IDF), and I'm now building a model on that data.
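For anyone trying to follow the weighting step, the IDF part can be sketched in plain Python. Spark ML's IDF uses the smoothed formula log((m + 1) / (df + 1)), where m is the corpus size and df a term's document frequency; the toy corpus below is purely illustrative:

```python
import math

def spark_idf(num_docs, doc_freq):
    """Smoothed IDF as used by Spark ML's IDF: log((m + 1) / (df + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))

# Toy corpus: compute each term's document frequency
docs = [["spark", "ml"], ["spark", "zeppelin"], ["thrift"]]
doc_freq = {}
for doc in docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

idf = {term: spark_idf(len(docs), df) for term, df in doc_freq.items()}
# Terms in every document get a low weight; rarer terms get a higher one.
```

The TF-IDF columns fed into the assembler below are the per-term counts from CountVectorizer scaled by weights of this form.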
I create my feature vectors with VectorAssembler, convert them to dense vectors, and turn the result back into a DataFrame:
from pyspark.sql import Row
from pyspark.ml.linalg import DenseVector  # pyspark.mllib.linalg on Spark 1.x
from pyspark.ml.feature import VectorAssembler

# Combine the individual feature columns into a single "features" vector column
assembler = VectorAssembler(
    inputCols=["fileSize", "hour", "day", "month", "punct_title", "cap_title",
               "punct_excerpt", "title_tfidf", "ct_tfidf", "excerpt_tfidf",
               "regex_tfidf"],
    outputCol="features")
vector_train = assembler.transform(train_raw).select("Target", "features")
vector_test = assembler.transform(test_raw).select("Target", "features")

# Densify the assembled vectors and rename the columns to label/features
train_final = vector_train.rdd.map(lambda x: Row(label=x[0], features=DenseVector(x[1].toArray())))
test_final = vector_test.rdd.map(lambda x: Row(label=x[0], features=DenseVector(x[1].toArray())))
train_final_df = sqlContext.createDataFrame(train_final)
test_final_df = sqlContext.createDataFrame(test_final)
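One thing worth flagging about the densifying step above: mapping every sparse assembled vector through DenseVector materializes all ~15k slots per row, far more memory than the mostly-zero CountVectorizer/IDF output actually needs, which may contribute to the interpreter process dying. A back-of-the-envelope sketch, where the non-zero count per row is an illustrative assumption:

```python
# Rough per-row memory for a 15k-wide feature vector (illustrative arithmetic)
num_features = 15000   # width of the assembled vector
num_nonzero = 50       # assumed non-zeros per row for TF-IDF-style data

dense_bytes = num_features * 8         # one float64 per slot
sparse_bytes = num_nonzero * (8 + 4)   # float64 value + int32 index per non-zero

blowup = dense_bytes / sparse_bytes    # how much larger the dense row is
```

Keeping the vectors sparse (i.e., skipping the DenseVector conversion) is one way to shrink the working set.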
So the training set going into the model looks something like this (the real dataset has ~15k columns, and I've downsampled to ~5k examples just to try to get it to run):
[Row(features=DenseVector([7016.0, 9.0, 16.0, 2.0, 2.0, 4.0, 5.0, 0.0, 0.0, 0.0, ..., 1.315, 0.0, 0.0, ..., 0.0, 0.0, 0.0]), label=0)]
The next step is fitting the model, which is where the error usually pops up. I've tried both fitting a single model and running CV (with a ParamGrid):
Single model:
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxDepth=8, maxBins=16, maxIter=40)
GBT_model = gbt.fit(train_final_df)
predictions_GBT = GBT_model.transform(test_final_df)
predictions_GBT.cache()
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction")
auroc = evaluator.evaluate(predictions_GBT, {evaluator.metricName: "areaUnderROC"})
aupr = evaluator.evaluate(predictions_GBT, {evaluator.metricName: "areaUnderPR"})
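As a side note while debugging: areaUnderROC is a ranking metric, so wiring the evaluator to the hard 0/1 prediction column (rather than raw scores) collapses the ranking to two levels and gives a coarser estimate. A plain-Python sketch of what the metric computes, namely the probability that a random positive is ranked above a random negative:

```python
def auc_roc(labels, scores):
    """Pairwise-ranking form of areaUnderROC: ties count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With continuous scores a perfect ranking gives 1.0; with hard 0/1 predictions every same-valued pair ties at 0.5, which flattens the metric.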
With CV/ParamGrid:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.classification import GBTClassifier
GBT_model = GBTClassifier()
paramGrid = ParamGridBuilder() \
.addGrid(GBT_model.maxDepth, [2,4]) \
.addGrid(GBT_model.maxBins, [2,4]) \
.addGrid(GBT_model.maxIter, [10,20]) \
.build()
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", metricName="areaUnderPR")
crossval = CrossValidator(estimator=GBT_model, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)
cvModel = crossval.fit(train_final_df)
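It may also help to keep in mind how much work that CV cell schedules: every parameter combination is trained once per fold, plus a final refit on the full training set. Counting it out for the grid above:

```python
# Fits scheduled by the CrossValidator above
grid = {"maxDepth": [2, 4], "maxBins": [2, 4], "maxIter": [10, 20]}
num_folds = 5

num_combos = 1
for values in grid.values():
    num_combos *= len(values)   # 2 * 2 * 2 combinations

cv_fits = num_combos * num_folds   # fits during cross-validation
total_fits = cv_fits + 1           # plus the final refit on all training data
```

So a single crash anywhere in those dozens of GBT fits is enough to take the paragraph (and its thrift connection) down.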
I know it has something to do with the interpreter, but I can't figure out (a) what I'm doing wrong or (b) how to troubleshoot the failure.
UPDATE: I was asked in the SO Apache Spark chat for my versions and memory configurations, so I figured I'd provide an update here.
Versions:
Memory configurations:
After I went in and set these Zeppelin memory configurations, I ran my code again and still got the same error.

I just started working with Spark. Are there other memory configurations I still need to set? Are these memory configurations unreasonable?
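For a rough sense of scale when judging those settings: even one dense copy of the downsampled training matrix is sizeable before any of GBT's own working memory is counted. Illustrative arithmetic based on the row/column counts mentioned above:

```python
# Approximate size of one dense copy of the downsampled training matrix
rows = 5000            # downsampled examples
cols = 15000           # feature columns
bytes_per_double = 8   # float64 per dense slot

matrix_mb = rows * cols * bytes_per_double / 1024 ** 2  # roughly 572 MB
```

Several copies of something this size living in one JVM (driver plus interpreter process) can plausibly exhaust a modest heap.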
Viewed 923 times