I am trying to load a small dataset into local Spark. When I call count() in PySpark it fails, while take() appears to work fine. I searched for this problem but could not figure out why; it looks like something is wrong with the RDD's partitions. Any ideas? Thanks in advance!
from pyspark import SparkContext

# Stop the notebook's existing SparkContext and start a fresh local one
sc.stop()
sc = SparkContext("local[4]", "temp")
# localpath() is a helper that resolves the file's path on the local machine
testfile1 = sc.textFile(localpath('part-00000-Copy1.xml'))
testfile1.filter(lambda x: x.strip().encode('utf-8').startswith(b'<row')).take(1)  # take() seems to work
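As far as I understand, take(1) only reads as many lines as it needs from the first partition, while count() forces every partition to be evaluated, so an error hiding deeper in the file would only surface on count(). To localize which partition fails, I would try a per-partition count like the sketch below (untested on this data; it only uses the standard mapPartitionsWithIndex API on the testfile1 RDD above):

def count_partition(idx, lines):
    # Yield (partition index, line count); if a partition's data is bad,
    # the job should die while evaluating that partition
    yield (idx, sum(1 for _ in lines))

testfile1.mapPartitionsWithIndex(count_partition).collect()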
This is what the data looks like:
[' <row AcceptedAnswerId="15" AnswerCount="5" Body="<p>How should I elicit prior distributions from experts when fitting a Bayesian model?</p> " CommentCount="1" CreationDate="2010-07-19T19:12:12.510" FavoriteCount="17" Id="1" LastActivityDate="2010-09-15T21:08:26.077" OwnerUserId="8" PostTypeId="1" Score="26" Tags="<bayesian><prior><elicitation>" Title="Eliciting priors from experts" ViewCount="1457" />']
And this is where the problem occurs:
# Same predicate as above, plus a null check; count() raises the exception below
test1 = testfile1.filter(lambda x: x.strip().encode('utf-8').startswith(b'<row')).filter(lambda x: x is not None)
test1.count()
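In case the failure comes from a malformed line somewhere past what take(1) reads, one workaround I am considering (a sketch, not verified against this file) is to make the predicate defensive, so a bad line is skipped instead of failing the whole job:

def safe_is_row(line):
    # Return False instead of raising if a line cannot be stripped/encoded
    try:
        return line.strip().encode('utf-8').startswith(b'<row')
    except Exception:
        return False

test1 = testfile1.filter(safe_is_row)
test1.count()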
This is the exception:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-34-d7626ed81f56> in <module>()
----> 1 test1.count() …