I am trying to load a small dataset into local Spark. When I call count() in PySpark it fails, while take() appears to work fine. I searched for this problem but could not figure out why; it looks like something is wrong with the RDD's partitions. Any ideas? Thanks in advance!
from pyspark import SparkContext

# Stop the notebook's existing SparkContext and start a fresh local one
sc.stop()
sc = SparkContext("local[4]", "temp")
# localpath() is a helper that resolves the file's path on the local machine
testfile1 = sc.textFile(localpath('part-00000-Copy1.xml'))
testfile1.filter(lambda x: x.strip().encode('utf-8').startswith(b'<row')).take(1)  # take() seems to work
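As far as I understand, take(1) only reads as many lines as it needs from the first partition, while count() forces every partition to be evaluated, so an error hiding deeper in the file would only surface on count(). To localize which partition fails, I would try a per-partition count like the sketch below (untested on this data; it only uses the standard mapPartitionsWithIndex API on the testfile1 RDD above):

def count_partition(idx, lines):
    # Yield (partition index, line count); if a partition's data is bad,
    # the job should die while evaluating that partition
    yield (idx, sum(1 for _ in lines))

testfile1.mapPartitionsWithIndex(count_partition).collect()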
This is what the data looks like:
[' <row AcceptedAnswerId="15" AnswerCount="5" Body="<p>How should I elicit prior distributions from experts when fitting a Bayesian model?</p> " CommentCount="1" CreationDate="2010-07-19T19:12:12.510" FavoriteCount="17" Id="1" LastActivityDate="2010-09-15T21:08:26.077" OwnerUserId="8" PostTypeId="1" Score="26" Tags="<bayesian><prior><elicitation>" Title="Eliciting priors from experts" ViewCount="1457" />']
And this is where the problem occurs:
# Same predicate as above, plus a null check; count() raises the exception below
test1 = testfile1.filter(lambda x: x.strip().encode('utf-8').startswith(b'<row')).filter(lambda x: x is not None)
test1.count()
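In case the failure comes from a malformed line somewhere past what take(1) reads, one workaround I am considering (a sketch, not verified against this file) is to make the predicate defensive, so a bad line is skipped instead of failing the whole job:

def safe_is_row(line):
    # Return False instead of raising if a line cannot be stripped/encoded
    try:
        return line.strip().encode('utf-8').startswith(b'<row')
    except Exception:
        return False

test1 = testfile1.filter(safe_is_row)
test1.count()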
This is the exception:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-34-d7626ed81f56> in <module>()
----> 1 test1.count() …