This is my first question. I am coding in PySpark. I have an RDD:
['a,b,c,d,e,f']
How do I find the index of the element 'e'?
I tried zipWithIndex, but it's not giving me any index.
I saw a similar question, but the solution mentioned there did not return the index either:
rdd.zipWithIndex().filter(lambda key,index : key == 'e') \
.map(lambda key,index : index).collect()
I am getting an error.
Please let me know how to find the index.
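Two things go wrong here: the RDD as shown holds a single comma-separated string, so it must be split into elements first, and in Python 3 a lambda can no longer unpack a tuple argument, so `lambda key, index:` raises a TypeError (`zipWithIndex` hands the filter one `(element, index)` pair). The corrected PySpark chain is `filter(lambda kv: kv[0] == 'e').map(lambda kv: kv[1])` after a `flatMap` split; below is a plain-Python sketch of that same logic, with `enumerate` standing in for `zipWithIndex`, so it runs without a Spark cluster:

```python
data = ['a,b,c,d,e,f']            # the RDD's single string element

# flatMap-style split into individual elements
elems = [e for s in data for e in s.split(',')]

# zipWithIndex yields (element, index) pairs; enumerate yields
# (index, element), so swap the order to match
pairs = [(e, i) for i, e in enumerate(elems)]

# filter on the element, then keep only the index
indices = [kv[1] for kv in pairs if kv[0] == 'e']
print(indices)  # [4]
```

The same one-parameter lambdas (`lambda kv: kv[0]`, `lambda kv: kv[1]`) drop straight into the PySpark version.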
Based on the solution provided: …
I have an RDD:
a,1,2,3,4
b,4,6
c,8,9,10,11
I want to convert it to a Spark DataFrame with an index:
df:
Index Name Number
0 a 1,2,3,4
1 b 4,6
2 c 8,9,10,11
I tried splitting the RDD:
parts = rdd.flatMap(lambda x: x.split(","))
but the result is:
a,
1,
2,
3,...
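`flatMap` flattens every split piece into one long stream of values, which is why the records lose their structure. Using `map` with a maxsplit of 1 instead keeps each record as a (name, rest-of-line) pair. A plain-Python sketch of the difference, with list comprehensions standing in for the RDD operations:

```python
records = ['a,1,2,3,4', 'b,4,6', 'c,8,9,10,11']

# flatMap-style: every piece spills into one flat stream
flat = [piece for r in records for piece in r.split(',')]
print(flat[:4])   # ['a', '1', '2', '3']

# map-style with maxsplit=1: one [name, numbers] pair per record
pairs = [r.split(',', 1) for r in records]
print(pairs[0])   # ['a', '1,2,3,4']
```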
How do I split the RDD and convert it into a DataFrame in PySpark, so that the first element becomes the first column and the remaining elements are merged into a single column?
As mentioned in the solution:
rd = rd1.map(lambda x: x.split("," , 1) ).zipWithIndex()
rd.take(3)
Output:
[(['a', '1,2,3,4'], 0),
(['b', '4,6'], 1),
(['c', '8,9,10,11'], 2)]
Next step:
rd2 = rd.map(lambda x, y: (y, x[0], x[1])).toDF(["index", "name", "number"])
rd2.collect()
I receive the following error:
An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage …
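The stage failure comes from `lambda x, y:`: `map` passes each record as one `((name, numbers), index)` tuple, so the lambda must take a single parameter and index into it, e.g. `rd.map(lambda xi: (xi[1], xi[0][0], xi[0][1])).toDF(["index", "name", "number"])`. A plain-Python sketch of that reshaping step, using a list in place of the RDD so it runs without Spark:

```python
# Plain-Python stand-in for rd.take(3)'s output above
rd = [(['a', '1,2,3,4'], 0), (['b', '4,6'], 1), (['c', '8,9,10,11'], 2)]

# One-parameter lambda-style access: each record xi is a single
# (pair, index) tuple, so unpack it by indexing, not by two arguments
rows = [(xi[1], xi[0][0], xi[0][1]) for xi in rd]
print(rows[0])  # (0, 'a', '1,2,3,4')
```

In PySpark, feeding these `(index, name, number)` tuples to `toDF(["index", "name", "number"])` yields the DataFrame shown at the top of the question.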