Posts by Jer*_*rge

How to find the index of elements in a Pyspark RDD?

This is my first question. I am coding in PySpark. I have an RDD:

['a,b,c,d,e,f']

How do I find the index of the element 'e'?

I tried zipWithIndex, but it's not giving me any index.

I saw a similar question, but the solution mentioned there did not return the index:

rdd.zipWithIndex().filter(lambda key,index : key == 'e') \
    .map(lambda key,index : index).collect()

I am getting an error.

Please let me know how to find the index.
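
A possible cause, offered as a sketch rather than a confirmed diagnosis: zipWithIndex() yields one (element, index) tuple per record, so a two-parameter lambda such as `lambda key, index: ...` receives only a single argument and raises a TypeError. In addition, the RDD shown holds one string, 'a,b,c,d,e,f', so it would need to be split before 'e' can match anything. The snippet below mimics that logic on a plain Python list (no Spark session is assumed); `is_e` and `get_index` are illustrative names, not part of any API.

```python
# Local stand-in for the RDD contents shown in the question.
records = ['a,b,c,d,e,f']

# Equivalent of rdd.flatMap(lambda line: line.split(',')): one item per letter.
elements = [item for line in records for item in line.split(',')]

# Same shape as zipWithIndex() output: (element, index) tuples.
pairs = [(element, index) for index, element in enumerate(elements)]


def is_e(pair):
    """Hypothetical helper: takes ONE argument, the (element, index) tuple."""
    return pair[0] == 'e'


def get_index(pair):
    """Hypothetical helper: returns the index part of the tuple."""
    return pair[1]


indices = [get_index(pair) for pair in pairs if is_e(pair)]
print(indices)
```

With a real RDD, the same two functions could be passed along as `rdd.flatMap(lambda line: line.split(',')).zipWithIndex().filter(is_e).map(get_index).collect()`.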

Based on the solution provided: …

python indexing apache-spark rdd pyspark

3 votes · 1 answer · 2298 views

Split a PySpark RDD into separate columns and convert it to a DataFrame

I have an RDD:

a,1,2,3,4
b,4,6
c,8,9,10,11

I want to convert it to a Spark DataFrame with an index:

df:

Index  Name  Number
 0      a     1,2,3,4
 1      b     4,6
 2      c     8,9,10,11

I tried splitting the RDD:

parts = rdd.flatMap(lambda x: x.split(","))

But the result is:

a,
1,
2,
3,...
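
The flattened result above is what flatMap is expected to produce: it splits each line and then merges all the pieces into one sequence, so the row boundaries are lost. A rough local comparison, using plain Python lists as a stand-in for the RDD (no Spark session is assumed; `lines`, `flat`, and `nested` are illustrative names):

```python
from itertools import chain

# Local stand-in for the RDD contents shown in the question.
lines = ['a,1,2,3,4', 'b,4,6', 'c,8,9,10,11']

# flatMap-like: split every line, then flatten all pieces into one sequence.
flat = list(chain.from_iterable(line.split(',') for line in lines))

# map-like: one result per input line; maxsplit=1 splits only at the first comma,
# so the name is separated from the remaining comma-joined numbers.
nested = [line.split(',', 1) for line in lines]

print(flat)
print(nested)
```

On an RDD, the second form would correspond to `rdd.map(lambda x: x.split(",", 1))`, which keeps one record per input line.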

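How can I split the RDD and convert it to a DataFrame in PySpark so that the first element becomes the first column and the remaining elements are combined into a single column?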

As described in the solution:

rd = rd1.map(lambda x: x.split("," , 1) ).zipWithIndex()
rd.take(3)

Output:

[(['a', '1,2,3,4'], 0),
(['b', '4,6'], 1),
(['c', '8,9,10,11'], 2)]

Next step:

rd2=rd2=rd.map(lambda x,y: (y, x[0] , x[1]) ).toDF(["index", "name" ,"number"])
rd2.collect()

I get the following error:

An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage …
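
The stack trace is truncated, so the cause cannot be confirmed from the post alone. One likely culprit is the `lambda x,y: ...` passed to map: each record produced by zipWithIndex() is a single `([name, numbers], index)` tuple, so the function receives one argument, not two. A minimal sketch of a one-argument version, run on a plain list that copies the `rd.take(3)` output (no Spark session is assumed; `to_row` is a hypothetical helper name):

```python
# Local stand-in for the rd.take(3) output shown in the post.
indexed = [(['a', '1,2,3,4'], 0), (['b', '4,6'], 1), (['c', '8,9,10,11'], 2)]


def to_row(pair):
    """Hypothetical helper: unpack the single (fields, index) tuple inside the body."""
    fields, index = pair
    return (index, fields[0], fields[1])


rows = [to_row(pair) for pair in indexed]
print(rows)
```

With a live Spark session, the same function could presumably be used as `rd.map(to_row).toDF(["index", "name", "number"])`.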

python dataframe apache-spark rdd pyspark

2 votes · 1 answer · 2402 views
