How to find the index of elements in a Pyspark RDD?

Jer*_*rge 3 python indexing apache-spark rdd pyspark

This is my first question. I am coding in Pyspark. I have an RDD:

['a,b,c,d,e,f']

How do I find the index of the element 'e'?

I tried zipWithIndex but it's not giving me any index.

I saw a similar question, but the solution mentioned there did not return the index for me:

rdd.zipWithIndex().filter(lambda key,index : key == 'e') \
    .map(lambda key,index : index).collect()

I am getting an error.

Please let me know how to find the index.

Based on the solution provided:

I still have a problem. My RDD is in this format:

['a,b,c,d,e,f']

So when I try :

rdd.zipWithIndex().lookup('e')

I get an empty list: []

How should I proceed?

Thanks

hi-*_*zir 5

You are getting an exception because both map and filter expect single-argument functions:

rdd = sc.parallelize(['a', 'b', 'c', 'd', 'e', 'f'])

(rdd
    .zipWithIndex()
    .filter(lambda ki: ki[0] == 'e')
    .map(lambda ki: ki[1])
    .collect())

# [4]

In prehistoric Python versions, tuple unpacking in the lambda would work as well:

(rdd
    .zipWithIndex()
    .filter(lambda (key, index): key == 'e')
    .map(lambda (key, index): index))

But I hope you won't use either of these.

Personally, I would just use lookup:

rdd.zipWithIndex().lookup('e')
# [4]
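Regarding the follow-up: if the RDD really contains a single comma-separated string, ['a,b,c,d,e,f'], then lookup('e') returns [] because the only element is that one long string. A minimal sketch, assuming you first want to split that string into individual elements (the variable names here are just illustrative):

# Assumption: the RDD holds one comma-separated string, e.g. ['a,b,c,d,e,f'];
# split it into individual elements before indexing.
raw = sc.parallelize(['a,b,c,d,e,f'])

elements = raw.flatMap(lambda s: s.split(','))  # 'a', 'b', 'c', 'd', 'e', 'f'
elements.zipWithIndex().lookup('e')
# [4]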

Also, keep in mind that the order of values in an RDD can be nondeterministic.
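If a stable ordering matters, one option (a sketch, not part of the original answer) is to sort before zipping with the index, so the assigned indices do not depend on partitioning:

rdd = sc.parallelize(['a', 'b', 'c', 'd', 'e', 'f'])

# Sorting first makes the index assignment deterministic.
rdd.sortBy(lambda x: x).zipWithIndex().lookup('e')
# [4]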