PySpark: Using repartitionAndSortWithinPartitions with multiple sort criteria

Question

Suppose I have the following RDD:

rdd = sc.parallelize([('a', (5,1)), ('d', (8,2)), ('2', (6,3)), ('a', (8,2)), ('d', (9,6)), ('b', (3,4)),('c', (8,3))])

How can I use repartitionAndSortWithinPartitions to sort by x[0] and then by x[1][0]? With the following, I only sort by the key (x[0]):

Npartitions = sc.defaultParallelism
rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: hash(x) % Npartitions, 2)

The following works, but I think there should be a simpler way:

Npartitions = sc.defaultParallelism
partitioned_data = (rdd
  .partitionBy(2)
  .map(lambda x: (x[0], x[1][0], x[1][1]))
  .toDF(['letter', 'number2', 'number3'])
  .sortWithinPartitions(['letter', 'number2'], ascending=False)
  .rdd  # DataFrame has no map() in Spark 2.0+; go back to an RDD first
  .map(lambda x: (x.letter, (x.number2, x.number3))))

>>> partitioned_data.glom().collect()

[[],
[(u'd', (9, 6)), (u'd', (8, 2))],
[(u'c', (8, 3)), (u'c', (6, 3))],
[(u'b', (3, 4))],
[(u'a', (8, 2)), (u'a', (5, 1))]]

As you can see, I have to convert the RDD to a DataFrame in order to use sortWithinPartitions. Is there another way, using repartitionAndSortWithinPartitions?

(It doesn't matter that the data is not globally sorted. I only care about sorting within partitions.)

Answer 1

zer*_*323 10

It is possible, but you have to include all the required information in a composite key:

from pyspark.rdd import portable_hash

n = 2

def partitioner(n):
    """Partition by the first item in the key tuple"""
    def partitioner_(x):
        return portable_hash(x[0]) % n
    return partitioner_


(rdd
  .keyBy(lambda kv: (kv[0], kv[1][0]))  # Create temporary composite key
  .repartitionAndSortWithinPartitions(
      numPartitions=n, partitionFunc=partitioner(n), ascending=False)
  .map(lambda x: x[1]))  # Drop key (note: there is no partitioner set anymore)

Step-by-step explanation:

keyBy(lambda kv: (kv[0], kv[1][0])) creates a substitute key made up of the original key and the first element of the value. In other words, it transforms:
```
(0, (5,1))
```
into
```
((0, 5), (0, (5, 1)))
```
In practice, it can be slightly more efficient to simply reshape the data to
```
((0, 5), 1)
```
partitioner defines the partitioning function based on the hash of the first element of the key, so:
```
partitioner(7)((0, 5))
## 0

partitioner(7)((0, 6))
## 0

partitioner(7)((0, 99))
## 0

partitioner(7)((3, 99))
## 3
```
You can see that it is consistent and ignores the second element of the key.
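If you want to check this without a Spark installation, here is a minimal sketch. It makes one assumption that is not part of the answer: Python's built-in `hash` stands in for `portable_hash` (for small non-negative integers it simply returns the integer). The keys are made up for illustration.

```python
from collections import defaultdict


def partitioner(n):
    """Partition by the first item in the key tuple."""
    def partitioner_(x):
        # Assumption: built-in hash() stands in for pyspark's portable_hash.
        # For small non-negative ints, hash(i) == i.
        return hash(x[0]) % n
    return partitioner_


# Hypothetical composite keys: (original key, first element of the value)
keys = [(0, 5), (0, 6), (0, 99), (3, 99), (3, 1), (10, 2)]

# Group the keys by the partition index they are assigned to
by_partition = defaultdict(list)
for key in keys:
    by_partition[partitioner(7)(key)].append(key)

for index in sorted(by_partition):
    print(index, by_partition[index])
```

Keys that share a first element always land in the same partition. Different first elements can still collide (here 3 and 10 both map to partition 3), which is harmless when only the ordering inside each partition matters.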
We use the default keyfunc (lambda x: x), which relies on the lexicographic ordering defined on Python tuples:
```
(0, 5) < (1, 5)
## True

(0, 5) < (0, 4)
## False
```
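Since the composite key is a plain tuple, a descending sort (which is what `ascending=False` asks for) orders records by the first element and breaks ties on the second. A small pure-Python illustration, with made-up keys:

```python
# Hypothetical composite keys of the form (letter, number)
keys = [('a', 5), ('d', 8), ('a', 8), ('d', 9), ('b', 3)]

# Tuples compare element by element, so reverse=True sorts by letter
# descending and then by number descending within the same letter.
descending = sorted(keys, reverse=True)
print(descending)
```

Note that a single flag reverses both criteria together. Mixing directions (say, letter ascending and number descending) would need a custom keyfunc instead of the default one.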

As mentioned earlier, you can reshape the data:

rdd.map(lambda kv: ((kv[0], kv[1][0]), kv[1][1]))


and drop the final map to improve performance.
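To see how the pieces fit together, here is a rough pure-Python imitation of the whole approach on the data from the question. It is only a sketch of the logic, not Spark's implementation: `repartition_and_sort` and `bucket_by_letter` are hypothetical helpers, and `ord()` is used as a deterministic stand-in for `portable_hash`.

```python
def repartition_and_sort(records, n, partition_func, ascending=True):
    """Imitation only: bucket by partition_func(key), then sort each bucket by key."""
    partitions = [[] for _ in range(n)]
    for key, value in records:
        partitions[partition_func(key) % n].append((key, value))
    return [sorted(p, key=lambda kv: kv[0], reverse=not ascending)
            for p in partitions]


def bucket_by_letter(key):
    # Hypothetical stand-in for portable_hash(key[0]); looks only at the letter
    return ord(key[0])


data = [('a', (5, 1)), ('d', (8, 2)), ('2', (6, 3)), ('a', (8, 2)),
        ('d', (9, 6)), ('b', (3, 4)), ('c', (8, 3))]

# Reshape to ((letter, first number), second number) so the key carries
# everything needed for sorting
reshaped = [((k, v[0]), v[1]) for k, v in data]

result = repartition_and_sort(reshaped, 2, bucket_by_letter, ascending=False)
for index, partition in enumerate(result):
    print(index, partition)
```

Each partition comes out ordered by letter and then by the first number, both descending, and every letter stays within a single partition.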
