How do I get the length of a list in a column of a Spark DataFrame?

yan*_*hen 15 pyspark

I have a df whose "products" column contains lists, like this:

+----------+---------+--------------------+
|member_srl|click_day|            products|
+----------+---------+--------------------+
|        12| 20161223|  [2407, 5400021771]|
|        12| 20161226|        [7320, 2407]|
|        12| 20170104|              [2407]|
|        12| 20170106|              [2407]|
|        27| 20170104|        [2405, 2407]|
|        28| 20161212|              [2407]|
|        28| 20161213|      [2407, 100093]|
|        28| 20161215|           [1956119]|
|        28| 20161219|      [2407, 100093]|
|        28| 20161229|           [7905970]|
|       124| 20161011|        [5400021771]|
|      6963| 20160101|         [103825645]|
|      6963| 20160104|[3000014912, 6626...|
|      6963| 20160111|[99643224, 106032...|

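How can I add a new column product_cnt that holds the length of the products list? And how can I filter df to get only the rows whose products length meets a given condition? Thanks.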

Dav*_*yne 12

PySpark has a built-in function, size, that does what you want (see http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.size). To add it as a column, just call it in a select statement.

from pyspark.sql.functions import size

countdf = df.select('*',size('products').alias('product_cnt'))

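Filtering works exactly as @titiro89 describes. You can also use the size function directly inside the filter, which lets you skip adding the extra column (if you wish to do so), as follows.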

filterdf = df.filter(size('products')==given_products_length)


tit*_*o89 7

First question

How can I add a new column product_cnt that holds the length of the products list?

>>> a = [(12, 20161223, [2407, 5400021771]), (12, 20161226, [7320, 2407, 4344])]
>>> df = spark.createDataFrame(a, ["member_srl", "click_day", "products"])
>>> df.show()
+----------+---------+------------------+
|member_srl|click_day|          products|
+----------+---------+------------------+
|        12| 20161223|[2407, 5400021771]|
|        12| 20161226|[7320, 2407, 4344]|
+----------+---------+------------------+

You can find a similar example here:

>>> from pyspark.sql.types import IntegerType
>>> from pyspark.sql.functions import udf

>>> slen = udf(lambda s: len(s), IntegerType())

>>> df2 = df.withColumn("product_cnt", slen(df.products))
>>> df2.show()
+----------+---------+------------------+-----------+
|member_srl|click_day|          products|product_cnt|
+----------+---------+------------------+-----------+
|        12| 20161223|[2407, 5400021771]|          2|
|        12| 20161226|[7320, 2407, 4344]|          3|
+----------+---------+------------------+-----------+

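Second question

And how can I filter df to get only the rows whose products length meets a given condition?

You can use the filter function here (see the documentation):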

>>> givenLength = 2
>>> df3 = df2.filter(df2.product_cnt==givenLength)
>>> df3.show()
+----------+---------+------------------+-----------+
|member_srl|click_day|          products|product_cnt|
+----------+---------+------------------+-----------+
|        12| 20161223|[2407, 5400021771]|          2|
+----------+---------+------------------+-----------+