Column is not iterable in PySpark

Question

tod*_*ysm 2 apache-spark apache-spark-sql pyspark spark-dataframe

So, we are a bit confused. In a Jupyter Notebook, we have the following DataFrame:

+--------------------+--------------+-------------+--------------------+--------+-------------------+ 
|          created_at|created_at_int|  screen_name|            hashtags|ht_count|     single_hashtag|
+--------------------+--------------+-------------+--------------------+--------+-------------------+
|2017-03-05 00:00:...|    1488672001|     texanraj|  [containers, cool]|       1|         containers|
|2017-03-05 00:00:...|    1488672001|     texanraj|  [containers, cool]|       1|               cool|
|2017-03-05 00:00:...|    1488672002|   hubskihose|[automation, future]|       1|         automation|
|2017-03-05 00:00:...|    1488672002|   hubskihose|[automation, future]|       1|             future|
|2017-03-05 00:00:...|    1488672002|    IBMDevOps|            [DevOps]|       1|             devops|
|2017-03-05 00:00:...|    1488672003|SoumitraKJana|[VoiceOfWipro, Cl...|       1|       voiceofwipro|
|2017-03-05 00:00:...|    1488672003|SoumitraKJana|[VoiceOfWipro, Cl...|       1|              cloud|
|2017-03-05 00:00:...|    1488672003|SoumitraKJana|[VoiceOfWipro, Cl...|       1|             leader|
|2017-03-05 00:00:...|    1488672003|SoumitraKJana|      [Cloud, Cloud]|       1|              cloud|
|2017-03-05 00:00:...|    1488672003|SoumitraKJana|      [Cloud, Cloud]|       1|              cloud|
|2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|       voiceofwipro|
|2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|              cloud|
|2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|managedfiletransfer|
|2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|         asaservice|
|2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|   interconnect2017|
|2017-03-05 00:00:...|    1488672004|SoumitraKJana|[VoiceOfWipro, Cl...|       1|                hmi|
|2017-03-05 00:00:...|    1488672005|SoumitraKJana|[Cloud, ManagedFi...|       1|              cloud|
|2017-03-05 00:00:...|    1488672005|SoumitraKJana|[Cloud, ManagedFi...|       1|managedfiletransfer|
|2017-03-05 00:00:...|    1488672005|SoumitraKJana|[Cloud, ManagedFi...|       1|         asaservice|
|2017-03-05 00:00:...|    1488672005|SoumitraKJana|[Cloud, ManagedFi...|       1|   interconnect2017|
+--------------------+--------------+-------------+--------------------+--------+-------------------+
only showing top 20 rows

root
 |-- created_at: timestamp (nullable = true)
 |-- created_at_int: integer (nullable = true)
 |-- screen_name: string (nullable = true)
 |-- hashtags: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ht_count: integer (nullable = true)
 |-- single_hashtag: string (nullable = true)

We are trying to get the hashtag count per hour. Our approach is to use a Window partitioned by single_hashtag, like this:

# create WindowSpec
hashtags_24_winspec = Window.partitionBy(hashtags_24.single_hashtag). \
            orderBy(hashtags_24.created_at_int).rangeBetween(-3600, 3600)
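
To make concrete what a range frame like this one asks for, here is a minimal plain-Python sketch of the semantics, not Spark code. The function name `range_window_sum` and the sample rows are made up for illustration; it assumes, as in SQL range frames, that both bounds are inclusive.

```python
from collections import defaultdict


def range_window_sum(rows, lower=-3600, upper=3600):
    """Hypothetical helper: for each (key, ts, value) row, sum the values of
    rows with the same key whose ts lies in [ts + lower, ts + upper]."""
    # Group rows by partition key, mirroring partitionBy(single_hashtag).
    by_key = defaultdict(list)
    for key, ts, value in rows:
        by_key[key].append((ts, value))

    result = []
    for key, ts, _ in rows:
        # Plain Python numbers here, so the built-in sum is the right tool.
        total = sum(v for t, v in by_key[key] if ts + lower <= t <= ts + upper)
        result.append((key, ts, total))
    return result


# Made-up sample rows: (single_hashtag, created_at_int, ht_count)
rows = [
    ("cloud", 1488672003, 1),
    ("cloud", 1488672004, 1),
    ("cloud", 1488679000, 1),  # more than an hour after the first two
    ("devops", 1488672002, 1),
]

windowed = range_window_sum(rows)
for row in windowed:
    print(row)
```

Note that bounds of -3600 and 3600 describe a two-hour span centred on each row. If the goal is a count over the preceding hour only, bounds of -3600 and 0 would probably be closer to it.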

However, when we try to sum ht_count using the following:

#sum_count_over_time = sum(hashtags_24.ht_count).over(hashtags_24_winspec)

We get the following error:

Column is not iterable
Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/column.py", line 240, in __iter__
    raise TypeError("Column is not iterable")
TypeError: Column is not iterable

The error message is not very helpful, and we are not sure exactly which column to investigate. Any ideas?

Answer 1

小智 5

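You are using the wrong sum. Python's built-in sum tries to iterate over its argument, which is what raises the error; import the Spark SQL function instead: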

from pyspark.sql.functions import sum

sum_count_over_time = sum(hashtags_24.ht_count).over(hashtags_24_winspec)
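
To see why the built-in produces this exact message, here is a small sketch that runs without Spark. `FakeColumn` is a hypothetical stand-in, not the real `pyspark.sql.Column`; it only copies the one behaviour visible in the traceback above, namely an `__iter__` that raises `TypeError`.

```python
import builtins


class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column (not the real class)."""

    def __init__(self, name):
        self.name = name

    def __iter__(self):
        # Mirrors the line shown in the question's traceback.
        raise TypeError("Column is not iterable")


def try_builtin_sum(column):
    """Call Python's built-in sum and return the error message, if any."""
    try:
        builtins.sum(column)  # the built-in starts by iterating its argument
    except TypeError as exc:
        return str(exc)
    return None


message = try_builtin_sum(FakeColumn("ht_count"))
print(message)
```

The error therefore says nothing about a particular column being wrong; it only shows that the name `sum` resolved to the Python built-in instead of the Spark SQL function.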

In practice, you will probably want an alias or a module import, so that Python's built-in sum is not shadowed:

from pyspark.sql.functions import sum as sql_sum

# or

from pyspark.sql import functions as F
sum_count_over_time = F.sum(hashtags_24.ht_count).over(hashtags_24_winspec)