Nas*_*din 6 python filter apache-spark spark-dataframe
I created a DataFrame with the following schema:
In [43]: yelp_df.printSchema()
root
|-- business_id: string (nullable = true)
|-- cool: integer (nullable = true)
|-- date: string (nullable = true)
|-- funny: integer (nullable = true)
|-- id: string (nullable = true)
|-- stars: integer (nullable = true)
|-- text: string (nullable = true)
|-- type: string (nullable = true)
|-- useful: integer (nullable = true)
|-- user_id: string (nullable = true)
|-- name: string (nullable = true)
|-- full_address: string (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- neighborhoods: string (nullable = true)
|-- open: boolean (nullable = true)
|-- review_count: integer (nullable = true)
|-- state: string (nullable = true)
Now I want to select only the records where the "open" column is true. As the sample below shows, many of them are open.
business_id cool date funny id stars text type useful user_id name full_address latitude longitude neighborhoods open review_count state
9yKzy9PApeiPPOUJE... 2 2011-01-26 0 fWKvX83p0-ka4JS3d... 4 My wife took me h... business 5 rLtl8ZkDX5vH5nAx9... Morning Glory Cafe 6106 S 32nd St Ph... 33.3907928467 -112.012504578 [] true 116 AZ
ZRJwVLyzEJq1VAihD... 0 2011-07-27 0 IjZ33sJrzXqU-0X6U... 4 I have no idea wh... business 0 0a2KyEL0d3Yb1V6ai... Spinato's Pizzeria 4848 E Chandler B... 33.305606842 -111.978759766 [] true 102 AZ
6oRAC4uyJCsJl1X0W... 0 2012-06-14 0 IESLBzqUCLdSzSqm0... 4 love the gyro pla... business 1 0hT2KtfLiobPvh6cD... Haji-Baba 1513 E Apache Bl... 33.4143447876 -111.913032532 [] true 265 AZ
_1QQZuf4zZOyFCvXc... 1 2010-05-27 0 G-WvGaISbqqaMHlNn... 4 Rosie, Dakota, an... business 2 uZetl9T0NcROGOyFf... Chaparral Dog Park 5401 N Hayden Rd ... 33.5229454041 -111.90788269 [] true 88 AZ
6ozycU1RpktNG2-1B... 0 2012-01-05 0 1uJFq2r5QfJG_6ExM... 4 General Manager S... business 0 vYmM4KTsC8ZfQBg-j... Discount Tire 1357 S Power Road... 33.3910255432 -111.68447876 [] true 5 AZ
However, the following command, which I ran in pyspark, returns nothing:
yelp_df.filter(yelp_df["open"] == "true").collect()
What is the right way to do this?
X_T*_*ust 16
from pyspark.sql import functions as F
filtered_df = df.filter(F.col('my_bool_col'))
Aks*_*jan 12
You're comparing mismatched data types. The open column is listed as a boolean, not a string, so yelp_df["open"] == "true" is incorrect: "true" is a string.
Instead, you want:
yelp_df.filter(yelp_df["open"] == True).collect()
This compares the value of open against the boolean True rather than the non-boolean string "true".
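For what it's worth, the same kind of mismatch can be reproduced in plain Python, without Spark. This is only an analogy: the rows are ordinary dicts, the third row is made up for illustration, and Spark's own column comparison and casting rules may behave differently depending on the version.

```python
# Plain-Python analogy (not Spark): rows as dicts with a boolean "open" field.
# The first two names come from the sample output in the question; the third
# row is hypothetical and exists only so that one record is not open.
rows = [
    {"name": "Morning Glory Cafe", "open": True},
    {"name": "Haji-Baba", "open": True},
    {"name": "Closed Example Cafe", "open": False},  # hypothetical row
]

# Comparing a boolean to the string "true" never matches in Python.
by_string = [r["name"] for r in rows if r["open"] == "true"]

# Comparing to the boolean True, or simply testing the value, matches open rows.
by_bool = [r["name"] for r in rows if r["open"] == True]  # noqa: E712
by_truthiness = [r["name"] for r in rows if r["open"]]

print(by_string)      # []
print(by_bool)        # ['Morning Glory Cafe', 'Haji-Baba']
print(by_truthiness)  # ['Morning Glory Cafe', 'Haji-Baba']
```

The last form is the plain-Python counterpart of passing the boolean column itself as the filter condition, as in the other answer.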