das*_*s-g 5 apache-spark apache-spark-sql pyspark pyspark-sql databricks
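In a Python notebook on Databricks "Community Edition", I am experimenting with the City of San Francisco open data about emergency calls to 911 requesting firefighters. (An old 2016 copy of the data is used in "Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data" (YouTube) and is available on S3 for that tutorial.)
After mounting the data and reading it into a DataFrame, fire_service_calls_df, with an explicitly defined schema, I registered that DataFrame as a SQL table: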
sqlContext.registerDataFrameAsTable(fire_service_calls_df, "fireServiceCalls")
With that and the DataFrame API, I can count the distinct call types that occur:
fire_service_calls_df.select('CallType').distinct().count()
Out[n]: 34
...or with SQL in Python:
spark.sql("""
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
""").show()
+------------------------+
|count(DISTINCT CallType)|
+------------------------+
|                      33|
+------------------------+
...or with a SQL cell:
%sql
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
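Answering the question from the title:
Is Spark SQL unable to count correctly, or am I unable to write SQL correctly?
I am unable to write SQL correctly.
Rule <insert number> of writing SQL: think about NULL and UNDEFINED. count(DISTINCT CallType) skips rows where CallType is NULL, whereas SELECT DISTINCT keeps NULL as a row of its own, so counting the rows of the subquery gives the same result as the DataFrame query: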
%sql
SELECT count(*)
FROM (
SELECT DISTINCT CallType
FROM fireServiceCalls
)
34
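The difference between the two counts can be reproduced on a tiny table. The following is a minimal sketch, not the notebook's code: it assumes Python's built-in sqlite3 as a stand-in for Spark SQL (both skip NULL in count(DISTINCT ...) and keep it as one row in SELECT DISTINCT), and the sample rows are hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for Spark SQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fireServiceCalls (CallType TEXT)")

# Hypothetical sample rows: three named call types plus two rows without one.
rows = [("Alarms",), ("Medical Incident",), ("Alarms",),
        (None,), ("Structure Fire",), (None,)]
conn.executemany("INSERT INTO fireServiceCalls VALUES (?)", rows)

# count(DISTINCT col) ignores NULL values entirely.
count_distinct = conn.execute(
    "SELECT count(DISTINCT CallType) FROM fireServiceCalls"
).fetchone()[0]

# SELECT DISTINCT keeps NULL as one row, so counting its rows includes it.
distinct_rows = conn.execute(
    "SELECT count(*) FROM (SELECT DISTINCT CallType FROM fireServiceCalls)"
).fetchone()[0]

print(count_distinct)  # 3
print(distinct_rows)   # 4
```

The gap of exactly one between the two results mirrors the 33 versus 34 seen above: all NULL rows collapse into a single distinct row.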
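Also, I apparently can't read:
Paul suggested in the comments:
With only 30-something values, you could just sort and print all the distinct items to see where the difference is.
Well, I actually thought of that myself. (Minus the sorting.) Except there wasn't any difference: the output always contained 34 call types, whether I generated it with SQL or with a DataFrame query. I simply didn't notice that one of them was ominously named null: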
+--------------------------------------------+
|CallType                                    |
+--------------------------------------------+
|Elevator / Escalator Rescue                 |
|Marine Fire                                 |
|Aircraft Emergency                          |
|Confined Space / Structure Collapse         |
|Administrative                              |
|Alarms                                      |
|Odor (Strange / Unknown)                    |
|Lightning Strike (Investigation)            |
|null                                        |
|Citizen Assist / Service Call               |
|HazMat                                      |
|Watercraft in Distress                      |
|Explosion                                   |
|Oil Spill                                   |
|Vehicle Fire                                |
|Suspicious Package                          |
|Train / Rail Fire                           |
|Extrication / Entrapped (Machinery, Vehicle)|
|Other                                       |
|Transfer                                    |
|Outside Fire                                |
|Traffic Collision                           |
|Assist Police                               |
|Gas Leak (Natural and LP Gases)             |
|Water Rescue                                |
|Electrical Hazard                           |
|High Angle Rescue                           |
|Structure Fire                              |
|Industrial Accidents                        |
|Medical Incident                            |
|Mutual Aid / Assist Outside Agency          |
|Fuel Spill                                  |
|Smoke Investigation (Outside)               |
|Train / Rail Incident                       |
+--------------------------------------------+
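A null entry is easy to overlook in a listing like this, so one option is to make it stand out. The sketch below is only an illustration under stated assumptions: it again uses Python's built-in sqlite3 as a stand-in for Spark SQL, the helper name call_type_summary, the '<missing>' label, and the sample rows are all hypothetical, and coalesce is the only SQL function relied on.

```python
import sqlite3

def call_type_summary(conn):
    """Return (sorted distinct call types with NULL relabelled, NULL row count)."""
    # coalesce swaps NULL for a visible, hypothetical '<missing>' label.
    labelled = [row[0] for row in conn.execute(
        "SELECT DISTINCT coalesce(CallType, '<missing>') AS label "
        "FROM fireServiceCalls ORDER BY label"
    )]
    # IS NULL is required here; 'CallType = NULL' never matches any row.
    null_rows = conn.execute(
        "SELECT count(*) FROM fireServiceCalls WHERE CallType IS NULL"
    ).fetchone()[0]
    return labelled, null_rows

# Hypothetical sample data, not taken from the San Francisco dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fireServiceCalls (CallType TEXT)")
conn.executemany(
    "INSERT INTO fireServiceCalls VALUES (?)",
    [("Alarms",), (None,), ("Medical Incident",), (None,), ("Structure Fire",)],
)

labelled, null_rows = call_type_summary(conn)
print(labelled)   # ['<missing>', 'Alarms', 'Medical Incident', 'Structure Fire']
print(null_rows)  # 2
```

A non-zero NULL row count is a quick hint that count(DISTINCT CallType) will come out one lower than the number of distinct rows.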