python apache-spark pyspark
I have the source file below. It contains a name, "john", that I want to split into the list ['j','o','h','n']. The person record looks like this.
Source file:
id,name,class,start_data,end_date
1,john,xii,20170909,20210909
Code:
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("PersonProcessing").getOrCreate()
    # Read the CSV file, treating the first line as the header
    df = spark.read.csv('person.txt', header=True)
    # Collect the rows to the driver and pull out the name column
    nameList = [x['name'] for x in df.rdd.collect()]
    print(list(nameList))
    df.show()

if __name__ == '__main__':
    main()
Actual output:
[u'john']
Desired output:
['j','o','h','n']
If you want to do it in plain Python:
nameList = [c for x in df.rdd.collect() for c in x['name']]
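For a single row the same thing can be written with Python's built-in list(), which turns a string into a list of its characters. A minimal sketch, assuming the same df as in the question:

# Minimal sketch, assuming the df read above: collect() brings the rows to the
# driver, and list() splits each name string into its characters.
names = [list(row['name']) for row in df.collect()]
print(names[0])   # ['j', 'o', 'h', 'n']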
Or, if you want to do it in Spark:
from pyspark.sql import functions as F
df.withColumn('name', F.split(F.col('name'), '')).show()
Result:
+---+--------------+-----+----------+--------+
| id| name|class|start_data|end_date|
+---+--------------+-----+----------+--------+
| 1|[j, o, h, n, ]| xii| 20170909|20210909|
+---+--------------+-----+----------+--------+
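Note that splitting on the empty pattern leaves a trailing empty string in the array (the [j, o, h, n, ] above). A minimal sketch for dropping it, assuming Spark 2.4+ where array_remove is available:

from pyspark.sql import functions as F

# Hedged sketch: array_remove (Spark 2.4+) strips the empty strings produced by
# splitting on the empty pattern, leaving just [j, o, h, n].
df = df.withColumn('name', F.array_remove(F.split(F.col('name'), ''), ''))
df.show()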