apache-spark apache-spark-sql pyspark azure-databricks
I am trying this in Databricks. Please let me know which PySpark libraries need to be imported, and the code, to get the following output in Azure Databricks with PySpark.
Example: Input DataFrame:
| column1 | column2 | column3 | column4 |
| --- | --- | --- | --- |
| a | bbbbb | cc | dddddddd |
| aaaaaaaaaaaaaa | bb | c | dddd |
| aa | bbbbbbbbbbbb | ccccccc | ddddd |
| aaaaa | bbbb | ccc | d |
Output DataFrame:
| column | maxLength |
| --- | --- |
| column1 | 14 |
| column2 | 12 |
| column3 | 7 |
| column4 | 8 |
>>> from pyspark.sql import functions as sf
>>> df = sc.parallelize([['a','bbbbb','ccc','ddd'],['aaaa','bbb','ccccccc', 'dddd']]).toDF(["column1", "column2", "column3", "column4"])
>>> df1 = df.select([sf.length(col).alias(col) for col in df.columns])
>>> df1.groupby().max().show()
+------------+------------+------------+------------+
|max(column1)|max(column2)|max(column3)|max(column4)|
+------------+------------+------------+------------+
| 4| 5| 7| 4|
+------------+------------+------------+------------+
Then use this link to melt (unpivot) the previous DataFrame into the (column, maxLength) shape; a rough sketch of that step is shown below.
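Since the original link is not reproduced here, this is a minimal sketch of that melt step, assuming the df1.groupby().max() result from the snippet above (the names max_df and melted are illustrative); it uses Spark SQL's stack() to turn the four max(...) columns into (column, maxLength) rows:

max_df = df1.groupby().max()  # single row: max(column1) ... max(column4)
melted = max_df.selectExpr(
    "stack(4, "
    "'column1', `max(column1)`, "
    "'column2', `max(column2)`, "
    "'column3', `max(column3)`, "
    "'column4', `max(column4)`) as (column, maxLength)"
)
melted.show()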
Edit: (from "iterate every column and find the max length")
As a single-line select:
from pyspark.sql.functions import col, length, max
# one row holding the max string length of every column, keeping the original column names
df = df.select([max(length(col(name))).alias(name) for name in df.schema.names])
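Applied to the sample DataFrame built in the first snippet above (not the question's table), this one-liner leaves df as a single row of per-column max lengths, roughly:

>>> df.show()
+-------+-------+-------+-------+
|column1|column2|column3|column4|
+-------+-------+-------+-------+
|      4|      5|      7|      4|
+-------+-------+-------+-------+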
As rows:
from pyspark.sql import Row
from pyspark.sql.functions import col, length, max
# collapse to one row of per-column max lengths, then rebuild it as one (col, length) row per column
df = df.select([max(length(col(name))).alias(name) for name in df.schema.names])
row = df.first().asDict()
df2 = spark.createDataFrame([Row(col=name, length=row[name]) for name in df.schema.names], ['col', 'length'])
Output:
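On the sample DataFrame from the first snippet (not the question's table), df2.show() would give roughly:

+-------+------+
|    col|length|
+-------+------+
|column1|     4|
|column2|     5|
|column3|     7|
|column4|     4|
+-------+------+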