Split a string into an array of characters in Spark

Zyg*_*ygD 1 arrays split apache-spark apache-spark-sql pyspark

How can I split a string column into an array of individual characters?

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame([('Vilnius',), ('Riga',), ('Tallinn',), ('New York',)], ['col_cities'])
df.show()
# +----------+
# |col_cities|
# +----------+
# |   Vilnius|
# |      Riga|
# |   Tallinn|
# |  New York|
# +----------+

Desired output:

# +----------+------------------------+
# |col_cities|split                   |
# +----------+------------------------+
# |Vilnius   |[V, i, l, n, i, u, s]   |
# |Riga      |[R, i, g, a]            |
# |Tallinn   |[T, a, l, l, i, n, n]   |
# |New York  |[N, e, w,  , Y, o, r, k]|
# +----------+------------------------+

Shu*_*rma 5

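You can use split with a regex pattern containing a negative lookahead. The pattern (?!$) matches the empty position before every character but not the end of the string, so the string is cut between each pair of characters and no trailing empty element is produced: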

# Split at every position that is not the end of the string
df = df.withColumn('split', F.split('col_cities', '(?!$)'))
df.show(truncate=False)
# +----------+------------------------+
# |col_cities|split                   |
# +----------+------------------------+
# |Vilnius   |[V, i, l, n, i, u, s]   |
# |Riga      |[R, i, g, a]            |
# |Tallinn   |[T, a, l, l, i, n, n]   |
# |New York  |[N, e, w,  , Y, o, r, k]|
# +----------+------------------------+
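
To see why the lookahead pattern works, it may help to look at where (?!$) actually matches. The following is a minimal plain-Python sketch, not Spark code: the helper names split_positions and split_chars are hypothetical and exist only for illustration. It lists the zero-width match positions and then cuts the string at those positions. Note that Spark's split follows Java regex semantics, where a zero-width match at the start of the input does not produce a leading empty string, whereas Python's own re.split (3.7+) keeps that empty string, so the sketch cuts the string manually rather than calling re.split.

```python
import re


def split_positions(text, pattern=r'(?!$)'):
    """Return the indices at which the zero-width pattern matches."""
    return [m.start() for m in re.finditer(pattern, text)]


def split_chars(text):
    """Cut text at every matched position (illustrative helper, not a Spark API)."""
    # Add the end of the string as the final cut point.
    cuts = split_positions(text) + [len(text)]
    # Each pair of consecutive cut points delimits one character.
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


print(split_positions('Riga'))   # [0, 1, 2, 3]
print(split_chars('New York'))   # ['N', 'e', 'w', ' ', 'Y', 'o', 'r', 'k']
```

The pattern matches at every index except the last one (the end of the string), which is why the resulting array has one element per character and no empty element at the end.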