Reading an array of strings from a CSV as an array in PySpark

Question


Har*_*pta 3 apache-spark apache-spark-sql pyspark

I have a CSV file containing data like this:

ID|Arr_of_Str
 1|["ABC DEF"]
 2|["PQR", "ABC DEF"]


I want to read this .csv file, but when I use sqlContext.read.load, the Arr_of_Str column is read as a string.

Current:

df.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- Arr_of_Str: string (nullable = true)


Expected:

df.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- Arr_of_Str: array (nullable = true)
 |    |-- element: string (containsNull = true)


How can I convert the string into an array of strings?

Answer 1

bla*_*hop 5

Update:

Actually, you can simply use from_json to parse the Arr_of_Str column into an array of strings:

from pyspark.sql import functions as F

df2 = df.withColumn(
    "Arr_of_Str",
    F.from_json(F.col("Arr_of_Str"), "array<string>")
)

df2.show(truncate=False)

#+---+--------------+
#|ID |Arr_of_Str    |
#+---+--------------+
#| 1 |[ABC DEF]     |
#| 2 |[PQR, ABC DEF]|
#+---+--------------+
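To make the per-row behaviour concrete, here is a minimal sketch of the same parsing in plain Python, plus a UDF wrapper for cases where from_json is not an option or malformed cells need custom handling. The names parse_str_array and with_parsed_array are hypothetical, not part of PySpark, and the wrapper imports pyspark lazily so the parsing helper can be tried without a Spark session. A Python UDF is usually slower than the built-in from_json, so treat this as a fallback.

```python
import json
from typing import List, Optional


def parse_str_array(cell: Optional[str]) -> Optional[List[Optional[str]]]:
    """Parse a JSON array literal such as '["PQR", "ABC DEF"]' into a list.

    Returns None for missing or malformed input (an assumption chosen to
    resemble from_json, which yields null for a string it cannot parse).
    """
    if cell is None:
        return None
    try:
        value = json.loads(cell)
    except json.JSONDecodeError:
        return None
    if not isinstance(value, list):
        return None
    # Keep JSON nulls as None; turn every other element into a string.
    return [None if item is None else str(item) for item in value]


def with_parsed_array(df, column="Arr_of_Str"):
    """Hypothetical helper: apply parse_str_array to one column as a UDF."""
    # Imported here so the function above works without pyspark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    parse_udf = F.udf(parse_str_array, ArrayType(StringType()))
    return df.withColumn(column, parse_udf(F.col(column)))
```

With a DataFrame like the one in the question, `with_parsed_array(df)` should give the same array<string> column as the from_json call above.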


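Old answer:

You can't do this while reading the data, because CSV does not support complex data structures. You have to do the conversion after loading the DataFrame.

Simply remove the array's square brackets from the string and split it on commas to get an array column. Note that this leaves the double quotes (and any space after a comma) inside each element, as the output below shows.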

from pyspark.sql.functions import col, split, regexp_replace

df2 = df.withColumn("Arr_of_Str", split(regexp_replace(col("Arr_of_Str"), '[\\[\\]]', ""), ","))

df2.show()

+---+-------------------+
| ID|         Arr_of_Str|
+---+-------------------+
|  1|        ["ABC DEF"]|
|  2|["PQR",  "ABC DEF"]|
+---+-------------------+

df2.printSchema()

root
 |-- ID: string (nullable = true)
 |-- Arr_of_Str: array (nullable = true)
 |    |-- element: string (containsNull = true)
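The output above still carries the double quotes and the space after the comma inside each element. A possible variation, sketched here under the assumption that no element contains a comma, bracket, or double quote of its own, strips the quotes along with the brackets and splits on a comma followed by optional whitespace. The helper names split_array_literal and with_split_array are hypothetical; the plain-Python function only mirrors the regex logic so it can be checked without Spark.

```python
import re
from typing import List

# Character class matching '[', ']' and '"'.
BRACKETS_AND_QUOTES = r'[\[\]"]'
# A comma followed by any amount of whitespace.
COMMA_WITH_SPACES = r',\s*'


def split_array_literal(cell: str) -> List[str]:
    """Plain-Python mirror of the regexp_replace + split steps."""
    stripped = re.sub(BRACKETS_AND_QUOTES, "", cell)
    return re.split(COMMA_WITH_SPACES, stripped)


def with_split_array(df, column="Arr_of_Str"):
    """Hypothetical helper: the same two steps as Spark column expressions."""
    # Imported here so the function above works without pyspark installed.
    from pyspark.sql.functions import col, regexp_replace, split

    cleaned = regexp_replace(col(column), BRACKETS_AND_QUOTES, "")
    return df.withColumn(column, split(cleaned, COMMA_WITH_SPACES))
```

If elements can contain commas or quotes, this regex approach will split them incorrectly, and parsing the column as JSON is the safer choice.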


Archived:	6 years ago
Views:	2834
Last activity:	3 years, 11 months ago