How to remove double quotes when a value is empty in Spark?

Question

Yoa*_*ria 5 python csv dataframe pyspark

When I write my CSV to S3 with Spark's df.write.save() method, I want to remove the double quotes "" that appear wherever a value is empty.

Spark version: 2.4.0

Python version: 3.6.5

Here is my Python code to load the CSV file:

df = spark.read.load(
    path('in'),
    format = 'csv',
    delimiter = '|',
    encoding = 'utf-8',
    header = 'true'
)

The loaded CSV file:

|id|first_name|last_name|zip_code|
|1 |          |Elsner   |57315   |
|2 |Noelle    |         |        |
|3 |James     |Moser    |48256   |

Here is my Python code to write the CSV file:

df.write.save(
    path('out'),
    format = 'csv',
    delimiter = '|',
    header = 'true'
)

The written CSV file:

|id|first_name|last_name|zip_code|
|1 |""        |Elsner   |57315   |
|2 |Noelle    |""       |""      |
|3 |James     |Moser    |48256   |

How can I get rid of the double quotes when writing?

Thank you very much in advance.

Answer 1

har*_*ppu 5

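According to the Spark documentation, the default value of both nullValue and emptyValue is None, which results in an empty string being written. To make them write actually nothing, as you want, you can set them to the Unicode NULL character: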

df.write.save(
    path('out'),
    format = 'csv',
    delimiter = '|',
    header = True,
    nullValue = '\u0000',
    emptyValue = '\u0000'
)
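
To make the roles of the two options concrete, here is a toy model in plain Python (not Spark) of how a single field might be rendered on write. It assumes the Spark 2.4 writer defaults are an empty string for nullValue and a quoted empty string for emptyValue, which is my reading of the 2.4 migration notes rather than something I have verified on a cluster. The names render_field and render_row are hypothetical helpers made up for this illustration.

```python
# Toy model (plain Python, NOT Spark) of how one field is rendered on write.
# Assumption: Spark 2.4 writer defaults are nullValue -> '' and emptyValue -> '""'.

def render_field(value, null_value='', empty_value='""'):
    """Return the text written for a single field (hypothetical helper)."""
    if value is None:
        return null_value      # nulls are written as nullValue
    if value == '':
        return empty_value     # empty strings are written as emptyValue
    return str(value)


def render_row(row, delimiter='|', **options):
    """Join the rendered fields of one row with the delimiter."""
    return delimiter.join(render_field(v, **options) for v in row)


# Rows from the question, assuming the blank cells are empty strings.
rows = [
    ('1', '', 'Elsner', '57315'),
    ('2', 'Noelle', '', ''),
    ('3', 'James', 'Moser', '48256'),
]

# With the assumed defaults, empty strings come out as "".
default_lines = [render_row(r) for r in rows]

# Overriding emptyValue with an unquoted empty string drops the quotes.
unquoted_lines = [render_row(r, empty_value='') for r in rows]

for line in default_lines + unquoted_lines:
    print(line)
```

Under that assumption, the quotes in the question come from emptyValue rather than nullValue, so that is the option that has to change.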

Answer 2

小智 5

If you are looking for the PySpark way to do this, don't try the empty-string trick! It is much more straightforward (once you know the trick...):

myDF.coalesce(1).write\
    .option("emptyValue", None)\
    .option("nullValue", None)\
    .csv(outFile)

Hope this helps! I couldn't find it documented anywhere.

Answer 3

Psi*_*dom 3

You have empty strings in your data frame. If you want them written as empty fields, replace the empty strings with null and then set nullValue=None when saving:

(
    df.replace('', None)          # replace empty strings with null
    .write.save(
        path('out'),
        format='csv',
        delimiter='|',
        header=True,
        nullValue=None            # leave nullValue unset so nulls are written as empty fields
    )
)

It will be saved as:

id|first_name|last_name|zip_code
1||Elsner|57315
2|Noelle||
3|James|Moser|48256
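
If you want to double-check a written part file, a small plain-Python sketch like the one below can flag any remaining "" fields. find_quoted_empties is a hypothetical helper, not part of Spark, and it splits naively on the delimiter, so it assumes no field contains a quoted | character.

```python
def find_quoted_empties(lines, delimiter='|'):
    """Return (line_number, column_index) pairs for fields that are exactly "".

    Naive split: assumes the delimiter never appears inside a quoted field.
    """
    hits = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip('\r\n').split(delimiter)
        for col, field in enumerate(fields):
            if field == '""':
                hits.append((line_no, col))
    return hits


# Output as shown in the question (quoted empties present).
before = [
    'id|first_name|last_name|zip_code',
    '1|""|Elsner|57315',
    '2|Noelle|""|""',
    '3|James|Moser|48256',
]

# Output as shown in this answer (no quoted empties).
after = [
    'id|first_name|last_name|zip_code',
    '1||Elsner|57315',
    '2|Noelle||',
    '3|James|Moser|48256',
]

print(find_quoted_empties(before))
print(find_quoted_empties(after))
```

In practice you would pass it the lines of the part file, for example from open(...) on a local copy.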

Asked:	7 years, 3 months ago
Viewed:	6634 times
Last active:	6 years, 9 months ago