sed - Removing quotes within quotes in a large CSV file

nol*_*nol 5 regex csv sed

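I'm using the stream editor sed to convert a large set of text-file data (400MB) into CSV format.

I'm very close to finishing, but the outstanding problem is quotes within quotes, on data like this: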

1,word1,"description for word1","another text",""text contains "double quotes" some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for "word3"","another text","more text and more"

The desired output is:

1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"

I've been searching for help, but I haven't gotten close to a solution. I've tried the following sed commands with these regex patterns:

sed -i 's/(?<!^\s*|,)""(?!,""|\s*$)//g' *.txt
sed -i 's/(?<=[^,])"(?=[^,])//g' *.txt

These came from the following questions, but they don't seem to work with sed:

Related question for perl

Related question for SISS

The original files are *.txt, and I'm trying to edit them with sed.
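
One likely reason the two patterns above fail is that sed's POSIX regular expressions have no lookbehind `(?<...)` or lookahead `(?...)`; those are Perl-style constructs. A rough way to approximate the second pattern in plain sed is to capture the character on each side of the quote and put both back, looping until nothing changes. The sketch below assumes that a quote touching a comma or the end of a line is always a field delimiter, which holds for the sample data but is not guaranteed in general. `strip_inner_quotes` is a hypothetical helper name.

```shell
# Sketch only: strip_inner_quotes is a name invented for this example.
# Removes every double quote that has a non-comma character on both sides.
# Assumption: quotes next to a comma or at end of line are field delimiters.
strip_inner_quotes() {
  # :a / ta loops until no substitution happens, because matches can
  # overlap (in a"b"c the first match consumes the b the second one needs).
  sed -e ':a' -e 's/\([^,]\)"\([^,]\)/\1\2/g' -e 'ta' "$@"
}

printf '%s\n' '3,word3,"description for "word3"","another text","more text and more"' |
  strip_inner_quotes
```

The function reads standard input or any files passed as arguments and writes to standard output, so the original *.txt files stay untouched until the result has been checked.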

Ste*_*eve 2

Here's one way using GNU awk's FPAT variable:

gawk 'BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"; OFS=","; N="\"" } { for (i=1;i<=NF;i++) if ($i ~ /^\".*\"$/) { gsub(/\"/,"", $i); $i=N $i N } }1' file

Results:

1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"

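Explanation:

With FPAT, a field is defined as either "anything that is not a comma" or "a double quote, anything that is not a double quote, and a closing double quote". Then, for each line of input, the code loops over every field, and if a field starts and ends with a double quote, it removes all quotes from that field. Finally, it adds double quotes back around the field.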
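
To see how that FPAT definition carves up a line before any quotes are removed, it can help to print each field on its own line. This is only an illustrative sketch: `show_fields` is a hypothetical helper name, and it assumes GNU awk 4.0 or later is installed as `gawk`, since FPAT is a gawk extension.

```shell
# Sketch only: show_fields is a name invented for this example.
# Requires GNU awk (FPAT is not available in other awk implementations).
show_fields() {
  gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
        # Print each field with its index so the split is visible.
        { for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i }' "$@"
}

printf '%s\n' '2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"' |
  show_fields
```

On this line the last field should come out as one piece, commas included, because the quoted alternative gives the longer match. A field that contains both embedded quotes and commas would presumably still be split incorrectly, so the approach depends on the data not mixing the two.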