her*_*err 15 regex csv bash sed
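I've been at this for a couple of hours now, cycling through a variety of tools to get the job done, without success. It would be fantastic if someone could help me out with this.
Here is the problem:
I have a very large CSV file (400 MB+) that is not formatted correctly. Right now it looks something like this: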
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
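As you can probably see, the titles ",Title1" and ",Title2" should actually be on the same line as the preceding sentence. It would then look like this: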
This is a long abstract describing something. What follows is the tile for this sentence.",Title1
This is another sentence that is running on one line. On the next line you can find the title.,Title2
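Please note that the end of a sentence can contain a quote. In the end, those quotes should be removed as well.
This is what I have come up with so far: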
sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv
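This should do the job of matching the expression across multiple lines. Unfortunately, it doesn't :)
The expression looks for the period at the end of the sentence, an optional quote, and a newline character, which I'm trying to match with .*.
Help is much appreciated. It doesn't matter which tool gets the job done (awk, perl, sed, tr, etc.).
Thanks, Chris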
Sie*_*geX 18
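Multiline input in sed isn't necessarily tricky per se; it's just that it uses commands most people aren't familiar with, and those commands have certain side effects. For example, when you use 'N' to append the next line to the pattern space, the current line is separated from the next one by a '\n'.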
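To make that embedded '\n' visible, the pattern space can be dumped with the `l` command, which prints it in an unambiguous form. This is only a small sketch, assuming GNU sed and a made-up two-line input:

```shell
# Join two input lines with N, then show the pattern space with l.
# l prints the embedded newline as \n and marks the end of the buffer with $.
pattern_space=$(printf 'first\n,second\n' | sed -n 'N;l')
printf '%s\n' "$pattern_space"
```

The single output line shows both input lines sitting in one buffer, which is why a regex such as `/\n,/` can test whether the second line starts with a comma.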
Anyway, it's easier to decide whether to remove the newline by matching on lines that start with a comma, which is what I've done here:
sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
$ cat title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
also, don't touch this line
$ sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence.,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.,Title2
also, don't touch this line
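Since the question says any tool will do, the same idea can also be sketched in awk. This is not part of the sed answer, just a rough equivalent under the same assumption that a title line always starts with a comma. `join_titles` is a made-up helper name; it holds each line back until it knows whether the next line is a title.

```shell
# join_titles: hypothetical helper that reads CSV text on stdin and
# writes the repaired text to stdout.
join_titles() {
    awk '
        # A line starting with a comma belongs to the held line:
        # strip a trailing quote and spaces from it, then append the title.
        /^,/ && held { sub(/"? *$/, "", prev); prev = prev $0; next }
        # Any other line: the held line is complete, so print it.
        held { print prev }
        # Hold the current line until the next one has been seen.
        { prev = $0; held = 1 }
        # Flush the last held line at end of input.
        END { if (held) print prev }
    '
}

printf 'keep\nAbstract."\n,Title1\nMore.\n,Title2\nlast\n' | join_titles
```

Like the `N;P;D` approach, this only ever keeps one or two lines in memory, so the size of the file should not matter.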
Pau*_*ce. 13
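Yours works with a few small changes:
sed -n '1h;1!H;${;g;s/"\?\n,/,/g;p;}' inputfile
The ? needs to be escaped (in a basic regular expression a bare ? is a literal question mark), and the newline has to be matched explicitly with \n. In GNU sed . can match an embedded newline, but .* is greedy and, with the whole file in the pattern space, would swallow everything up to the last comma. The replacement also has to put the comma back, so that only the optional quote and the newline are removed.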
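The difference between `?` and `\?` is easy to check on a tiny input. A minimal sketch, assuming GNU sed (where `\?` is an extension to basic regular expressions, and `-E` switches to extended syntax); the variable names are only for illustration:

```shell
# In a basic regular expression, ? is a literal question mark,
# so "? only matches a quote followed by an actual question mark.
literal=$(printf 'a"?b\n' | sed 's/"?/X/')

# With \? the preceding quote becomes optional.
optional=$(printf 'a"b\n' | sed 's/"\?b/X/')

# With -E (extended syntax) the unescaped ? is the optional operator.
extended=$(printf 'a"b\n' | sed -E 's/"?b/X/')

printf '%s\n%s\n%s\n' "$literal" "$optional" "$extended"
```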
Here's another way that doesn't need the hold space:
sed -n '${p;q};N;/\n,/{s/"\?\n//p;b};P;D' inputfile
Here's a commented version:
sed -n '
# on the last input line: print it and quit
${
    p
    q
}
# otherwise, append the next input line to the pattern space
N
# if the appended line starts with a comma
/\n,/{
    # delete an optional quote and the newline, then print the result
    s/"\?\n//p
    # branch to the end of the script to read the next line
    b
}
# the appended line does not start with a comma, so print the first line
P
# delete the first line of the pair (it has just been printed) and loop to the top
D
' inputfile
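Before running this on the full 400 MB file, it may be worth trying the one-liner on a few lines and looking at the result. A minimal sketch, assuming GNU sed; the sample text and the `sample` and `joined` variable names are made up for illustration:

```shell
# Hypothetical five-line sample: two records with the title on its own line,
# plus one ordinary line that must pass through unchanged.
sample='An abstract that ends in a quote."
,Title1
An abstract without a quote.
,Title2
an ordinary line'

# Run the one-liner from above on the sample and capture the result.
joined=$(printf '%s\n' "$sample" | sed -n '${p;q};N;/\n,/{s/"\?\n//p;b};P;D')
printf '%s\n' "$joined"
```

If the two title lines end up attached to their abstracts and the last line is untouched, the same command can be pointed at the real file with its output redirected to a new file.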