Delete rows from a csv file based on a column value

rap*_*hui 4 awk text-processing csv

I have a csv file with 12 million lines, in the following format:

mcu_i,INIT,200,iFlash,  11593925,     88347,,0x00092684,r,0x4606b570,   ok,,         32,single,op-c,0,,         0,         0,         0,
mcu_i,INIT,200,iFlash,  11593931,     88348,,0x00092678,r,0x28003801,   ok,,         32,single,op-c,0,,         0,         0,         0,

I want to delete rows based on the value of the sixth column, using the following logic: if (value >= X AND value <= Y) => delete the row

I found a solution using gawk:

gawk -i inplace -F ',' -v s="$start_marker" -v e="$end_marker" '!($6 <= e && $6 >= s)' myfile.csv

But it takes too long, and I would like another solution with better performance.

Thanks

Rom*_*nov 5

One possible approach (by rewriting the command) is:

gawk  -F, -v s="$start_marker" -v e="$end_marker" '$6 > e || $6 < s'  myfile.csv >/tmp/newfile

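In awk, in-place operations are not recommended, as they have security implications. In addition, you may mess up the source file before you are 100% sure that the script is correct.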
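A minimal sketch of how the redirect-to-a-new-file approach can be finished off safely, replacing the original only when the filter succeeded. The shortened sample rows, the marker values and the `workdir` and `tmp` names are assumptions made for illustration; plain `awk` is used so the sketch runs with any POSIX awk, and `gawk` should behave the same way for this filter.

```shell
# Hypothetical sample data and marker values, for illustration only.
workdir=$(mktemp -d)
cat > "$workdir/myfile.csv" <<'EOF'
mcu_i,INIT,200,iFlash,  11593925,     88347,,0x00092684,r
mcu_i,INIT,200,iFlash,  11593931,     88348,,0x00092678,r
mcu_i,INIT,200,iFlash,  11593940,     88350,,0x00092670,r
EOF
start_marker=88348
end_marker=88349

# Write the surviving rows to a temporary file next to the original,
# and replace the original only if awk exited successfully.
tmp=$(mktemp "$workdir/myfile.csv.XXXXXX")
if awk -F, -v s="$start_marker" -v e="$end_marker" '$6 > e || $6 < s' \
       "$workdir/myfile.csv" > "$tmp"
then
    mv "$tmp" "$workdir/myfile.csv"
else
    rm -f "$tmp"
    echo "filter failed; myfile.csv left untouched" >&2
fi

# Show the result: the row whose sixth column is 88348 is gone.
cat "$workdir/myfile.csv"
```

Creating the temporary file in the same directory as the original keeps the final `mv` on one filesystem, so it is a rename rather than a copy.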


avi*_*iro 5

TL;DR

Redirecting gawk's standard output to /dev/null, or piping it to cat, will speed it up considerably and significantly reduce the run time.

gawk -i inplace [...] myfile.csv >/dev/null

Or:

gawk -i inplace [...] myfile.csv | cat

Diving in

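While @RomeoNinov's answer does run faster than your original command, I would like to explain why it is faster, and why my solution is just as fast even with -i inplace.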

If you look at the "Interactive Versus Noninteractive Buffering" section of the gawk info page, you will see:

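Interactive programs generally line buffer their output (i.e., they write out every line). Noninteractive programs wait until they have a full buffer, which may be many lines of output.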

This seems to be true even when gawk does not print its results to stdout, but writes them to a file "in place".
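To make the interactive versus noninteractive distinction concrete, here is a small shell sketch. It assumes that the distinction comes down to whether standard output is a terminal, which is what the strace runs in this answer suggest but which is not confirmed here against the gawk source. The function name `stdout_kind` is made up for illustration.

```shell
# stdout_kind is a hypothetical helper name, used only for this illustration.
# [ -t 1 ] is the standard shell test for "file descriptor 1 is a terminal".
stdout_kind() {
    if [ -t 1 ]; then
        echo "terminal"
    else
        echo "not a terminal"
    fi
}

stdout_kind              # the answer depends on how this script was started
stdout_kind | cat        # stdout is a pipe here, so this prints: not a terminal
stdout_kind > /dev/null  # stdout is /dev/null here, so nothing is shown
```

Under that assumption, appending `>/dev/null` or `| cat` to a command changes what the program sees on its standard output, even if it never writes anything there.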

Example

I have a file with 10 lines.

$ cat somefile
1
2
3
4
5
6
7
8
9
10

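By default (making no changes to the file, just printing all the lines back as they are), notice that strace shows gawk running 10 write system calls - one for each line in the original file.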

$ strace -e trace=write -c gawk -i inplace 1 somefile 
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000098           9        10           write
------ ----------- ----------- --------- --------- ----------------
100.00    0.000098           9        10           total

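That's because it runs interactively and the output is line-buffered (gawk prints every line as soon as it finishes processing it, even though the result is written to a file rather than to stdout).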

Now, if I redirect stdout to /dev/null (or just pipe the command to cat) to make this command noninteractive, strace shows that gawk makes only a single write system call. That's because it doesn't print every line immediately, but rather flushes the output only once the buffer is full.

$ strace -e trace=write -c gawk -i inplace 1 somefile > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000020          20         1           write
------ ----------- ----------- --------- --------- ----------------
100.00    0.000020          20         1           total

This adds up, of course: the bigger your input file is, the larger the difference between interactive and non-interactive runs will be.

Summary

Your command is slow because gawk in interactive mode writes every line to the file once it finishes processing it. This means it performs millions of writes to the file.

@RomeoNinov's solution is faster than your original command because instead of using inplace, it redirects the output to a temporary file. It therefore runs in non-interactive mode, which optimizes the buffer flushing and makes gawk perform fewer write operations to the file.

However, you can still use the command provided in your question, but just redirect its stdout to /dev/null (since it's empty anyway) or pipe it to cat, and it will run just as fast.
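Spelled out, that means the command from the question with only the redirect added. The sketch below runs it on a few hypothetical, shortened rows with made-up marker values (`demo_dir` is likewise an illustrative name). It assumes a gawk recent enough to ship the inplace extension and skips the run otherwise.

```shell
# Hypothetical sample data and marker values, for illustration only.
demo_dir=$(mktemp -d)
printf '%s\n' \
    'mcu_i,INIT,200,iFlash,  11593925,     88347,,0x00092684,r' \
    'mcu_i,INIT,200,iFlash,  11593931,     88348,,0x00092678,r' \
    'mcu_i,INIT,200,iFlash,  11593940,     88350,,0x00092670,r' \
    > "$demo_dir/myfile.csv"
start_marker=88348
end_marker=88349

if command -v gawk >/dev/null 2>&1; then
    # Same filter as in the question; the only change is the trailing
    # redirect, so gawk's standard output is not a terminal.
    gawk -i inplace -F ',' -v s="$start_marker" -v e="$end_marker" \
        '!($6 <= e && $6 >= s)' "$demo_dir/myfile.csv" >/dev/null
    cat "$demo_dir/myfile.csv"
else
    echo "gawk not found; skipping the in-place run" >&2
fi
```

On a file this small the speed difference is not measurable. The point is only that the redirect leaves the filtering result unchanged.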

Security implications of using gawk with inplace

While I don't fully agree with @RomeoNinov's comment that inplace operations might lead to unpredictable results, please note @OlivierDulac's comment, which links to a useful answer explaining why using -i inplace is usually considered a security vulnerability and how to work around it to run it in a safe manner.