I am trying to remove rows from person.csv (below) when the person was born within the last year:
Dataset 1:
"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley","06 Dec 1945","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","08 Mar 2014","Actuary"
"2","aaaaaaa, bbbbbb","Grace","Huerta","21 Jan 2023","Visual merchandiser"
So the expected output looks like this (the last row is removed because its date of birth is less than one year ago):
"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley","06 Dec 1945","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","08 Mar 2014","Actuary"
I tried using awk to do the following:
awk -F , '{print $5 ....}' person.csv > output.csv
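However, I could not figure out how to compare each row's date with (today minus 1 year).
Dataset 2: sometimes a double-quoted field may itself contain double quotes, for example (line 1, field 4):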
"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley (aka "dud")","03 Oct 2023","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","03 Dec 2022","Actuary"
"2","aaaaaaa, bbbbbb","Grace","Huerta","21 Jan 2023","Visual merchandiser"
If sed can do this, I am open to that too. Any help is appreciated, thanks!
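Assumptions:
- every field is wrapped in double quotes (so we can use -F'"' as the field delimiter)
- date supports the -d argument (e.g., if "today" is 16 Sep 2023, then on OP's system date -d '-1 year' '+%Y%m%d' generates 20220916)
- date generates the cutoff date in YYYYMMDD format (otherwise we would need to add more code to awk to handle the various conditions such as "-1 year", "-7 days", etc.)

One awk approach: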
cutoff=$(date -d '-1 year' '+%Y%m%d')    # change '-1 year' to the desired condition;
                                         # alternatively: manually set to the desired date (in YYYYMMDD format)

awk -v cutoff="${cutoff}" -F'"' '        # set awk variable "cutoff" to the value of the OS variable of the same name
                                         # field delimiter is double quotes; this means data fields are even-numbered (eg, 5th field is the 10th "-delimited field)
BEGIN { mlist="JanFebMarAprMayJunJulAugSepOctNovDec" }
NR>1  { split($10,a,/[[:space:]]+/)                      # split 5th data field on spaces; a[1]=day a[2]=month a[3]=year
        m=sprintf("%02d", ( (index(mlist,a[2])+2) /3) )  # convert 3-letter month to 2-digit month
        if ( a[3] m a[1] > cutoff) next                  # if new date is greater than the cutoff then skip to the next line of input
      }
1                                                        # print the current line
' person.csv
This generates:
"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley","06 Dec 1945","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","08 Mar 2014","Actuary"
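The month conversion works because each 3-letter month name starts at position 1, 4, 7, ... 34 within mlist, so (index + 2) / 3 yields 1 through 12. A minimal sketch for checking that mapping on its own; the helper name month_num is hypothetical and not part of the answer:

```shell
# month_num: hypothetical helper, prints the 2-digit month number for a 3-letter month name
month_num() {
    awk -v mon="$1" 'BEGIN {
        mlist = "JanFebMarAprMayJunJulAugSepOctNovDec"
        # index() returns the 1-based position of mon within mlist: Jan=1, Feb=4, ... Dec=34
        printf "%02d\n", (index(mlist, mon) + 2) / 3
    }'
}

month_num Jan    # prints 01
month_num Mar    # prints 03
month_num Dec    # prints 12
```

A name that is not found in mlist makes index() return 0, which ends up as month 00, so unexpected month spellings would need separate handling.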
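From a performance perspective...
- this answer requires a single OS call (date) and 1 file descriptor open/close (2 if the output is redirected to another file)
- Gilles' answer requires an OS call (date) for each line of input, plus the expensive overhead of opening/closing a file descriptor for each date call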
Test run:
100K line file # per comment from OP
GNU awk 5.1.0
GNU date 8.32
Ubuntu 20.04
i7-1260P
This answer:
real 0m0.198s <<< 546 times faster
user 0m0.198s
sys 0m0.000s
Gilles' answer:
real 1m48.229s <<<
user 1m30.598s
sys 0m23.999s
The output from both runs was saved to files; a diff of the two output files shows no differences (i.e., both answers generate the same result set).
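A comparison like this could be reproduced roughly as follows. This is only a sketch: the generator make_test_csv, its synthetic row contents, and the script and file names in the comments are all hypothetical, since the file actually used for the timings above is not shown in this answer.

```shell
# make_test_csv: hypothetical generator; $1 = number of data rows, CSV is written to stdout
make_test_csv() {
    awk -v n="$1" 'BEGIN {
        print "\"Index\",\"User Id\",\"First Name\",\"Last Name\",\"Date of birth\",\"Job Title\""
        split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", mon, " ")
        for (i = 1; i <= n; i++)
            # synthetic values: day 01-28, rotating month, year 1950-2023
            printf "\"%d\",\"id%d\",\"First%d\",\"Last%d\",\"%02d %s %d\",\"Job%d\"\n",
                   i, i, i, i, (i % 28) + 1, mon[(i % 12) + 1], 1950 + (i % 74), i
    }'
}

# Possible usage (script names are placeholders for the two answers):
#   make_test_csv 100000 > person.csv
#   time ./this_answer.sh   > out1.csv
#   time ./other_answer.sh  > out2.csv
#   diff out1.csv out2.csv && echo "no differences"
```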
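In this case, OP has stated that all fields are wrapped in double quotes.
If some fields might not be wrapped in double quotes, we can use GNU awk's FPAT and still make only a single call to date, e.g.: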
cutoff=$(date -d '-1 year' '+%Y%m%d')

awk -v cutoff="${cutoff}" '
BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"
        mlist="JanFebMarAprMayJunJulAugSepOctNovDec"
      }
NR>1  { f5=$5
        gsub(/"/,"",f5)                                  # strip double quotes from 5th data field
        split(f5,a,/[[:space:]]+/)                       # change from 10th field to 5th field
        m=sprintf("%02d", ( (index(mlist,a[2])+2) /3) )
        if ( a[3] m a[1] > cutoff) next
      }
1
' person.csv
Using the same test criteria as above, the run time for this answer:
real 0m0.861s <<<
user 0m0.850s
sys 0m0.009s
Parsing the input based on FPAT (instead of -F'"') increases the run time by about 4x, but that is still much faster than 108 seconds.
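With -F'"', the embedded double quotes shown in Dataset 2 of the question add extra fields to the affected row, so a fixed $10 would no longer be the date on that row. One possible workaround is to count fields from the end of the line instead. The sketch below assumes that the date of birth is always the second-to-last data field and that the last field (Job Title) never contains embedded double quotes; the wrapper name filter_recent is hypothetical.

```shell
# filter_recent: hypothetical wrapper; $1 = cutoff in YYYYMMDD format, $2 = CSV file
filter_recent() {
    awk -v cutoff="$1" -F'"' '
    BEGIN { mlist = "JanFebMarAprMayJunJulAugSepOctNovDec" }
    NR>1  { split($(NF-3), a, " ")      # $(NF-3) is the date of birth, counted from the end of the line
            m = sprintf("%02d", (index(mlist, a[2]) + 2) / 3)
            if ((a[3] m a[1]) > cutoff) next
          }
    1
    ' "$2"
}

# Possible usage, with the same GNU date call as in the answer:
#   filter_recent "$(date -d '-1 year' '+%Y%m%d')" person.csv
```

Because every line ends with a closing double quote, the last "-delimited field is empty, which puts Job Title at $(NF-1) and the date of birth at $(NF-3) regardless of how many extra quotes appear earlier in the line.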