awk date condition

zoo*_*mer 1 sed awk csv

I am trying to remove rows from person.csv (below) so that only people who were not born within the last year remain:

Dataset1

"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley","06 Dec 1945","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","08 Mar 2014","Actuary"
"2","aaaaaaa, bbbbbb","Grace","Huerta","21 Jan 2023","Visual merchandiser"

So the expected output looks like this (the last row is removed because its date is less than one year old):

"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley","06 Dec 1945","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","08 Mar 2014","Actuary"

I tried using awk as follows:

awk -F , '{print $5 ....}' person.csv > output.csv

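However, I can't figure out how to compare each row's date with (today minus 1 year).

Dataset2: sometimes there may be double quotes inside a double-quoted field, e.g. (line 1, field 4):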

"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley (aka "dud")","03 Oct 2023","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","03 Dec 2022","Actuary"
"2","aaaaaaa, bbbbbb","Grace","Huerta","21 Jan 2023","Visual merchandiser"

If sed can do this, I'm open to that as well. Any help is appreciated, thanks!

mar*_*uso 5

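Assumptions:

  • all columns/fields are wrapped in double quotes
  • double quotes do not appear as part of the data (otherwise we need something other than the basic -F'"' as the field delimiter)
  • OP's (OS) date supports the -d argument (e.g., if "today" is 16 Sep 2023 then date -d '-1 year' '+%Y%m%d' generates 20220916 on OP's system)
  • since OP mentioned the cutoff could be anything (e.g., -1 year, -7 days, etc.), we use the (OS) date to generate the cutoff date in YYYYMMDD format (otherwise we would need to add more awk code to handle conditions such as "-1 year", "-7 days", etc.)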
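
To make the cutoff arithmetic concrete, here is a small sketch that pins "today" to a fixed reference date so the result is reproducible. It assumes GNU date; the helper name cutoff_for is hypothetical and only used for illustration.

```shell
#!/bin/sh
# cutoff_for is a hypothetical helper, not part of the original answer.
# $1 = reference date (YYYY-MM-DD), $2 = relative offset such as "-1 year"
# Assumes GNU date (needs the -d option).
cutoff_for() {
    date -d "$1 $2" '+%Y%m%d'
}

cutoff_for 2023-09-16 '-1 year'    # prints 20220916
cutoff_for 2023-09-16 '-7 days'    # prints 20230909
```

Because YYYYMMDD strings are fixed-width and zero-padded, comparing them as plain strings orders them chronologically, which is what lets awk do the comparison without any date library.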

One awk approach:

cutoff=$(date -d '-1 year' '+%Y%m%d')                             # change '-1 year' to the desired condition;
                                                                  # alternatively: manually set to the desired date (in YYYYMMDD format)

awk -v cutoff="${cutoff}" -F'"' '                                 # set awk variable "cutoff" to the value of the OS variable of the same name
                                                                  # field delimiter is double quotes; this means data fields are even-numbered (eg, 5th field is the 10th "-delimited field)
BEGIN { mlist="JanFebMarAprMayJunJulAugSepOctNovDec" }
NR>1  { split($10,a,/[[:space:]]+/)                               # split 5th data field on spaces; a[1]=day a[2]=month a[3]=year
        m=sprintf("%02d", ( (index(mlist,a[2])+2) /3) )           # convert 3-letter month to 2-digit month
        if ( a[3] m a[1] > cutoff) next                           # if new date is greater than the cutoff then skip to the next line of input
      } 
1                                                                 # print the current line
' person.csv

This generates:

"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley","06 Dec 1945","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","08 Mar 2014","Actuary"
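
The month lookup in the script relies on every month name occupying exactly 3 characters in mlist, so index() returns 1, 4, 7, ... 34 and (index+2)/3 yields 1 through 12. A minimal sketch that isolates this step (the function name month_num is hypothetical):

```shell
#!/bin/sh
# month_num is a hypothetical helper that isolates the month conversion.
# $1 = 3-letter English month abbreviation, e.g. Sep
month_num() {
    awk -v mon="$1" 'BEGIN {
        mlist = "JanFebMarAprMayJunJulAugSepOctNovDec"
        # index() gives the 1-based start position of mon inside mlist
        printf "%02d\n", (index(mlist, mon) + 2) / 3
    }'
}

month_num Jan    # prints 01
month_num Sep    # prints 09
month_num Dec    # prints 12
```

An abbreviation that is not in mlist makes index() return 0, so the result is 00; such rows would need separate handling if the data is not clean.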

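From a performance perspective ...

This answer requires a single OS call to date and 1 file descriptor open/close (2 if the output is redirected to another file).

Gilles's answer requires an OS call to date for each line of input, plus the expensive overhead of opening/closing a file descriptor for each call to date.

Test run: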

100K line file          # per comment from OP
GNU awk 5.1.0
GNU date 8.32
Ubuntu 20.04
i7-1260P

This answer:

real    0m0.198s        <<< 546 times faster
user    0m0.198s
sys     0m0.000s

Gilles's answer:

real    1m48.229s       <<<
user    1m30.598s
sys     0m23.999s

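The output from both runs was saved to files; a diff of the two output files shows no differences (i.e., both answers generate the same result set).


In this case, OP has stated that all fields are wrapped in double quotes.

In cases where some fields may not be wrapped in double quotes, we can use GNU awk's FPAT and still make only a single call to date, e.g.: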

cutoff=$(date -d '-1 year' '+%Y%m%d')

awk -v cutoff="${cutoff}" '
BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"
        mlist="JanFebMarAprMayJunJulAugSepOctNovDec"
      }
NR>1  { f5=$5
        gsub(/"/,"",f5)                                           # strip double quotes from 5th data field
        split(f5,a,/[[:space:]]+/)                                # change from 10th field to 5th field
        m=sprintf("%02d", ( (index(mlist,a[2])+2) /3) )
        if ( a[3] m a[1] > cutoff) next
      }
    1
' person.csv

Using the same test criteria as above, the run time for this answer:

real    0m0.861s        <<<
user    0m0.850s
sys     0m0.009s

Parsing the input based on FPAT (instead of -F'"') increases the run time by about 4x, but it is still much faster than 108 seconds.
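
Dataset2 in the question breaks the "no double quotes in the data" assumption: an embedded double quote shifts the numbering of the "-delimited fields, so $10 no longer holds the date of birth. What follows is a sketch of one possible workaround, not a general CSV parser. It assumes embedded quotes only occur in fields before the date of birth (as in Dataset2), so the date can be addressed from the end of the line as $(NF-3). The function name filter_by_dob and the fixed cutoff 20230101 are illustrative only.

```shell
#!/bin/sh
# filter_by_dob is a hypothetical wrapper around the awk script above.
# $1 = cutoff in YYYYMMDD format; CSV is read from stdin.
# Assumption: no embedded double quotes in the last two data fields.
filter_by_dob() {
    awk -v cutoff="$1" -F'"' '
    BEGIN { mlist="JanFebMarAprMayJunJulAugSepOctNovDec" }
    NR>1  { split($(NF-3),a,/[[:space:]]+/)            # date of birth, counted from the end of the line
            m=sprintf("%02d", (index(mlist,a[2])+2)/3)
            if ( a[3] m a[1] > cutoff ) next           # born after the cutoff: drop the row
          }
    1
    '
}

# Fixed cutoff for a reproducible demo; in practice pass
# "$(date -d "-1 year" "+%Y%m%d")" and redirect person.csv to stdin.
filter_by_dob 20230101 <<'EOF'
"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley (aka "dud")","03 Oct 2023","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","03 Dec 2022","Actuary"
"2","aaaaaaa, bbbbbb","Grace","Huerta","21 Jan 2023","Visual merchandiser"
EOF
```

With the cutoff 20230101 this prints the header and the "Hanson" row only. If quotes can also appear in the job title, counting from the end stops working and a real CSV parser is the safer choice.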