awk date condition

Question

I'm trying to remove rows from person.csv (below) so that only people who were not born within the past year remain:

Dataset 1:

"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley","06 Dec 1945","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","08 Mar 2014","Actuary"
"2","aaaaaaa, bbbbbb","Grace","Huerta","21 Jan 2023","Visual merchandiser"

So the expected output looks like this (the last row is removed because its date of birth is less than a year ago):

"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley","06 Dec 1945","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","08 Mar 2014","Actuary"

I tried using awk to do the following:

awk -F , '{print $5 ....}' person.csv > output.csv

However, I couldn't figure out how to compare each row's date against (today minus 1 year).

Dataset 2: sometimes a double-quoted field may itself contain double quotes, e.g. (line 1, field 4):

"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley (aka "dud")","03 Oct 2023","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","03 Dec 2022","Actuary"
"2","aaaaaaa, bbbbbb","Grace","Huerta","21 Jan 2023","Visual merchandiser"

I'm also open to a sed solution if sed can do this. Any help is appreciated, thanks!

Answer 1

mar*_*uso 5

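Assumptions:

all columns/fields are enclosed in double quotes
double quotes do not appear as part of the data (otherwise we'd need something more than the basic -F'"' as the field delimiter)
OP's (OS) date supports the -d argument (e.g., if 'today' is 16 Sep 2023 then on OP's system date -d '-1 year' '+%Y%m%d' will generate 20220916)
since OP has mentioned the cutoff could be anything (e.g., -1 year, -7 days, etc.), we'll use the (OS) date to generate the cutoff date in YYYYMMDD format (otherwise we'd need to add more code to awk to handle conditions like '-1 year', '-7 days', etc.)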
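
As a quick check of the date assumption, here is a minimal sketch of how the cutoff could be generated for different conditions. It assumes GNU date; the helper name make_cutoff and its optional base-date argument are hypothetical, added only so the result can be reproduced on any day:

```shell
# Hypothetical helper: print a cutoff date in YYYYMMDD format.
#   $1 = relative offset understood by GNU date (e.g. '-1 year', '-7 days')
#   $2 = optional base date (defaults to today); handy for reproducible checks
make_cutoff() {
    date -d "${2:-today} $1" '+%Y%m%d'
}

cutoff=$(make_cutoff '-1 year')          # relative to the current date
echo "cutoff for '-1 year': ${cutoff}"

make_cutoff '-1 year' '2023-09-16'       # prints 20220916
make_cutoff '-7 days' '2023-09-16'       # prints 20230909
```

On systems without GNU date (e.g., BSD/macOS) the -d syntax differs, so the cutoff would have to be produced some other way or set by hand.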

One awk approach:

cutoff=$(date -d '-1 year' '+%Y%m%d')                             # change '-1 year' to the desired condition;
                                                                  # alternatively: manually set to the desired date (in YYYYMMDD format)

awk -v cutoff="${cutoff}" -F'"' '                                 # set awk variable "cutoff" to the value of the OS variable of the same name
                                                                  # field delimiter is double quotes; this means data fields are even-numbered (eg, 5th field is the 10th "-delimited field)
BEGIN { mlist="JanFebMarAprMayJunJulAugSepOctNovDec" }
NR>1  { split($10,a,/[[:space:]]+/)                               # split 5th data field on spaces; a[1]=day a[2]=month a[3]=year
        m=sprintf("%02d", ( (index(mlist,a[2])+2) /3) )           # convert 3-letter month to 2-digit month
        if ( a[3] m a[1] > cutoff) next                           # if new date is greater than the cutoff then skip to the next line of input
      } 
1                                                                 # print the current line
' person.csv

This generates:

"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley","06 Dec 1945","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","08 Mar 2014","Actuary"
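
The date conversion is the least obvious part of the script, so here it is in isolation. This is only an illustrative sketch: the wrapper name dob_to_ymd is hypothetical, and it splits on awk's default whitespace rather than a regex:

```shell
# Hypothetical helper: convert a "DD Mon YYYY" date into YYYYMMDD
dob_to_ymd() {
    awk -v dob="$1" 'BEGIN {
        mlist = "JanFebMarAprMayJunJulAugSepOctNovDec"
        split(dob, a, " ")                                   # a[1]=day a[2]=month a[3]=year
        # index() returns 1 for Jan, 4 for Feb, ... 34 for Dec; (pos + 2) / 3 maps that to 1..12
        m = sprintf("%02d", (index(mlist, a[2]) + 2) / 3)
        print a[3] m a[1]                                    # concatenate year, month, day
    }'
}

dob_to_ymd '06 Dec 1945'    # prints 19451206
dob_to_ymd '21 Jan 2023'    # prints 20230121
```

Because both the converted date and the cutoff are zero-padded 8-digit values, a plain comparison orders them chronologically, which is why the script can simply test whether the converted date is greater than cutoff.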

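From a performance perspective...

This answer requires a single OS call to date, and 1 file descriptor open/close (2 if the output is redirected to another file).

Gilles' answer requires an OS call to date for each line of input, plus the expensive overhead of opening/closing a file descriptor for each call to date.

Test run: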

100K line file          # per comment from OP
GNU awk 5.1.0
GNU date 8.32
Ubuntu 20.04
i7-1260P

This answer:

real    0m0.198s        <<< 546 times faster
user    0m0.198s
sys     0m0.000s

Gilles' answer:

real    1m48.229s       <<<
user    1m30.598s
sys     0m23.999s

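The output from both runs was saved to files; a diff of the two output files shows no differences (i.e., both answers generate the same result set).

In this case OP has stated that all fields are enclosed in double quotes.

In a scenario where some fields may not be enclosed in double quotes, we can use GNU awk's FPAT and still make just a single call to date, e.g.: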

cutoff=$(date -d '-1 year' '+%Y%m%d')

awk -v cutoff="${cutoff}" '
BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"
        mlist="JanFebMarAprMayJunJulAugSepOctNovDec"
      }
NR>1  { f5=$5
        gsub(/"/,"",f5)                                           # strip double quotes from 5th data field
        split(f5,a,/[[:space:]]+/)                                # change from 10th field to 5th field
        m=sprintf("%02d", ( (index(mlist,a[2])+2) /3) )
        if ( a[3] m a[1] > cutoff) next
      }
    1
' person.csv
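
To see how this FPAT value splits a record, the following sketch prints each field of a made-up line that mixes quoted and unquoted fields. It assumes GNU awk is installed as gawk (FPAT is a gawk extension); the helper name show_fields and the sample record are hypothetical:

```shell
# Hypothetical helper: print every field gawk finds when FPAT describes the fields
show_fields() {
    gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }           # a field is either non-commas or a quoted string
          { for (i = 1; i <= NF; i++) printf "%d=<%s>\n", i, $i }'
}

fields=$(printf '%s\n' '3,"aaaaaaa, bbbbbb",Grace,Huerta,"21 Jan 2023",Actuary' | show_fields)
printf '%s\n' "$fields"
```

The comma inside the quoted second field does not start a new field, and the quotes stay part of the field value, which is why the script strips them with gsub() before splitting the date.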

Using the same test criteria as above, the run time for this FPAT-based version:

real    0m0.861s        <<<
user    0m0.850s
sys     0m0.009s

Parsing the input based on FPAT (instead of -F'"') increases the run time by about 4x, but that is still much faster than 108 seconds.
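
Neither version above handles Dataset 2 from the question, where a quoted field contains extra double quotes. One possible workaround, offered only as an untimed sketch, is to keep -F'"' but address the date of birth relative to the end of the record as $(NF-3). This assumes stray quotes only ever appear in fields to the left of "Date of birth"; the function name filter_by_dob and the fixed cutoff are hypothetical:

```shell
# Hypothetical variant: find the date of birth by counting from the END of the record
filter_by_dob() {            # $1 = cutoff (YYYYMMDD), $2 = CSV file
    awk -v cutoff="$1" -F'"' '
    BEGIN  { mlist = "JanFebMarAprMayJunJulAugSepOctNovDec" }
    NR > 1 { split($(NF-3), a, " ")                          # date of birth sits 3 fields before the last one
             m = sprintf("%02d", (index(mlist, a[2]) + 2) / 3)
             if (a[3] m a[1] > cutoff) next                  # born after the cutoff: drop the row
           }
    1                                                        # print everything else, header included
    ' "$2"
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley (aka "dud")","03 Oct 2023","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","03 Dec 2022","Actuary"
"2","aaaaaaa, bbbbbb","Grace","Huerta","21 Jan 2023","Visual merchandiser"
EOF

result=$(filter_by_dob 20230101 "$sample")   # fixed cutoff so the result is reproducible
printf '%s\n' "$result"
rm -f "$sample"
```

With the cutoff fixed at 20230101, only the header and the Hanson row (03 Dec 2022) are printed. For rows without embedded quotes, $(NF-3) is the same field as $10, so Dataset 1 is handled the same way as before.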
