Advice or alternatives for using awk with associative arrays and duplicate records

Boy*_*oyd 3 awk

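I started with an associative array and 2 files in awk, and I can see my problem: if there are duplicate records, I only get 1 result. But I would have expected bulldog to overwrite terrier. Why am I getting this result?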

My file1.tsv is:

cat siamese
dog terrier
dog bulldog
snake python

file2.tsv is:

A barking dog never bites.
A cat has nine lives.
A bird in the hand is worth two in the bush.

My current script is:

FILE1="./file1.tsv"
FILE2="./file2.tsv"
awk '
  BEGIN {FS = OFS = "\t";}
  NR == FNR {kw[$1] = $2; next}
  {
    n = split(tolower($1), words, /[[:blank:]]|\.|,/)
    for (i = 1; i <= n; i++) {
      if (words[i] in kw && length(words[i]) > 2) print kw[words[i]], $1, "PHRASE"
      }
    }
' $FILE1 $FILE2 | sort -t $'\t' -k1,1 -k2  > test.tsv

The current output is:

siamese A cat has nine lives.   PHRASE
siamese A dog and a cat are friends.    PHRASE
terrier A barking dog never bites.  PHRASE
terrier A dog and a cat are friends.    PHRASE

But what I am looking for is this (my ordering may not be right, but both bulldog lines are missing):

siamese A cat has nine lives.   PHRASE
siamese A dog and a cat are friends.    PHRASE
bulldog A dog and a cat are friends.    PHRASE
bulldog A barking dog never bites.  PHRASE
terrier A barking dog never bites.  PHRASE
terrier A dog and a cat are friends.    PHRASE

Am I barking up the wrong tree with awk? How can I achieve this?

Thanks for any help here.

gle*_*man 7

This answer uses GNU awk specifically, for its arrays of arrays:

gawk '
  NR == FNR {animal[$1][$2]; next}
  {
    for (species in animal)
      if ($0 ~ species)
        for (type in animal[species])
          print type, $0, "PHRASE"
  }
' file1.tsv file2.tsv

which produces

bulldog A barking dog never bites. PHRASE
terrier A barking dog never bites. PHRASE
siamese A cat has nine lives. PHRASE
siamese A dog and a cat are friends. PHRASE
bulldog A dog and a cat are friends. PHRASE
terrier A dog and a cat are friends. PHRASE

given that file2.tsv contains

A barking dog never bites.
A cat has nine lives.
A dog and a cat are friends.

To match "dog" or "cat" only as whole words, we can make the regex more precise with word boundaries:

      if (tolower($0) ~ "\\<" species "\\>")

That way, you won't match lines where "dog" or "cat" appears only as part of a longer word.
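The `\<` and `\>` operators are GNU regex extensions, so other awk implementations may not accept them. A rough approximation for plain awk, assuming the keywords consist only of ASCII letters, is to require a non-letter (or the start or end of the line) on each side of the keyword. The second input line below is a made-up example of a line that should not match.

```shell
#!/bin/sh
# Hypothetical input: the second line contains "dog" only inside "Dogma".
out=$(printf 'A barking dog never bites.\nDogma is a category of belief.\n' |
  awk -v species="dog" '
    # (^|[^a-z]) and ([^a-z]|$) stand in for the GNU-only \< and \> anchors.
    tolower($0) ~ ("(^|[^a-z])" species "([^a-z]|$)")
  ')
printf '%s\n' "$out"
```

Only the first line is printed, because in the second line "dog" is immediately followed by another letter.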