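I am starting with an associative array and 2 input files for awk, and I can see my problem: if there are duplicate keys, I only get 1 result. I would have expected bulldog to overwrite terrier, though. Why am I getting this result?
My file1.tsv is: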
cat siamese
dog terrier
dog bulldog
snake python
file2.tsv is:
A barking dog never bites.
A cat has nine lives.
A bird in the hand is worth two in the bush.
My current script is:
FILE1="./file1.tsv"
FILE2="./file2.tsv"
awk '
    BEGIN {FS = OFS = "\t";}
    NR == FNR {kw[$1] = $2; next}
    {
        n = split(tolower($1), words, /[[:blank:]]|\.|,/)
        for (i = 1; i <= n; i++) {
            if (words[i] in kw && length(words[i]) > 2) print kw[words[i]], $1, "PHRASE"
        }
    }
' $FILE1 $FILE2 | sort -t $'\t' -k1,1 -k2 > test.tsv
The current output is:
siamese A cat has nine lives. PHRASE
siamese A dog and a cat are friends. PHRASE
terrier A barking dog never bites. PHRASE
terrier A dog and a cat are friends. PHRASE
But what I am looking for is this (my ordering may be off, but the two bulldog lines are missing):
siamese A cat has nine lives. PHRASE
siamese A dog and a cat are friends. PHRASE
bulldog A dog and a cat are friends. PHRASE
bulldog A barking dog never bites. PHRASE
terrier A barking dog never bites. PHRASE
terrier A dog and a cat are friends. PHRASE
Am I barking up the wrong tree with awk? How can I achieve this?
Thanks for any help here.
This answer relies specifically on GNU awk, for its arrays of arrays:
gawk '
    NR == FNR {animal[$1][$2]; next}
    {
        for (species in animal)
            if ($0 ~ species)
                for (type in animal[species])
                    print type, $0, "PHRASE"
    }
' file1.tsv file2.tsv
which produces
bulldog A barking dog never bites. PHRASE
terrier A barking dog never bites. PHRASE
siamese A cat has nine lives. PHRASE
siamese A dog and a cat are friends. PHRASE
bulldog A dog and a cat are friends. PHRASE
terrier A dog and a cat are friends. PHRASE
given that file2.tsv contains
A barking dog never bites.
A cat has nine lives.
A dog and a cat are friends.
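A plain associative array holds only one value per key, which is why one of the two dog breeds is lost in the original script. Where gawk is not available, the arrays-of-arrays idea can be approximated in POSIX awk by joining all breeds for an animal into one string and splitting it again at print time. The following is only a sketch under some assumptions: the function name `match_breeds` and the array name `breeds` are made up for illustration, file1.tsv is assumed to be tab-separated, and `index()` is used for a plain substring test instead of the regex match in the gawk version.

```shell
# Hypothetical helper (not part of the original answer): emulate
# arrays of arrays in POSIX awk with a SUBSEP-joined string per key.
match_breeds() {
    awk '
        BEGIN { FS = OFS = "\t" }
        # First file: append each breed to the list kept for its animal.
        NR == FNR {
            if ($1 in breeds)
                breeds[$1] = breeds[$1] SUBSEP $2
            else
                breeds[$1] = $2
            next
        }
        # Second file: for every animal named in the phrase,
        # print one line per stored breed.
        {
            for (animal in breeds)
                if (index(tolower($0), animal)) {
                    n = split(breeds[animal], list, SUBSEP)
                    for (i = 1; i <= n; i++)
                        print list[i], $0, "PHRASE"
                }
        }
    ' "$1" "$2"
}
```

It would be called as `match_breeds file1.tsv file2.tsv`, optionally piped through `sort` as in the question. The order of the `for (animal in breeds)` loop is unspecified, so the unsorted line order may differ between awk implementations.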
To match "dog" or "cat" only as whole words, we can make the regular expression more precise by using word boundaries:
if (tolower($0) ~ "\\<" species "\\>")
This way, you will not match lines in which "dog" or "cat" appears only as part of a longer word.
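The `\<` and `\>` operators are GNU extensions, so other awk implementations may not understand them. A rough portable equivalent is to pad the line with spaces and require a non-word character on each side of the keyword. This is a sketch only: the helper name `has_word` is hypothetical, the keyword is assumed to contain no regex metacharacters, and "word character" is assumed to mean ASCII letters, digits and underscore.

```shell
# Hypothetical helper: print the input lines that contain "$1" as a
# whole word, case-insensitively, without relying on \< and \>.
has_word() {
    awk -v w="$1" '
        # A non-word character must sit on both sides of the keyword.
        BEGIN { re = "[^A-Za-z0-9_]" w "[^A-Za-z0-9_]" }
        # Padding with spaces lets a keyword at either end of the line match.
        (" " tolower($0) " ") ~ re
    '
}
```

For example, `has_word dog < file2.tsv` keeps "A barking dog never bites." but would skip a line whose only occurrence is inside "dogma" or "dogs".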