How to count lines that contain one of two words but not both

Question

I need to count the lines of a text file (poem.txt) that contain the word the or the word an, but not the lines that contain both.

I tried using

grep -c the poem.txt | grep -c an poem.txt

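But this gives me 6, which is the wrong answer; the total for the and an should be 9 lines.

I do want to count the lines that contain the words, not the words themselves. Only the actual words should count, so the but not there, and an but not Pan.

Example file: poem.txt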

Where is the misty shark?
Where is she?
The small reef roughly fights the mast.
Where is the small gull?
Where is he?
The gull grows like a clear pirate.
Clouds fall like old mainlands.

She will Rise calmly like a dead pirate.
Eat an orange.
Warm, sunny sharks quietly pull a cold, old breeze.
All ships command rough, rainy sails.

Elvis Aaron Presley also known simply as the Elvis
He is also referred to as the King
The best-selling solo music artist of all time
He was the most commercially successful artist in many genres

He has many awards including a Grammy lifetime achievement
Elvis in the 1970s has numerous jumpsuits including an eagle one.

Further clarification: how many lines of the poem contain the or an, where lines that contain both the and an must not be counted.

the car is red - this counted
an apple is in the corner - not counted
hello i am big - not counted
where is an apple - counted

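So the output here should be 2.

Edit: I am not worried about case sensitivity.

Final edit: Thanks for your help. I managed to solve the problem. I used one of the answers and changed it a little. I used cat poem.txt | grep -Evi -e '\<an .* the\>' -e '\<the .* an\>' | grep -Eci -e '\<(an|the)\>', except that I changed the -c in the second grep to -n to get some additional information. Thanks again for all the help!! :)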

Answer 1

row*_*oat 10

perl -nE 'END {say $c+0} ++$c if /\bthe\b/i xor /\ban\b/i' file

gawk 'END {print c+0} /\<the\>/ != /\<an\>/ {++c}' IGNORECASE=1 file

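Comparing the results of matching each expression gives the result you want.

For example, the result of matching \<the\> is either 0 or 1. If the result of the other match is the same, then either both regexps were found or neither was, and the line is not counted. If the results differ, one match was found and the other was not, so the counter is incremented.

gawk has a built-in xor() function: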

gawk 'END {print c+0} xor(/\<the\>/,/\<an\>/) {++c}' IGNORECASE=1 file
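
For readers without gawk, the same comparison can be spelled out in plain shell. This is only a sketch of the idea above, not the answer's own code: the function name count_xor and the temporary sample file are made up for illustration, and it relies on grep's -w (whole word) option, which GNU and BSD grep provide but POSIX does not require.

```shell
# count_xor FILE (hypothetical helper): count the lines of FILE that contain
# exactly one of the whole words "the" / "an", ignoring case.
count_xor() {
    count=0
    while IFS= read -r line || [ -n "$line" ]; do
        has_the=0
        has_an=0
        # grep -q prints nothing; its exit status says whether the line matched.
        if printf '%s\n' "$line" | grep -qiw 'the'; then has_the=1; fi
        if printf '%s\n' "$line" | grep -qiw 'an'; then has_an=1; fi
        # The two flags differ only when exactly one of the words was found.
        if [ "$has_the" -ne "$has_an" ]; then count=$((count + 1)); fi
    done < "$1"
    printf '%s\n' "$count"
}

# Sample lines taken from the clarification in the question.
sample=$(mktemp)
printf '%s\n' 'the car is red' 'an apple is in the corner' \
    'hello i am big' 'where is an apple' > "$sample"
count_xor "$sample"
```

On the four-line sample from the question this should print 2. It starts two grep processes per input line, so it is far slower than the one-liners above on large files.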

Answer 2

the*_*lic 8

Using grep:

cat poem.txt \
  | grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' \
  | grep -Eci -e '\<(an|the)\>'

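This counts the number of matching lines. An alternative syntax that counts the total number of matches is given further down.

Breakdown:

The first grep command filters out all lines that contain both "an" and "the". The second grep command counts the remaining lines that contain "an" or "the".

If you remove the c from -Eci in the second grep, you will see the matching lines themselves instead of a count (with the matches highlighted if color output is enabled).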

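Details:

The -E option enables extended regular expression (ERE) syntax for grep.
The -i option tells grep to match case-insensitively.
The -v option tells grep to invert the result (i.e. match lines that do not contain the pattern).
The -c option tells grep to output the number of matching lines instead of the lines themselves.
Patterns:
1. \< matches the beginning of a word (thanks @glenn-jackman)
2. \> matches the end of a word (thanks @glenn-jackman)
--> this ensures we do not match words that merely contain 'the' or 'an' (e.g. 'pan')
3. grep -Evi -e '\<an\>.*\<the\>' therefore matches all lines that do not contain 'an ... the'
4. Likewise, grep -Evi -e '\<the\>.*\<an\>' matches all lines that do not contain 'the ... an'
5. grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' is the combination of 3. and 4.
6. grep -Eci -e '\<(an|the)\>' matches all lines that contain 'an' or 'the' as whole words and prints the number of matching lines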
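
To watch the two stages work separately, here is a small sketch that keeps the intermediate result in a variable and hands the file name to grep directly instead of going through cat. The temporary file and the variable names stage1 and result are just for illustration, and the \< and \> operators assume GNU grep (or another grep that understands them).

```shell
# Sample lines taken from the clarification in the question.
sample=$(mktemp)
printf '%s\n' 'the car is red' 'an apple is in the corner' \
    'hello i am big' 'where is an apple' > "$sample"

# Stage 1: drop every line that contains both words, in either order.
stage1=$(grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' "$sample")
printf '%s\n' "$stage1"

# Stage 2: of the surviving lines, count those that contain either word.
result=$(printf '%s\n' "$stage1" | grep -Eci -e '\<(an|the)\>')
printf '%s\n' "$result"
```

Stage 1 should leave three lines (only "an apple is in the corner" is dropped), and stage 2 should report 2.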

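Edit 1: use \< and \> instead of ( |^) and ( |$), as suggested by @glenn-jackman

Edit 2: to count the number of matches rather than the number of matching lines, use the following expression: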

cat poem.txt \
  | grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' \
  | grep -Eio -e '\<(an|the)\>' \
  | wc -l

This uses grep's -o option, which prints each match on a line of its own (and nothing else); wc -l then counts those lines.
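
The difference between the two counts only shows up when a line contains a word more than once. A small sketch, using three lines from the poem (the temporary file and variable names are made up for illustration, and GNU grep is assumed for \< and \>):

```shell
sample=$(mktemp)
printf '%s\n' 'The small reef roughly fights the mast.' 'Eat an orange.' \
    'Elvis in the 1970s has numerous jumpsuits including an eagle one.' > "$sample"

# Drop the lines that contain both words (here: the Elvis line).
filtered=$(grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' "$sample")

# -c counts matching lines; -o piped into wc -l counts individual matches.
line_count=$(printf '%s\n' "$filtered" | grep -Eci -e '\<(an|the)\>')
match_count=$(printf '%s\n' "$filtered" | grep -Eio -e '\<(an|the)\>' | wc -l | tr -d ' ')
echo "lines=$line_count matches=$match_count"
```

The first line contributes one line but two matches ("The" and "the"), so this should report lines=2 matches=3.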

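Also, why use cat instead of giving grep the file name? (6 upvotes)
I would suggest using `\<` instead of `(^| )` and `\>` instead of `( |$)` -- the angle brackets are [GNU extended regular expressions](https://www.gnu.org/software/gnulib/manual/html_node/egrep-regular-expression-syntax.html). After all, a non-word character may be something other than a space. (4 upvotes)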

Answer 3

Adm*_*Bee 5

The following GNU awk program should do the trick:

awk '(/(^|\W)[Tt]he(\W|$)/ && !/(^|\W)[Aa]n(\W|$)/) || (/(^|\W)[Aa]n(\W|$)/ && !/(^|\W)[Tt]he(\W|$)/) {c++} END{print c}' poem.txt

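This increments the counter c if

the line matches (^|\W)[Tt]he(\W|$) (the with a case-insensitive first letter, preceded by a non-word character (\W) or the start of the line (^), and followed by a non-word character (\W) or the end of the line ($)) but does not match (^|\W)[Aa]n(\W|$) (a standalone an with a case-insensitive first letter), or
the line matches (^|\W)[Aa]n(\W|$) but does not match (^|\W)[Tt]he(\W|$).

At the end, it prints the value of c.

Using \< and \> for "start of word" and "end of word", this can be written more briefly: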

awk '(/\<[Tt]he\>/ && !/\<[Aa]n\>/) || (/\<[Aa]n\>/ && !/\<[Tt]he\>/) {c++} END{print c}' poem.txt

Shorter still:

awk '/\<[Tt]he\>/ != /\<[Aa]n\>/ {c++} END{print c}' poem.txt

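This works because the inequality is true only when exactly one of an and the occurs on the line, not when both (or neither) do.

This approach requires GNU awk because \W and the \</\> constructs are GNU extensions to extended regular expression syntax (although \</\> are also understood by BSD regular expressions).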
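
If GNU awk is not available, one possible workaround is to avoid the word-boundary operators altogether and split each line into words. The sketch below is an assumption about how that could look, not part of the original answer; the function name count_xor_awk and the sample file are hypothetical, and it treats letters, digits and underscores as word characters to stay close to what \< and \> do.

```shell
# count_xor_awk FILE (hypothetical helper): same count, using only POSIX awk features.
count_xor_awk() {
    awk '
    {
        has_the = has_an = 0
        # Lower-case the line and split it on runs of non-word characters.
        n = split(tolower($0), words, /[^a-z0-9_]+/)
        for (i = 1; i <= n; i++) {
            if (words[i] == "the") has_the = 1
            if (words[i] == "an")  has_an = 1
        }
        # Count the line only when exactly one of the two words was seen.
        if (has_the != has_an) c++
    }
    END { print c + 0 }   # c + 0 prints 0 instead of an empty line
    ' "$1"
}

sample=$(mktemp)
printf '%s\n' 'the car is red' 'an apple is in the corner' \
    'hello i am big' 'where is an apple' 'there is Pan' > "$sample"
count_xor_awk "$sample"
```

On this sample it should print 2: the extra line "there is Pan" contains neither whole word and is ignored.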

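Note that the pipeline you showed in your own attempt will not work: calling grep with a file as an input argument overrides reading from stdin, so the first part of the pipeline is silently discarded and the output comes entirely from the last part (which looks for occurrences of an, including embedded ones).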
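
That effect is easy to check. In the sketch below (the temporary file and the variable names piped and alone are only for illustration) both counts should come out the same, because the second grep never reads from the pipe:

```shell
sample=$(mktemp)
printf '%s\n' 'the car is red' 'Pan ran' 'an apple' 'command' > "$sample"

# The attempted pipeline: the output of the first grep is never read.
piped=$(grep -c the "$sample" | grep -c an "$sample")

# The second grep on its own.
alone=$(grep -c an "$sample")

echo "piped=$piped alone=$alone"
```

Both values should be 3 here, since an also matches inside Pan, ran and command, whereas the count the question asks for would be 2.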
