How do I get only the page numbers of a pattern in a PDF file, whether or not the pattern is multiline?

Question

Tim*_*Tim 2 grep awk pdf text-processing pdfgrep

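I found the page numbers of a multiline pattern in a PDF file by following "How can I grep a multiline pattern in a pdf file and in a text file?" and "How can I search a string in a pdf file, and find the physical page number of each page where the string appears?"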

$ pdfgrep -Pn '(?s)image\s+?not\s+?available'  main_text.pdf 
49: image
   not
available
51: image
   not
available
53: image
   not
available
54: image
   not
available
55: image
   not
available


I want to extract only the page numbers, but because the pattern is multiline, I get

$ pdfgrep -Pn '(?s)image\s+?not\s+?available'  main_text.pdf | awk -F":" '{print $1}'
49
   not
available
51
   not
available
53
   not
available
54
   not
available
55
   not
available


instead of

49
51
53
54
55

How can I extract only the page numbers, regardless of whether the pattern is multiline? Thanks.
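
The awk call above prints the first colon-separated field of every line, and the continuation lines of a multiline match contain no colon, so they come through whole. One possible workaround, sketched here under the assumption that only the first line of each match starts with digits followed by a colon, is to filter on that prefix. The function name page_numbers is made up for illustration, and the printf sample stands in for real pdfgrep output so the snippet runs without a PDF.

```shell
# Hypothetical helper: keep only the "<page>:" prefix lines of pdfgrep -n output.
page_numbers() {
  # A line that begins with digits followed by ':' carries the page number.
  # Continuation lines of a multiline match lack that prefix and are skipped.
  awk -F: '/^[0-9]+:/ { print $1 }'
}

# Sample shaped like the output shown in the question (assumed, not real data).
printf '49: image\n   not\navailable\n51: image\n   not\navailable\n' | page_numbers

# With a real file the call would look like this:
# pdfgrep -Pn '(?s)image\s+?not\s+?available' main_text.pdf | page_numbers
```

A continuation line that happens to start with digits and a colon would be misread as a page number, so this is only as reliable as that assumption.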

Answer 1

ste*_*ver 5

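This is a bit hacky, but since you are already using Perl-compatible REs, you could use the \K "keep left" modifier. It resets the start of the reported match, so you can match everything in your expression (and anything else up to the next line end) but exclude it from the output: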

pdfgrep -Pn '(?s)image\s+?not\s+?available.*?$\K'  main_text.pdf


However, the output will still contain the : separator.
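
If the leftover separator is a problem, it can be trimmed in a follow-up step. This is a minimal sketch, assuming every output line has the form 49: (page number, then the separator). The name strip_separator is hypothetical, and the printf line stands in for the pdfgrep command so the snippet runs without a PDF.

```shell
# Hypothetical helper: drop the ':' separator and anything after it,
# then list each page number once, in numeric order.
strip_separator() {
  sed 's/:.*$//' | sort -un
}

# Sample lines in the assumed "<page>:" shape, with one repeated page.
printf '49:\n51:\n51:\n53:\n' | strip_separator

# With a real file the call would look like this:
# pdfgrep -Pn '(?s)image\s+?not\s+?available.*?$\K' main_text.pdf | strip_separator
```

The sort -un step is optional. It only matters if the same page is reported more than once.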
