ack :获取第 10 个(或更大的第 n 个)匹配/捕获组

bba*_*025 5 perl ack pattern-matching

我想我可能刚刚搜索错了,但我没有找到任何答案。如果有重复,请告诉我,我可以将其删除。

问题背景

我正在使用ack链接),它在引擎盖下有 Perl 5,来获取 n-gram - 特别是高阶 n-gram。我可以使用我知道的语法(基本上最多$9)获得最多 9 克,但我无法获得 10 克。使用$10只是给了我$1一个0之后。之类的东西$(10),并${10}没有解决问题。我对使用语言建模工具包的解决方案感兴趣,我想使用ack.

我使用的一个数据集是马克吐温的全集

( wget http://www.gutenberg.org/cache/epub/3200/pg3200.txt && mv pg3200.txt TWAIN_Mark_complete_orig.txt).

我已经解析干净了(请参阅文章末尾的解析注释)并将解析结果保存为TWAIN_Mark_complete_parsed.txt.

我从 2 克中得到了很好的结果,代码和部分结果是

time cat TWAIN_Mark_complete_parsed.txt | \
    ack '(\S+) +(?=(\S+) +)' \
    --output '$1 $2' | \
    sort | uniq -c | \
    sort -rn > Twain_2grams.txt
## `time` info not shown
$ head -n 2 Twain_2grams.txt
  18176 of the
  13288 in the
Run Code Online (Sandbox Code Playgroud)

一直到 9 克,与

time cat TWAIN_Mark_complete_parsed.txt | \
    ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
    --output '$1 $2 $3 $4 $5 $6 $7 $8 $9' | \
    sort | uniq -c | sort -rn > Twain_9grams.txt
## time info not shown
$ head -n 2 Twain_9grams.txt
     17 to mrs jane clemens and mrs moffett in st
     17 mrs jane clemens and mrs moffett in st louis
Run Code Online (Sandbox Code Playgroud)

(注意,我对ack命令进行元编程,而不仅仅是键入每个命令。)

问题/我的尝试

我第一次尝试 10 克,以及结果,是

time cat TWAIN_Mark_complete_parsed.txt | \
    ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
    --output '$1 $2 $3 $4 $5 $6 $7 $8 $9 $10' | \
    sort | uniq -c | sort -rn > Twain_10grams.txt

$ head -n 2 Twain_10grams.txt
     17 to mrs jane clemens and mrs moffett in st to0
     17 mrs jane clemens and mrs moffett in st louis mrs0
Run Code Online (Sandbox Code Playgroud)

为了更好地了解正在发生的事情,

diff -u <(head -n 2 Twain_10grams.txt) <(head -n 2 Twain_9grams.txt) |  颜色差异|  差异突出显示

参见 这个 SO 答案(和这个评论)有关如何通过逐字差异突出显示来获得彩色差异的详细信息。基本上aptyumcolordiff,然后pipdiff-highlight

使用$(10)而不是$10将输出的前两行作为

     17 to mrs jane clemens and mrs moffett in st $(10)
     17 mrs jane clemens and mrs moffett in st louis $(10)
Run Code Online (Sandbox Code Playgroud)

(两分钟后)。

使用${10}而不是$10将输出的前两行作为

     17 to mrs jane clemens and mrs moffett in st ${10}
     17 mrs jane clemens and mrs moffett in st louis ${10}
Run Code Online (Sandbox Code Playgroud)

我的想法就到此为止。

预期/期望输出

注意这里一个统计(非常非零和有限的)实际输出是从这里显示的不同的可能性。9-gram 的前两个结果不是不同的单词序列。通过查看前 10 个最常见的 9-gram 可以找到更常见的 10-gram 的其他可能部分 - 使用head代替head -n 2。即便如此,我相当肯定,即使这样也不能保证我们有两个最频繁的 10 克。但是,我希望我已经足够清楚地说明了我想要完成的任务。

17 to mrs jane clemens and mrs moffett in st louis
3 mrs jane clemens and mrs moffett in st louis honolulu

编辑我已经发现另一组将预期输出更改为(可能不是实际输出,而是将其从我之前使用的简单模型更改的一组。)

     17 to mrs jane clemens and mrs moffett in st louis
      7 happiness in his home had been wounded and bruised almost
Run Code Online (Sandbox Code Playgroud)

那将是head -n 2我一直用来显示我得到什么样的结果的。

我不想通过我将在这里使用的相同过程获得它。

$ grep -o "to mrs jane clemens and mrs moffett in st [^ ]\+" \
   TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     17 to mrs jane clemens and mrs moffett in st louis

$ grep -o "mrs jane clemens and mrs moffett in st louis [^ ]\+" \
   TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      3 mrs jane clemens and mrs moffett in st louis honolulu
      2 mrs jane clemens and mrs moffett in st louis san
      2 mrs jane clemens and mrs moffett in st louis no
      2 mrs jane clemens and mrs moffett in st louis 224
      1 mrs jane clemens and mrs moffett in st louis wash
      1 mrs jane clemens and mrs moffett in st louis wailuku
      1 mrs jane clemens and mrs moffett in st louis virginia
      1 mrs jane clemens and mrs moffett in st louis the
      1 mrs jane clemens and mrs moffett in st louis sept
      1 mrs jane clemens and mrs moffett in st louis on
      1 mrs jane clemens and mrs moffett in st louis hartford
      1 mrs jane clemens and mrs moffett in st louis carson
Run Code Online (Sandbox Code Playgroud)

编辑用于查找较新的第二位频率的代码是

$ grep -o "[^ ]\+ happiness in his home had been wounded and bruised" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      6 shelley's happiness in his home had been wounded and bruised
      1 his happiness in his home had been wounded and bruised
$ grep -o "shelley's happiness in his home had been wounded and [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      6 shelley's happiness in his home had been wounded and bruised
$ grep -o "happiness in his home had been wounded and bruised [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      7 happiness in his home had been wounded and bruised almost
$ grep -o "in his home had been wounded and bruised almost [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      7 in his home had been wounded and bruised almost to
$ grep -o "his home had been wounded and bruised almost to [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      7 his home had been wounded and bruised almost to death
$ grep -o "home had been wounded and bruised almost to death [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      1 home had been wounded and bruised almost to death thirdly
      1 home had been wounded and bruised almost to death secondly
      1 home had been wounded and bruised almost to death it
      1 home had been wounded and bruised almost to death fourthly
      1 home had been wounded and bruised almost to death first
      1 home had been wounded and bruised almost to death fifthly
      1 home had been wounded and bruised almost to death and
Run Code Online (Sandbox Code Playgroud)

从评论中编辑

@Inian 发表了很棒的评论

这记录在发行说明中 - github.com/beyondgrep/ack3/blob/dev/RELEASE-NOTES.md -您现在仅限于以下变量:$1 到 $9、$、$.、$&、$` , $' 和 $+_

对于未来的人,我正在放一个版本,今天存档RELEASE-NOTES

man对页面ack确实有行

$1 through $9
The subpattern from the corresponding set of capturing parentheses.
If your pattern is "(.+) and (.+)", and the string is "this and that',
then $1 is "this" and $2 is "that".

但我希望有办法获得更高的数字。根据来自 的信息RELEASE-NOTES,这种希望似乎已基本消失。

但是,我仍然想知道是否有人有解决方法或技巧,无论是使用ack还是使用任何更“标准”的 *NIX 类型终端工具。我的偏好依次是perlgrepawksed。如果有类似的东西ack(即只是命令行解析,而不是基于 NLP 工具包的解决方案),我也对此感兴趣。

我认为最好将此作为一个新问题提出。如果你在这里回答,太好了。如果我最终发布了一个新问题,我会将链接放在这里:目前,这只是指向同一问题的链接


解析笔记

为了让我的语料库准备好进行 n-gram 分析,这是我的解析。

     17 to mrs jane clemens and mrs moffett in st $(10)
     17 mrs jane clemens and mrs moffett in st louis $(10)
Run Code Online (Sandbox Code Playgroud)

是的,这一切都可以在一行上(并且没有尾随 && :),但是这样可以更容易阅读并解释我为什么要做我正在做的事情。


系统详情

     17 to mrs jane clemens and mrs moffett in st ${10}
     17 mrs jane clemens and mrs moffett in st louis ${10}
Run Code Online (Sandbox Code Playgroud)