Using sed to extract multiple matches in a line without a known delimiter

Question


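I have a large text file containing probabilities embedded in sentences. I want to extract only those probabilities and the text preceding them. For example:

Input: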

not interesting
foo is 1 in 1,200 and test is 1 in 3.4 not interesting
something else is 1 in 2.5, things are 1 in 10
also not interesting


Desired output:

foo is 1/1,200
and test is 1/3.4
something else is 1/2.5,
things are 1/10


What I have so far:

$ sed -nr ':a s|(.*) 1 in ([0-9.,]+)|\1 1/\2\n|;tx;by; :x h;ba; :y g;/^$/d; p' input

foo is 1/1,200
 and test is 1/3.4
 not interesting
something else is 1/2.5,
 things are 1/10

something else is 1/2.5,
 things are 1/10


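This nifty code repeatedly splits the line at each match and tries to print a piece only if it contains a match. The problem with my code seems to be that the hold space is not cleared once a line is finished.

The general problem is that sed cannot do non-greedy matching, and my delimiter can be anything.

I suppose a solution in a different language would be fine, but now I am curious whether this is possible in sed.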

Answer 1

Ed *_*ton 5

sed is for simple substitutions on individual lines, that is all. For anything more interesting, just use awk:

$ cat tst.awk
{
    while ( match($0,/\s*([^0-9]+)([0-9]+)[^0-9]+([0-9,.]+)/,a) ) {
        print a[1] a[2] "/" a[3]
        $0 = substr($0,RSTART+RLENGTH)
    }
}
$ awk -f tst.awk file
foo is 1/1,200
and test is 1/3.4
something else is 1/2.5,
things are 1/10


The above uses GNU awk for the third argument to match() and for \s as shorthand for [[:space:]].
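Since the question allows other languages and notes that sed has no non-greedy matching, here is a minimal Python sketch of the same extraction loop for readers without GNU awk. It is not part of the answer above: the function name extract_odds and the exact pattern are assumptions made for illustration, and the pattern is deliberately stricter than the awk one in that it only accepts the literal "1 in".

```python
import re

# Assumed pattern (not from the original answer): optional leading whitespace,
# the shortest possible prefix that ends in a standalone "1", the literal
# " in ", then a number that may contain commas and dots.
ODDS = re.compile(r"\s*(.*?\b1) in ([0-9][0-9.,]*)")


def extract_odds(line):
    """Return every 'text 1/N' fragment found in one line, left to right."""
    # finditer resumes after each match, which plays the role of
    # `$0 = substr($0, RSTART+RLENGTH)` in the awk loop.
    return [f"{m.group(1)}/{m.group(2)}" for m in ODDS.finditer(line)]


sample = [
    "not interesting",
    "foo is 1 in 1,200 and test is 1 in 3.4 not interesting",
    "something else is 1 in 2.5, things are 1 in 10",
    "also not interesting",
]

for line in sample:
    for fragment in extract_odds(line):
        print(fragment)
```

The non-greedy `.*?` makes each match stop at the first "1 in" it can reach, so no delimiter between the fragments has to be known in advance.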

Answer 2

pot*_*ong 4

This might work for you (GNU sed):

sed -r 's/([0-9]) in ([0-9]\S*\s*)/\1\/\2\n/;/[0-9]\/[0-9]/P;D' file


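This replaces a digit, followed by a space, `in`, and a space, followed by a token that begins with a digit and any trailing whitespace, with the first digit, a `/`, the second token, and a newline. If the pattern space then contains a digit followed by a `/` followed by a digit, its first line is printed. That first line is then deleted, and the cycle repeats if anything else is left in the pattern space.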
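To make the substitute, print, delete cycle easier to follow, the sketch below emulates it in Python for a single input line. It is an approximation of how the `s`, `P`, and `D` commands interact, not GNU sed itself, and the name sed_cycle is hypothetical.

```python
import re

# Same two regular expressions as in the sed command above.
PAIR = re.compile(r"([0-9]) in ([0-9]\S*\s*)")
HAS_FRACTION = re.compile(r"[0-9]/[0-9]")


def sed_cycle(line):
    """Emulate `s/.../.../;/[0-9]\\/[0-9]/P;D` on one line (hypothetical helper)."""
    printed = []
    pattern_space = line
    while True:
        # s///: rewrite only the first "D in TOKEN" and append a newline to it.
        pattern_space = PAIR.sub(r"\1/\2\n", pattern_space, count=1)
        head, newline, rest = pattern_space.partition("\n")
        # P: if the pattern space contains a fraction, print up to the first newline.
        if HAS_FRACTION.search(pattern_space):
            printed.append(head)
        # D: with no newline left, the line is finished and discarded;
        # otherwise drop the first line and rerun the script on the rest.
        if not newline:
            break
        pattern_space = rest
    return printed


for piece in sed_cycle("foo is 1 in 1,200 and test is 1 in 3.4 not interesting"):
    print(piece)
```

Because `\s*` is inside the second capture group, each printed piece keeps the whitespace that followed the number, so the fragments may carry a trailing space.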
