How do I extract text from an access log?

Zam*_*man 2 awk unix-text-processing

I'm very new to this. I'm trying to extract some text from an access log into a new file.
My log file looks like this:

111.111.111.111 - - [02/Jul/2021:18:35:19 +0000] "GET /api/items HTTP/2.0" 304 0 "https://example.com/some/text/call-log?roomNo=5003" "Mozilla etc etc etc etc"
111.111.111.111 - - [02/Jul/2021:20:35:19 +0000] "GET /api/items HTTP/2.0" 304 0 "https://example.com/some/text/resevation-log?roomNo=4003" "Mozilla etc etc etc etc"

I want to extract it into a new file in the following format:

02/Jul/2021:18:35:19 +0000, call-log, 5003
02/Jul/2021:20:35:19 +0000, resevation-log, 4003

So far I have managed to run this basic awk command:

awk '{print $4,$5,",",$11}' < /file.log

which gives me the following output:

[02/Jul/2021:18:35:19 +0000] , "https://example.com/some/text/call-log?roomNo=5003"

Ed *_*ton 7

$ cat tst.awk
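BEGIN {
    # Split on "[", "]" or a double quote, plus any surrounding whitespace
    FS="[[:space:]]*[][\"][[:space:]]*"
    OFS = ", "
}
{
    # $6 is the referer URL; break it on "/", "?" and "="
    n = split($6,f,"[/?=]")
    # $2 is the timestamp, f[n-2] the page name, f[n] the room number
    print $2, f[n-2], f[n]
}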

$ awk -f tst.awk file
02/Jul/2021:18:35:19 +0000, call-log, 5003
02/Jul/2021:20:35:19 +0000, resevation-log, 4003

The above works with any POSIX awk. Its field separator splits the input from the question into fields as follows:

$ cat tst.awk
BEGIN {
    FS="[[:space:]]*[][\"][[:space:]]*"
    OFS = ","
}
{
    print
    for (i=1; i<=NF; i++) {
        print "\t" i, "<" $i ">"
    }
    print "-----"
}

$ awk -f tst.awk file
111.111.111.111 - - [02/Jul/2021:18:35:19 +0000] "GET /api/items HTTP/2.0" 304 0 "https://example.com/some/text/call-log?roomNo=5003" "Mozilla etc etc etc etc"
        1,<111.111.111.111 - ->
        2,<02/Jul/2021:18:35:19 +0000>
        3,<>
        4,<GET /api/items HTTP/2.0>
        5,<304 0>
        6,<https://example.com/some/text/call-log?roomNo=5003>
        7,<>
        8,<Mozilla etc etc etc etc>
        9,<>
-----
111.111.111.111 - - [02/Jul/2021:20:35:19 +0000] "GET /api/items HTTP/2.0" 304 0 "https://example.com/some/text/resevation-log?roomNo=4003" "Mozilla etc etc etc etc"
        1,<111.111.111.111 - ->
        2,<02/Jul/2021:20:35:19 +0000>
        3,<>
        4,<GET /api/items HTTP/2.0>
        5,<304 0>
        6,<https://example.com/some/text/resevation-log?roomNo=4003>
        7,<>
        8,<Mozilla etc etc etc etc>
        9,<>
-----
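To see why f[n-2] and f[n] end up being the page name and the room number, the second-level split of the referer URL (field 6) can be reproduced on its own. This is only a small sketch: the shell function name split_url is made up for illustration and is not part of the answer's script.

```shell
# split_url is a hypothetical helper name, not part of the original answer.
split_url() {
    printf '%s\n' "$1" | awk '{
        # Same separator set as in tst.awk: "/", "?" and "=".
        n = split($0, f, "[/?=]")
        for (i = 1; i <= n; i++) print i, "<" f[i] ">"
    }'
}

split_url 'https://example.com/some/text/call-log?roomNo=5003'
```

For that URL the split yields 8 pieces, so f[n-2] is piece 6 (call-log) and f[n] is piece 8 (5003).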

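The above will fail if any of the quoted fields can contain [, ], or an escaped ". None of those exist in your example, but if they can occur then include them in the example in your question.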
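If such characters can show up, one possible workaround is to stop relying on FS and pull the two pieces out with match() instead. This is a different technique from the script above, offered only as a sketch under stated assumptions: the timestamp is the first [...] group on each line, "?roomNo=" followed by digits appears only in the referer, and the function name extract_log is hypothetical.

```shell
# extract_log is a hypothetical helper name, not part of the original answer.
extract_log() {
    awk '
    {
        # Timestamp: text inside the first [...] group on the line
        # (assumes nothing before the timestamp contains "[").
        if (!match($0, /\[[^]]*\]/)) next
        ts = substr($0, RSTART + 1, RLENGTH - 2)

        # Page name and room number: "<page>?roomNo=<digits>", where <page>
        # is the run of characters between the last "/" and the "?".
        if (!match($0, "[^/?\" ]+[?]roomNo=[0-9]+")) next
        ref = substr($0, RSTART, RLENGTH)
        split(ref, p, "[?]roomNo=")

        # Lines lacking either pattern were skipped by the next statements.
        print ts ", " p[1] ", " p[2]
    }' "$@"
}
```

It could be run as extract_log file > newfile. Brackets or escaped quotes in the user agent no longer matter here, but a request path that itself contains "?roomNo=" would be picked up before the referer, so check that assumption against your real logs.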