Parsing CSV with empty fields, escaped quotes, and commas using awk

Ben*_*fez 8 regex csv awk

I've been happily using gawk with FPAT. Here's the script I use for my examples:

#!/usr/bin/gawk -f

BEGIN {
    FPAT="([^,]*)|(\"[^\"]+\")"
}

{
    for (i=1; i<=NF; i++) {
        printf "Record #%s, field #%s: %s\n", NR, i, $i
    }
}

Simple, no quotes

Works fine.

$ echo 'a,b,c,d' | ./test.awk 
Record #1, field #1: a
Record #1, field #2: b
Record #1, field #3: c
Record #1, field #4: d

With quotes

Works fine.

$ echo '"a","b",c,d' | ./test.awk 
Record #1, field #1: "a"
Record #1, field #2: "b"
Record #1, field #3: c
Record #1, field #4: d

With an empty column and quotes

Works fine.

$ echo '"a","b",,d' | ./test.awk 
Record #1, field #1: "a"
Record #1, field #2: "b"
Record #1, field #3: 
Record #1, field #4: d

With escaped quotes, an empty column, and quotes

Works fine.

$ echo '"""a"": aaa","b",,d' | ./test.awk 
Record #1, field #1: """a"": aaa"
Record #1, field #2: "b"
Record #1, field #3: 
Record #1, field #4: d

With a column that contains escaped quotes and ends with a comma

Fails.

$ echo '"""a"": aaa,","b",,d' | ./test.awk 
Record #1, field #1: """a"": aaa
Record #1, field #2: ","
Record #1, field #3: b"
Record #1, field #4: 
Record #1, field #5: d

Expected output:

$ echo '"""a"": aaa,","b",,d' | ./test_that_would_be_working.awk 
Record #1, field #1: """a"": aaa,"
Record #1, field #2: "b"
Record #1, field #3: 
Record #1, field #4: d

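Is there a regex for FPAT that would make this work, or is this simply not supported by awk?

The pattern would have to be a " followed by anything except a single ". A regex character class only handles one character at a time, so it cannot match a "".

I think there may be a way to do it, but I haven't been able to make it work.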

Mar*_*chs 4

Because awk's FPAT doesn't support lookarounds, you need to be explicit in your pattern. This one will do:

FPAT="[^,\"]*|\"([^\"]|\"\")*\""

Explanation:

[^,\"]*             # match 0 or more characters other than , and "
|                   # OR
\"                  # match an opening '"'
  ([^\"]            #   followed by any single character except '"'
   |                #   OR
   \"\"             #   an escaped quote '""'
  )*                #   with that group repeated 0 or more times
\"                  # ending with a closing '"'
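
To see why the question's pattern splits the field early, it can help to compare what each pattern matches first on the failing line. This is only a minimal sketch using the standard `match()` function, so plain awk should be enough; `compare_first_field`, `old_pat` and `new_pat` are hypothetical names chosen for the illustration. It relies on awk regex matching being leftmost-longest: at the start of the line the question's `[^,]*` branch stops at the first comma, while the quoted branch of the pattern above runs to the closing quote.

```shell
#!/bin/sh
# compare_first_field is a hypothetical helper name, not part of awk.
# For each input line it prints the first match found by each pattern.
compare_first_field() {
    awk '
    BEGIN {
        old_pat = "([^,]*)|(\"[^\"]+\")"        # FPAT from the question
        new_pat = "[^,\"]*|\"([^\"]|\"\")*\""   # FPAT from this answer
    }
    {
        # match() sets RSTART and RLENGTH to the leftmost-longest match
        match($0, old_pat)
        printf "old: %s\n", substr($0, RSTART, RLENGTH)
        match($0, new_pat)
        printf "new: %s\n", substr($0, RSTART, RLENGTH)
    }'
}

printf '%s\n' '"""a"": aaa,","b",,d' | compare_first_field
```

On the failing line this should print `old: """a"": aaa` followed by `new: """a"": aaa,"`, that is, only the second pattern keeps the trailing comma inside the quoted field.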

Now test it:

$ cat tst.awk
BEGIN {
    FPAT="[^,\"]*|\"([^\"]|\"\")*\""
}
{ 
   for (i=1; i<=NF; i++){ printf "Record #%s, field #%s: %s\n", NR, i, $i }
}


$ echo '"""a"": aaa,","b",,d' | awk -f tst.awk
Record #1, field #1: """a"": aaa,"
Record #1, field #2: "b"
Record #1, field #3:
Record #1, field #4: d