如何处理由bash脚本读取的CSV文件中的逗号

chr*_*ney 8 csv bash scripting

我正在创建一个bash脚本来从CSV文件生成一些输出(我有超过1000个条目,并不想手工做它...).

CSV文件的内容类似于:

Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
Run Code Online (Sandbox Code Playgroud)

我有一些代码可以使用逗号作为分隔符来分隔字段,但有些值实际上包含逗号,例如Adygeya, Republic.这些值用引号括起来表示其中的字符应该被视为字段的一部分,但我不知道如何解析它以将其考虑在内.

目前我有这个循环:

while IFS=, read province provinceCode criteriaId countryCode country
do
    echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
done < $input
Run Code Online (Sandbox Code Playgroud)

它为上面给出的样本数据生成此输出:

[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
["Adygeya] [ Republic"] [RU-AD] [21250] [RU,Russian Federation]
Run Code Online (Sandbox Code Playgroud)

如您所见,第三个条目的解析不正确.我希望它输出

[Adygeya Republic] [RU-AD] [21250] [RU] [Russian Federation]
Run Code Online (Sandbox Code Playgroud)

Dim*_*lov 8

如果你想在awk中完成所有操作(此脚本需要GNU awk 4才能按预期工作):

awk '{ 
 for (i = 0; ++i <= NF;) {
   substr($i, 1, 1) == "\"" && 
     $i = substr($i, 2, length($i) - 2)
   printf "[%s]%s", $i, (i < NF ? OFS : RS)
    }   
 }' FPAT='([^,]+)|("[^"]+")' infile
Run Code Online (Sandbox Code Playgroud)

样本输出:

% cat infile
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
% awk '{    
 for (i = 0; ++i <= NF;) {
   substr($i, 1, 1) == "\"" &&
     $i = substr($i, 2, length($i) - 2)
   printf "[%s]%s", $i, (i < NF ? OFS : RS)
    }
 }' FPAT='([^,]+)|("[^"]+")' infile
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
[Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation]
Run Code Online (Sandbox Code Playgroud)

使用Perl:

perl -MText::ParseWords -lne'
 print join " ", map "[$_]", 
   parse_line(",",0, $_);
  ' infile 
Run Code Online (Sandbox Code Playgroud)

这应该适用于您的awk版本(基于 c.us帖子,也删除了嵌入式逗号).

awk '{
 n = parse_csv($0, data)
 for (i = 0; ++i <= n;) {
    gsub(/,/, " ", data[i])
    printf "[%s]%s", data[i], (i < n ? OFS : RS)
    }
  }
function parse_csv(str, array,   field, i) { 
  split( "", array )
  str = str ","
  while ( match(str, /[ \t]*("[^"]*(""[^"]*)*"|[^,]*)[ \t]*,/) ) { 
    field = substr(str, 1, RLENGTH)
    gsub(/^[ \t]*"?|"?[ \t]*,$/, "", field)
    gsub(/""/, "\"", field)
    array[++i] = field
    str = substr(str, RLENGTH + 1)
  }
  return i
}' infile
Run Code Online (Sandbox Code Playgroud)


jay*_*ngh 5

细算@ Dimitre的解决方案在这里.你可以这样做 -

#!/usr/local/bin/gawk -f

BEGIN {
    FS="," 
    FPAT="([^,]+)|(\"[^\"]+\")"
    }

      {
    for (i=1;i<=NF;i++) 
        printf ("[%s] ",$i);
    print ""
    } 
Run Code Online (Sandbox Code Playgroud)

测试:

[jaypal:~/Temp] cat filename
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation

[jaypal:~/Temp] ./script.awk  filename
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia] 
[Piaui] [BR-PI] [20100] [BR] [Brazil] 
["Adygeya, Republic"] [RU-AD] [21250] [RU] [Russian Federation] 
Run Code Online (Sandbox Code Playgroud)

要删除,"您可以输出到sed.

[jaypal:~/Temp] ./script.awk  filename | sed 's#\"##g'
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia] 
[Piaui] [BR-PI] [20100] [BR] [Brazil] 
[Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation] 
Run Code Online (Sandbox Code Playgroud)