If a column has multiple values, duplicate the row once for each value

sch*_*ity 5 sed awk text-processing text-formatting

I have a file in the following format, with the columns separated by tabs:

C1      C2      C3
a       b,c     d
e       f,g,h   i
j       k       l
...

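Now I need each row repeated once per comma-separated value in the second column (when it has more than one). Each resulting row must carry exactly one of the values and none of the others. The result would look like this: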

C1      C2      C3
a       b       d
a       c       d
e       f       i
e       g       i
e       h       i
j       k       l
...
...

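Since this was needed for work as soon as possible, I just put together a don't-do-this-at-home script that reads the file line by line with while, because I lack skills in awk and have not explored possible solutions with other tools. The script is as follows: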

I am revising the script in the meantime.

# DON'T DO THIS AT HOME SCRIPT
> duplicados.txt
while IFS= read -r line; do
  # get the value of the column of interest
  cues="$(echo "$line" | awk -F'\t' '{ print $18 }')"
  # if the column has commas then it has multiple values
  if [[ "$cues" =~ , ]]; then
    # count the commas
    c=$(printf "%s" "$cues" | sed 's/[^,]*//g' | wc -c)
    # loop according to the number of commas
    for i in $(seq $(($c + 1))); do
      # get each value of the column of interest according to the position
      cue="$(echo "$cues" | awk -F',' -v c=$i '{ print $c; ++c }')"
      # save the line to a file substituting the whole column for the value
      echo "$line" | sed "s;$cues;$cue;" >> duplicados.txt
    done
    continue
  fi
  # save the single value lines
  echo "$line" >> duplicados.txt
done < inmuebles.txt

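This gives me the desired result (as far as I can tell). As you can imagine, the script is slow and very inefficient. How could I do this with awk or other tools?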


A sample of the real data looks like this; the column of interest is number 18:

1409233 UNION   VIAMONTE    Estatal Provincial  DGEP    3321    VIAMONTE                            -33.7447365;-63.0997115 Rural Aglomerado    140273900   140273900-ESCUELA NICOLAS AVELLANEDA
1402961 UNION   SAN MARCOS SUD  Estatal Provincial  DGEA, DGEI, DGEP    3029, 3311, Z11 SAN MARCOS SUD                          -32.629557;-62.483976 / -32.6302699949582;-62.4824499999125 / -32.632417;-62.484932 Urbano  140049404, 140164000, 140170100, 140173100  140049404-C.E.N.M.A. N° 201 ANEXO SEDE SAN MARCOS SUD, 140164000-C.E.N.P.A. N° 13 CASA DE LA CULTURA(DOC:BERSANO), 140170100-ESCUELA HIPOLITO BUCHARDO, 140173100-J.DE INF. HIPOLITO BUCHARDO
1402960 UNION   SAN ANTONIO DE LITIN    Estatal Provincial  DGEA, DGEI, DGETyFP 3029, TZONAXI, Z11  SAN ANTONIO DE LITIN    3601300101020009    360102097366    0250347         SI / SI -32.212126;-62.635999 / -32.2122558;-62.6360432 / -32.2131931096409;-62.6291815804363   Rural Aglomerado    140049401, 140313000, 140313300, 140483400, 140499800   140049401-C.E.N.M.A. N° 201 ANEXO SAN ANTONIO DE LITIN, 140313000-I.P.E.A. Nº 214. MANUEL BELGRANO, 140313300-J.DE INF. PABLO A. PIZZURNO, 140483400-C.E.N.P.A. DE SAN ANTONIO DE LITIN, 140499800-C.E.N.P.A. B DE SAN ANTONIO DE LITIN

ste*_*ver 10

You can do this in awk by splitting the compound column and looping over the result:

awk -F'\t' 'BEGIN{OFS=FS} {n=split($2,a,/,/); for(i=1;i<=n;i++){$2 = a[i]; print}}' file
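For the real data in the question, the same split-and-loop idea might be adapted as in the sketch below. It rests on a few assumptions that are not part of the answer: the column number is passed in as a variable (18 in the question), the separator regex `/, */` also swallows the space that appears to follow each comma in the sample rows, and rows whose column is empty are printed unchanged (with a plain `split` loop, an empty field yields zero pieces and the row would not be printed at all). The names `explode_col`, `col`, and `vals` are purely illustrative.

```shell
# explode_col COL FILE  (hypothetical helper, not from the original answer)
# Prints FILE with every row repeated once per comma-separated value found
# in tab-separated column COL; each copy carries exactly one of the values.
explode_col() {
  awk -F'\t' -v col="$1" '
    BEGIN { OFS = FS }                   # keep tabs on output
    {
      n = split($col, vals, /, */)       # assumed separator: comma, optional spaces
      if (n == 0) { print; next }        # empty column: keep the row as it is
      for (i = 1; i <= n; i++) {
        $col = vals[i]                   # replace the whole column with one value
        print
      }
    }
  ' "$2"
}

# Possible usage with the file names from the question:
# explode_col 18 inmuebles.txt > duplicados.txt
```

Because awk reads the file once and never spawns a subprocess per line, this should avoid most of the overhead of the while-read loop in the question.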

Perhaps more cleanly, you can do this with Miller, specifically using its nest verb:

$ cat file
C1      C2      C3
a       b,c     d
e       f,g,h   i
j       k       l

$ mlr --tsv nest --explode --values --across-records --nested-fs ',' -f C2 file
C1      C2      C3
a       b       d
a       c       d
e       f       i
e       g       i
e       h       i
j       k       l

More compactly, --explode --values --across-records --nested-fs ',' can be replaced with --evar ','
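Written out in full, the shorthand form might look like the sketch below. The wrapper function `explode_with_mlr` is a hypothetical name added only so the command can be reused; the `mlr` invocation inside it is the command from above with the four long options swapped for `--evar`, and it assumes Miller is installed.

```shell
# explode_with_mlr FIELD FILE  (hypothetical wrapper name)
# Splits the comma-separated values of the named TSV field across records,
# using the --evar shorthand instead of the four long nest options.
explode_with_mlr() {
  mlr --tsv nest --evar ',' -f "$1" "$2"
}

# Possible usage with the sample file shown earlier:
# explode_with_mlr C2 file
```

Unlike the awk approach, Miller addresses the column by its header name, so this form only fits files that have a header row.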