bash: add/append new columns from other files

Question

I have a single-column file, name.txt, for example

A
B
C
D
E
F

Then I have many other files, e.g. x.txt, y.txt and z.txt

x.txt contains

A 1
C 3
D 2

y.txt contains

A 1
B 4
E 3

z.txt contains

B 2
D 2
F 1

The desired output is (fill in 0 where there is no mapping)

A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1

Can this be done in bash? (Maybe with awk?)
Many thanks!

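First edit - my attempt
Since I am quite new to bash, I am having a hard time working out a possible solution with awk. I am more familiar with R, where this can be done with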

namematrix[namematrix[,1]==xmatrix[,1],]

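In any case, I really appreciate the help below, which has taught me more about awk and join!

Second edit - found a super efficient approach!

Fortunately, inspired by several brilliant answers below, I worked out a computationally very efficient approach, described here. It may help others with a similar problem, especially when they are dealing with a very large number of large files.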

First, create (touch) a script named join_awk.bash

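The body of join_awk.bash is not reproduced here. A minimal sketch of what such a script could look like, assuming it uses POSIX `join` to left-join the name file with one data file, fills missing values with 0, and prints only the value column with awk. The script contents and the scratch directory are assumptions for illustration, not the asker's actual code. `join` needs both inputs sorted on the key, which the sample files are.

```shell
# Work in a scratch directory with the sample data from the question.
cd "$(mktemp -d)" || exit 1
printf '%s\n' A B C D E F > name.txt
printf '%s\n' 'A 1' 'C 3' 'D 2' > x.txt

# Hypothetical contents of join_awk.bash (an assumption, see the lead-in).
cat > join_awk.bash <<'EOF'
#!/bin/bash
# $1 = name file, $2 = data file; both must be sorted on the first field.
# -a 1     : also keep lines of $1 that have no match in $2
# -e 0     : print 0 for fields that are missing
# -o 0,2.2 : output the join key and the second field of $2
join -a 1 -e 0 -o 0,2.2 "$1" "$2" | awk '{ print $2 }'
EOF

# Run the script for one data file and show the resulting value column.
col=$(bash join_awk.bash name.txt x.txt)
printf '%s\n' "$col"
```

For the sample data this prints the six values 1, 0, 3, 2, 0, 0, one per line, in the order of name.txt.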

For example, running this bash script on name.txt and x.txt

join_awk.bash name.txt x.txt

produces

1
0
3
2
0
0

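Note that I keep only the second column here to save disk space, because in my dataset the first column holds very long names that would take up a lot of disk space.

Then simply run it for every data file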

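The command for this step is not shown. A minimal sketch, assuming the per-file join is run once for each data file and each result goes to its own column file. It uses plain shell background jobs and `wait` instead of GNU parallel so that it runs without extra tools; the `.col` file names and the `join_col` function (a stand-in for join_awk.bash) are hypothetical.

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' A B C D E F > name.txt
printf '%s\n' 'A 1' 'C 3' 'D 2' > x.txt
printf '%s\n' 'A 1' 'B 4' 'E 3' > y.txt
printf '%s\n' 'B 2' 'D 2' 'F 1' > z.txt

# Stand-in for join_awk.bash: value column of $2 aligned to $1, 0 if absent.
join_col() {
  join -a 1 -e 0 -o 2.2 "$1" "$2"
}

# Start one independent background job per data file, then wait for all.
for f in x.txt y.txt z.txt; do
  join_col name.txt "$f" > "${f%.txt}.col" &
done
wait

y_col=$(cat y.col)
printf '%s\n' "$y_col"
```

Because every job reads only name.txt and its own data file and writes its own output file, the jobs do not depend on each other. With GNU parallel installed, the loop would roughly correspond to `parallel 'bash join_awk.bash name.txt {} > {.}.col' ::: x.txt y.txt z.txt`, again with hypothetical output names.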

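This was inspired by the brilliant answer below that uses GNU parallel and join. The difference is that the answer below has to specify -j1 for parallel because of its serial appending logic, which makes it not truly "parallel". It also gets slower and slower as the serial appending continues. In contrast, here each file is processed separately and in parallel. With multiple CPUs and a large number of large files, this can be very fast.

Finally, simply paste all the single-column output files together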

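The paste command itself is not shown. A minimal sketch, assuming the per-file columns were written to hypothetical files x.col, y.col and z.col (created by hand here so the example is self-contained):

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' A B C D E F > name.txt
printf '%s\n' 1 0 3 2 0 0 > x.col
printf '%s\n' 1 4 0 0 3 0 > y.col
printf '%s\n' 0 2 0 2 0 1 > z.col

# paste reads the files side by side; -d ' ' joins the columns with a
# space instead of the default tab.
merged=$(paste -d ' ' name.txt x.col y.col z.col)
printf '%s\n' "$merged"
```

This relies on every column file having exactly one line per line of name.txt, in the same order, since paste matches lines purely by position.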

This will also be very fast, because paste is inherently parallel.

Answer 1

anu*_*ava 13

You can use this awk:

awk 'NF == 2 {
   map[FILENAME,$1] = $2
   next
}
{
   printf "%s", $1
   for (f=1; f<ARGC-1; ++f)
      printf "%s", OFS map[ARGV[f],$1]+0
   print ""
}' {x,y,z}.txt name.txt

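Two awk idioms carry this answer: a comma inside the array brackets builds a single key joined by SUBSEP, and adding 0 turns an unset (empty) entry into the number 0. A small sketch to illustrate both; the keys used here are made up for the demonstration:

```shell
idioms=$(awk 'BEGIN {
  map["x.txt", "A"] = 1                 # stored under the key "x.txt" SUBSEP "A"
  key = "x.txt" SUBSEP "A"
  print ((key in map) ? "same key" : "different key")
  print map["x.txt", "A"] + 0           # existing entry: prints 1
  print map["y.txt", "Q"] + 0           # never assigned: empty string + 0 is 0
}')
printf '%s\n' "$idioms"
```

In the answer's program, `map[ARGV[f],$1]+0` looks up the value stored for file number f and the current name, which is why the data files are listed before name.txt on the command line.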

This is a very neat approach. Nicely done. (And it is POSIX) (3 upvotes)
One day I will have digested enough to see the elegant approach first `:)` (2 upvotes)

Answer 2

Rav*_*h13 10

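Adding another approach. Could you please try the following, written and tested with the shown samples. IMHO it should work in any awk, though I only have GNU awk version 3.1. It is a very simple and common approach: build an array while reading the first (main) Input_file, then for each following Input_file append a 0 to every element of that array that was not found in that particular Input_file. Tested only with the small given samples.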

awk '
function checkArray(array){
  for(i in array){
    if(!(i in found)){ array[i]=array[i] OFS "0" }
  }
}
FNR==NR{
  arr[$0]
  next
}
foundCheck && FNR==1{
 checkArray(arr)
  delete found
  foundCheck=""
}
{
  if($1 in arr){
    arr[$1]=(arr[$1] OFS $2)
    found[$1]
    foundCheck=1
    next
  }
}
END{
  checkArray(arr)
  for(key in arr){
    print key arr[key]
  }
}
' name.txt x.txt y.txt z.txt

Explanation: adding a detailed explanation of the above.

awk '                               ##Starting awk program from here.
function checkArray(array){         ##Creating a function named checkArray from here.
  for(i in array){                  ##Traversing through array here.
    if(!(i in found)){ array[i]=array[i] OFS "0" }   ##Checking condition if key is NOT in found then append a 0 in that specific value.
  }
}
FNR==NR{                            ##Checking condition if FNR==NR which will be TRUE when name.txt is being read.
  arr[$0]                           ##Creating array with name arr with index of current line.
  next                              ##next will skip all further statements from here.
}
foundCheck && FNR==1{               ##Checking condition if foundCheck is SET and this is first line of Input_file.
 checkArray(arr)                    ##Calling function checkArray by passing arr array name in it.
  delete found                      ##Deleting found array to get rid of previous values.
  foundCheck=""                     ##Nullifying foundCheck here.
}
{
  if($1 in arr){                    ##Checking condition if 1st field is present in arr.
    arr[$1]=(arr[$1] OFS $2)        ##Appending 2nd field value to arr with index of $1.
    found[$1]                       ##Adding 1st field to found as an index here.
    foundCheck=1                    ##Setting foundCheck here.
    next                            ##next will skip all further statements from here.
  }
}
END{                                ##Starting END block of this program from here.
  checkArray(arr)                   ##Calling function checkArray by passing arr array name in it.
  for(key in arr){                  ##Traversing through arr here.
    print key arr[key]              ##Printing index and its value (the value already starts with OFS).
  }
}
' name.txt x.txt y.txt z.txt        ##Mentioning Input_file names here.

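Note that awk does not define the iteration order of `for (key in arr)`, so the rows are not guaranteed to come out in name.txt order. A minimal sketch of one way around that, assuming you want the rows in the order of name.txt. It is a variant, not this answer's code, and the array names `order` and `val` are made up here:

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' A B C D E F > name.txt
printf '%s\n' 'A 1' 'C 3' 'D 2' > x.txt
printf '%s\n' 'A 1' 'B 4' 'E 3' > y.txt
printf '%s\n' 'B 2' 'D 2' 'F 1' > z.txt

ordered=$(awk '
FNR == NR { order[++n] = $1; next }      # name.txt: remember keys in input order
FNR == 1  { ++col }                      # first line of a data file starts a new column
          { val[$1, col] = $2 }          # store the value per (key, column)
END {
  for (r = 1; r <= n; r++) {             # numeric loop keeps the name.txt order
    line = order[r]
    for (c = 1; c <= col; c++)
      line = line OFS (val[order[r], c] + 0)   # unset entries become 0
    print line
  }
}' name.txt x.txt y.txt z.txt)
printf '%s\n' "$ordered"
```

The column counter advances on the first line of each data file, so a data file with no matching names still gets its own column of zeros; a completely empty file, however, never triggers `FNR == 1` and is skipped.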

Answer 3

Dav*_*ica 7

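Yes, you can do it, and yes, awk is the tool. Using arrays together with the per-file line number (FNR, the file's number of records) and the overall line number (NR, the number of records), you can read all letters from name.txt into the a[] array. Then, keeping track of the file number in the variable fno, you append all the values from x.txt. Before processing the first line of the next file (y.txt), you loop over all letters, and for those not seen in the previous file you append a 0, then continue processing as normal. Repeat this for each additional file.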
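
To make the FNR/NR distinction concrete, here is a small sketch that prints both counters for two shortened sample files (recreated here so the snippet is self-contained). NR keeps counting across files, while FNR restarts at 1 for each file, which is why FNR==NR holds only while the first file is being read.

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' A B C > name.txt          # shortened sample
printf '%s\n' 'A 1' 'C 3' > x.txt

# Print the file name and both record counters for every input line.
counters=$(awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' name.txt x.txt)
printf '%s\n' "$counters"
```

The first line of x.txt is reported as NR=4 FNR=1, which is exactly the condition (`FNR == 1` with NR larger than FNR) that marks the start of a new data file.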

A further line-by-line explanation is given in the comments:

awk '
    FNR==NR {                           # first file
        a[$1] = ""                      # fill array with letters as index
        fno = 1                         # set file number counter
        next                            # get next record (line)
    }
    FNR == 1 { fno++ }                  # first line in file, increment file count
    fno > 2 && FNR == 1 {               # file no. 3+ (not run on x.txt)
        for (i in a)                    # loop over letters 
            if (!(i in seen))           # if not in seen array
                a[i] = a[i]" "0         # append 0
        delete seen                     # delete seen array
    }
    $1 in a {                           # if line begins with letter in array
        a[$1] = a[$1]" "$2              # append second field
        seen[$1]++                      # add letter to seen array
    }
END {
    for (i in a)                        # place zeros for last column
        if (!(i in seen))
            a[i] = a[i]" "0
    for (i in a)                        # print results
        print i a[i]
}' name.txt x.txt y.txt z.txt

Example Use/Output

Simply copy the above and middle-mouse paste it into an xterm whose current directory contains your files, and you will receive:

A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1

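Creating a Self-Contained Script

If you would like to create a script to run instead of pasting at the command line, you simply include the contents (without the surrounding single quotes) and then make the file executable. For example, you put the interpreter on the first line, followed by the contents: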

#!/usr/bin/awk -f

FNR==NR {                           # first file
    a[$1] = ""                      # fill array with letters as index
    fno = 1                         # set file number counter
    next                            # get next record (line)
}
FNR == 1 { fno++ }                  # first line in file, increment file count
fno > 2 && FNR == 1 {               # file no. 3+ (not run on x.txt)
    for (i in a)                    # loop over letters 
        if (!(i in seen))           # if not in seen array
            a[i] = a[i]" "0         # append 0
    delete seen                     # delete seen array
}
$1 in a {                           # if line begins with letter in array
    a[$1] = a[$1]" "$2              # append second field
    seen[$1]++                      # add letter to seen array
}
END {
    for (i in a)                    # place zeros for last column
        if (!(i in seen))
            a[i] = a[i]" "0
    for (i in a)                    # print results
        print i a[i]
}

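awk processes the filenames given as arguments in the order given.

Example Use/Output

Using the script file (I put it in names.awk and then used chmod +x names.awk to make it executable), you would then run: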

$ ./names.awk name.txt x.txt y.txt z.txt
A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1

Let me know if you have further questions.
