Extract the first three columns from all TSV files in a folder

Question

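I have several TSV files in one folder, totalling more than 50 GB. To reduce memory use when loading them into R, I want to extract only the first three columns of each file.

How can I extract the columns from all the files at once in the terminal? I am running Ubuntu 16.04.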

Answer 1

Something like the following should work:

#!/bin/bash
# Iterate over the glob directly so that file names containing spaces are not split.
for f in /path/to/*.tsv
do
    # Do something for each file. In our case, just print the first three fields:
    cut -f1-3 < "$f"
done

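(For more on iterating over files in bash, see this web page.)

M. Becerra's answer contains a one-liner that achieves the same thing with the find command. My own answer may therefore be more complicated than necessary, unless you want to do extra processing for each file (for example, collecting some statistics while iterating over the files).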
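As an illustration of that kind of per-file processing, here is a minimal sketch that writes one trimmed file per input into a separate directory and reports a row count for each. The function name trim_tsv_dir and the demo directory layout are hypothetical, not part of the original answer.

```shell
#!/bin/bash
# Sketch: keep the first three columns of every TSV in a directory,
# writing one output file per input and printing a row count.
# trim_tsv_dir and the demo paths below are illustrative names only.
set -euo pipefail

trim_tsv_dir() {
    local src=$1 dst=$2 f out
    mkdir -p "$dst"
    for f in "$src"/*.tsv; do
        [ -e "$f" ] || continue            # the glob matched nothing
        out="$dst/$(basename "$f")"
        cut -f1-3 -- "$f" > "$out"         # tab is cut's default delimiter
        printf '%s\t%s rows\n' "$(basename "$f")" "$(wc -l < "$out")"
    done
}

# Small demo on throwaway data.
demo=$(mktemp -d)
mkdir "$demo/in"
printf 'a\tb\tc\td\n1\t2\t3\t4\n' > "$demo/in/sample.tsv"
trim_tsv_dir "$demo/in" "$demo/out"
```

Keeping the outputs in a separate directory leaves the originals untouched, so the trimmed copies can be loaded into R and the source files kept as a backup.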

Edit: if you want to overwrite the actual files, you can use something like the following script:

#!/bin/bash
for f in /path/to/*.tsv
do
    # Write the first three fields to a temporary file, then replace the
    # original with it only if cut succeeded:
    cut -f1-3 < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done

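The cut line writes its output to a file named after the original with .tmp appended; that temporary file then replaces the original under its original name.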

Answer 2

Tob*_*zel 5

This looks like a perfect use case for the cut utility.

You can use it as follows:

cut -d$'\t' -f 1-3 folder/*

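Here -d specifies the field delimiter (a tab in this case, which is also cut's default), -f specifies the fields to extract, and folder/* is a glob matching all the files to parse.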
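One thing to be aware of: when cut is given several files, it writes all of their rows to a single output stream, in the order the glob expands. The sketch below shows this on two throwaway files; the file names and the variable combined are made up for the demonstration.

```shell
#!/bin/bash
# Sketch: cut over a glob concatenates the trimmed rows of every file.
set -euo pipefail

demo=$(mktemp -d)
printf 'a\tb\tc\td\n' > "$demo/one.tsv"
printf 'e\tf\tg\th\n' > "$demo/two.tsv"

# -d is omitted because tab is already cut's default delimiter.
combined=$(cut -f 1-3 "$demo"/*.tsv)
printf '%s\n' "$combined"
```

If the files need to stay separate, a loop that redirects each file's output individually is the simpler option.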

Answer 3

M. *_*rra 3

You can do:

find ./ -type f -name "*.tsv" -exec awk -F'\t' -v OFS='\t' '{ print $1, $2, $3 }' {} \;

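You can run it from the directory containing the files, or pass an absolute path instead.

If you want to save the result to a file, you can redirect the output: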

find ./ -type f -name "*.tsv" -exec awk -F'\t' -v OFS='\t' '{ print $1, $2, $3 }' {} \; >> someOtherFile
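The redirection above collects every file's rows into one output file. If one trimmed file per input is preferred, a possible variant is to let find hand the file names to a small inline shell script. This is only a sketch: the .first3.txt suffix is an arbitrary choice, picked so that the outputs do not match *.tsv themselves.

```shell
#!/bin/bash
# Sketch: one trimmed output per TSV found under a directory tree.
set -euo pipefail

demo=$(mktemp -d)
mkdir -p "$demo/sub"
printf 'a\tb\tc\td\n' > "$demo/top.tsv"
printf 'e\tf\tg\th\n' > "$demo/sub/nested.tsv"

# "{} +" passes many file names to a single sh invocation;
# "for f do" loops over those positional arguments.
find "$demo" -type f -name '*.tsv' -exec sh -c '
    for f do
        cut -f1-3 -- "$f" > "${f%.tsv}.first3.txt"
    done
' sh {} +
```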

