Select the first row of a matrix by the values in columns 1, 8 and 9 using awk or sed

vch*_*ngs 4 awk grep r sed

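I have a file in which many rows share the same values in columns 1, 8 and 9. The file has more than 60K rows in total. I want to reduce it so that, for each set of rows with identical columns 1, 8 and 9, only the first row is kept.

Input file: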

chr exon_start  exon_end    cnv tumor_DOC   control_DOC rationormalized_after_smoothing CNV_start   CNV_end seg_mean
chr1    762097  762270  3   821 717 1.456610215 762097  6706109 1.297328502
chr1    861281  861490  3   101 117 1.29744744  762097  6706109 1.297328502
chr1    7868860 7869039 2   78  119 1.123385189 7796356 8921423 1.088752407
chr1    7869841 7870041 2   140 169 1.123385189 7796356 8921423 1.088752407
chr1    7870411 7870596 2   83  163 1.123385189 7796356 8921423 1.088752407
chr1    7879297 7879467 2   290 360 1.024742732 7796356 8921423 1.088752407
chr1    21012415    21012609    3   89  135 1.230421209 19536504    21054539    1.247494175
chr1    21013924    21014512    3   234 219 1.359224182 19536504    21054539    1.247494175
chr1    21016588    21016803    3   172 179 1.230421209 19536504    21054539    1.247494175
chr1    21024895    21025101    3   147 120 1.230421209 19536504    21054539    1.247494175
chr14   20920169    20920704    3   211 214 1.254261327 20840851    20923828    1.288877208
chr14   20922716    20922919    3   253 262 1.228396526 20840851    20923828    1.288877208
chr14   20923634    20923828    3   188 201 1.206226522 20840851    20923828    1.288877208
chr14   20924141    20924329    2   244 344 0.902299535 20924141    21465086    1.088234038
chr14   20924787    20925701    2   314 306 1.305351797 20924141    21465086    1.088234038
chr14   20926636    20926836    2   134 136 1.206226522 20924141    21465086    1.088234038

Desired output:

chr exon_start  exon_end    cnv tumor_DOC   control_DOC rationormalized_after_smoothing CNV_start   CNV_end seg_mean
chr1    762097  762270  3   821 717 1.456610215 762097  6706109 1.297328502
chr1    7869841 7870041 2   140 169 1.123385189 7796356 8921423 1.088752407
chr1    21024895    21025101    3   147 120 1.230421209 19536504    21054539    1.247494175
chr14   20922716    20922919    3   253 262 1.228396526 20840851    20923828    1.288877208
chr14   20924141    20924329    2   244 344 0.902299535 20924141    21465086    1.088234038

For each distinct combination of column 1, column 8 and column 9, I want to keep only one row, preferably the first row each time the combination changes.

How can I do this in awk, sed or R?

fed*_*qui 5

An awk one-liner is enough:

awk '!seen[$1,$8,$9]++' file
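
The array seen[] keeps track of how many times each tuple (field 1, field 8, field 9) has appeared so far. Every time a tuple is seen, its counter is incremented by 1. Once the counter is already 1 or more, !value evaluates to False, so awk does not print the line.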

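The first time a tuple appears:

  • seen[$1,$8,$9] is 0 (the default for an unset element).
  • !0 evaluates to True, so the line is printed.
  • seen[$1,$8,$9] is incremented by 1.

On later occurrences:

  • seen[$1,$8,$9] is 1 or more.
  • !1 evaluates to False, so the line is not printed.
  • seen[$1,$8,$9] is incremented by 1.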
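
The two cases above can be made visible with a small trace. This is only an illustrative sketch: the three-column sample and the key ($1,$2) are made up for brevity and stand in for ($1,$8,$9) in the real file.

```shell
# Hypothetical sample data; the key is ($1,$2) instead of ($1,$8,$9).
sample='a x 1
a x 2
b y 3
a x 4'

# For every line, show the counter value before the increment and the
# decision that '!seen[...]++' would make for it.
trace=$(printf '%s\n' "$sample" | awk '{
    count = seen[$1,$2]++      # post-increment: count holds the old value
    print count, (count ? "skip" : "print"), $0
}')
printf '%s\n' "$trace"
```

Only the lines whose old counter value is 0 would be printed by the one-liner; the last line is skipped even though it is not adjacent to the earlier "a x" lines.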

Test

$ awk '!seen[$1,$8,$9]++' a
chr exon_start  exon_end    cnv tumor_DOC   control_DOC rationormalized_after_smoothing CNV_start   CNV_end seg_mean
chr1    762097  762270  3   821 717 1.456610215 762097  6706109 1.297328502
chr1    7868860 7869039 2   78  119 1.123385189 7796356 8921423 1.088752407
chr1    21012415    21012609    3   89  135 1.230421209 19536504    21054539    1.247494175
chr14   20920169    20920704    3   211 214 1.254261327 20840851    20923828    1.288877208
chr14   20924141    20924329    2   244 344 0.902299535 20924141    21465086    1.088234038
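
The question also mentions keeping the first row "each time the combination changes". If that is meant literally, a variant that compares each key only with the previous line may be closer to the intent. The sketch below is an assumption about that reading, not part of the original answer; the two-column sample and the variable names key and prev are made up for illustration.

```shell
# Hypothetical sample: the key k1 comes back after k2.
sample='k1 r1
k1 r2
k2 r3
k1 r4'

# Keep the first line of every consecutive run of equal keys.
# For the real file the key would be: $1 SUBSEP $8 SUBSEP $9
first_of_run=$(printf '%s\n' "$sample" | awk '{
    key = $1
    if (NR == 1 || key != prev) print
    prev = key
}')

# For comparison: the seen[] approach keeps one line per key overall.
first_overall=$(printf '%s\n' "$sample" | awk '!seen[$1]++')

printf 'per run:\n%s\n' "$first_of_run"
printf 'overall:\n%s\n' "$first_overall"
```

On input where equal keys are always adjacent, as in the sample file of the question, both approaches give the same result. The run-based variant does not need to store every key in memory.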


Rol*_*and 4

Import the data into R (you would specify your file instead):

DF <- read.table(text = "chr exon_start  exon_end    cnv tumor_DOC   control_DOC rationormalized_after_smoothing CNV_start   CNV_end seg_mean
chr1    762097  762270  3   821 717 1.456610215 762097  6706109 1.297328502
chr1    861281  861490  3   101 117 1.29744744  762097  6706109 1.297328502
chr1    7868860 7869039 2   78  119 1.123385189 7796356 8921423 1.088752407
chr1    7869841 7870041 2   140 169 1.123385189 7796356 8921423 1.088752407
chr1    7870411 7870596 2   83  163 1.123385189 7796356 8921423 1.088752407
chr1    7879297 7879467 2   290 360 1.024742732 7796356 8921423 1.088752407
chr1    21012415    21012609    3   89  135 1.230421209 19536504    21054539    1.247494175
chr1    21013924    21014512    3   234 219 1.359224182 19536504    21054539    1.247494175
chr1    21016588    21016803    3   172 179 1.230421209 19536504    21054539    1.247494175
chr1    21024895    21025101    3   147 120 1.230421209 19536504    21054539    1.247494175
chr14   20920169    20920704    3   211 214 1.254261327 20840851    20923828    1.288877208
chr14   20922716    20922919    3   253 262 1.228396526 20840851    20923828    1.288877208
chr14   20923634    20923828    3   188 201 1.206226522 20840851    20923828    1.288877208
chr14   20924141    20924329    2   244 344 0.902299535 20924141    21465086    1.088234038
chr14   20924787    20925701    2   314 306 1.305351797 20924141    21465086    1.088234038
chr14   20926636    20926836    2   134 136 1.206226522 20924141    21465086    1.088234038", header = TRUE)

Extract the rows whose columns 1, 8 and 9 do not duplicate those of an earlier row:

DF[!duplicated(DF[, c(1,8,9)]),]
#     chr exon_start exon_end cnv tumor_DOC control_DOC rationormalized_after_smoothing CNV_start  CNV_end seg_mean
#1   chr1     762097   762270   3       821         717                       1.4566102    762097  6706109 1.297329
#3   chr1    7868860  7869039   2        78         119                       1.1233852   7796356  8921423 1.088752
#7   chr1   21012415 21012609   3        89         135                       1.2304212  19536504 21054539 1.247494
#11 chr14   20920169 20920704   3       211         214                       1.2542613  20840851 20923828 1.288877
#14 chr14   20924141 20924329   2       244         344                       0.9022995  20924141 21465086 1.088234