How to split one column into multiple columns

nic*_*ran 4 r dataframe dplyr

My list of TCGA tumor sample IDs looks like this:

SAMPLE_ID
TCGA.13.1407.01A.01R.1565.13
TCGA.24.2254.01A.01R.1568.13
TCGA.24.0982.01A.01R.1565.13
TCGA.24.1847.01A.01R.1566.13
TCGA.24.2289.01A.01R.1568.13
TCGA.31.1959.01A.01R.1568.13

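I want to split each sample ID into: Project, TSS, Participant, Sample, Vial, Portion, Analyte, Plate, Center. For example, the first row would become: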

SAMPLE_ID                       Project TSS Participant Sample  Vial    Portion Analyte Plate   Center
TCGA.13.1407.01A.01R.1565.13    TCGA    13  1407        01      A       01      R       1565    13

I tried the following:

library(tidyr)
library(dplyr)
df = data.frame(SAMPLE_ID = c("TCGA.13.1407.01A.01R.1565.13", "TCGA.24.2254.01A.01R.1568.13", "TCGA.24.0982.01A.01R.1565.13",
                                "TCGA.24.1847.01A.01R.1566.13", "TCGA.24.2289.01A.01R.1568.13", "TCGA.31.1959.01A.01R.1568.13"))

Then,

result = df %>% separate(SAMPLE_ID,
                            into = c("Project", "TSS", "Participant", "Sample", "Vial",
                                     "Portion", "Analyte", "Plate", "Center"),
                            sep = "\\.")

But it gives me:

Project TSS Participant Sample  Vial    Portion Analyte Plate   Center
<chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>
TCGA    13  1407    01A 01R 1565    13  NA  NA
TCGA    24  2254    01A 01R 1568    13  NA  NA
TCGA    24  0982    01A 01R 1565    13  NA  NA
TCGA    24  1847    01A 01R 1566    13  NA  NA
TCGA    24  2289    01A 01R 1568    13  NA  NA
TCGA    31  1959    01A 01R 1568    13  NA  NA

jay*_*.sf 5

Using read.fwf (as @IRTFM suggested) could be a good solution. First, we can remove the periods and save the result to a tempfile.

> write.table(as.matrix(dat$SAMPLE_ID |> gsub('\\.', '', x=_)), tmp <- tempfile(),
+             col.names=FALSE, row.names=FALSE, quote=FALSE)
> 
> read.fwf(tmp, widths=c(4, 2, 4, 2, 1, 2, 1, 4, 2), header=FALSE, 
+          col.names=c("Project", "TSS", "Participant", "Sample", "Vial",
+                      "Portion", "Analyte", "Plate", "Center"))
  Project TSS Participant Sample Vial Portion Analyte Plate Center
1    TCGA  13        1407      1    A       1       R  1565     13
2    TCGA  24        2254      1    A       1       R  1568     13
3    TCGA  24         982      1    A       1       R  1565     13
4    TCGA  24        1847      1    A       1       R  1566     13
5    TCGA  24        2289      1    A       1       R  1568     13
6    TCGA  31        1959      1    A       1       R  1568     13

You could probably read your file directly and skip the write.table step.
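As a minimal sketch of skipping the temporary file: read.fwf also accepts a connection, so the IDs can be passed through textConnection(), and negative widths skip the one-character periods so that no gsub is needed. The names `ids`, `con`, and `res` are made up for this illustration, and only two of the IDs are used. With the default settings the numeric-looking fields are still converted to integers, so leading zeros are dropped just as in the output above.

```r
ids <- c("TCGA.13.1407.01A.01R.1565.13", "TCGA.24.0982.01A.01R.1565.13")

# Read straight from the character vector instead of a file on disk.
con <- textConnection(ids)

# A negative width tells read.fwf to skip that many characters (the periods).
res <- read.fwf(con,
                widths = c(4, -1, 2, -1, 4, -1, 2, 1, -1, 2, 1, -1, 4, -1, 2),
                header = FALSE,
                col.names = c("Project", "TSS", "Participant", "Sample", "Vial",
                              "Portion", "Analyte", "Plate", "Center"))
close(con)

res
```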

If you rely on character columns and leading zeros, you can add some cleanup.

> read.fwf(tmp, widths=c(4, 2, 4, 2, 1, 2, 1, 4, 2), header=FALSE, 
+          col.names=c("Project", "TSS", "Participant", "Sample", "Vial",
+                      "Portion", "Analyte", "Plate", "Center")) |>
+   transform(TSS=sprintf('%02d', TSS), 
+             Participant=sprintf('%04d', Participant), 
+             Sample=sprintf('%02d', Sample),
+             Portion=sprintf('%02d', Portion),
+             Plate=as.character(Plate),
+             Center=as.character(Center))
  Project TSS Participant Sample Vial Portion Analyte Plate Center
1    TCGA  13        1407     01    A      01       R  1565     13
2    TCGA  24        2254     01    A      01       R  1568     13
3    TCGA  24        0982     01    A      01       R  1565     13
4    TCGA  24        1847     01    A      01       R  1566     13
5    TCGA  24        2289     01    A      01       R  1568     13
6    TCGA  31        1959     01    A      01       R  1568     13


> unlink(tmp) ## unlink if no longer needed
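A possible alternative to the sprintf cleanup, sketched here under the assumption that every field should stay a string: read.fwf passes extra arguments on to read.table, so colClasses = "character" keeps the leading zeros from the start. The names `ids`, `con`, and `res_chr` are hypothetical, and a text connection is used here in place of the tempfile.

```r
ids <- c("TCGA.13.1407.01A.01R.1565.13", "TCGA.24.0982.01A.01R.1565.13")

# Strip the periods, then read the fixed-width fields from memory.
con <- textConnection(gsub(".", "", ids, fixed = TRUE))

# colClasses = "character" disables type conversion for all columns,
# so "0982" and "01" are kept as written.
res_chr <- read.fwf(con,
                    widths = c(4, 2, 4, 2, 1, 2, 1, 4, 2),
                    header = FALSE,
                    colClasses = "character",
                    col.names = c("Project", "TSS", "Participant", "Sample", "Vial",
                                  "Portion", "Analyte", "Plate", "Center"))
close(con)

str(res_chr)
```

Whether this is preferable depends on what happens downstream: if Plate or Center are needed as numbers later, they would have to be converted explicitly.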

Data:

> dput(dat)
structure(list(SAMPLE_ID = c("TCGA.13.1407.01A.01R.1565.13", 
"TCGA.24.2254.01A.01R.1568.13", "TCGA.24.0982.01A.01R.1565.13", 
"TCGA.24.1847.01A.01R.1566.13", "TCGA.24.2289.01A.01R.1568.13", 
"TCGA.31.1959.01A.01R.1568.13")), class = "data.frame", row.names = c(NA, 
-6L))