My list of TCGA tumor sample IDs is as follows:
SAMPLE_ID
TCGA.13.1407.01A.01R.1565.13
TCGA.24.2254.01A.01R.1568.13
TCGA.24.0982.01A.01R.1565.13
TCGA.24.1847.01A.01R.1566.13
TCGA.24.2289.01A.01R.1568.13
TCGA.31.1959.01A.01R.1568.13
I want to split each sample ID into: Project, TSS, Participant, Sample, Vial, Portion, Analyte, Plate, Center. For example, the first row would be:
SAMPLE_ID Project TSS Participant Sample Vial Portion Analyte Plate Center
TCGA.13.1407.01A.01R.1565.13 TCGA 13 1407 01 A 01 R 1565 13
I tried the following:
library(tidyr)
library(dplyr)
df = data.frame(SAMPLE_ID = c("TCGA.13.1407.01A.01R.1565.13", "TCGA.24.2254.01A.01R.1568.13", "TCGA.24.0982.01A.01R.1565.13",
"TCGA.24.1847.01A.01R.1566.13", "TCGA.24.2289.01A.01R.1568.13", "TCGA.31.1959.01A.01R.1568.13"))
Then,
result = df %>% separate(SAMPLE_ID,
into = c("Project", "TSS", "Participant", "Sample", "Vial",
"Portion", "Analyte", "Plate", "Center"),
sep = "\\.")
But it gives me:
Project TSS Participant Sample Vial Portion Analyte Plate Center
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
TCGA 13 1407 01A 01R 1565 13 NA NA
TCGA 24 2254 01A 01R 1568 13 NA NA
TCGA 24 0982 01A 01R 1565 13 NA NA
TCGA 24 1847 01A 01R 1566 13 NA NA
TCGA 24 2289 01A 01R 1568 13 NA NA
TCGA 31 1959 01A 01R 1568 13 NA NA
Using read.fwf (as @IRTFM suggested) may be a good solution. First, we can remove the periods and save the result to a tempfile.
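The NA columns above come from the separator: splitting on periods alone yields only seven pieces, because Sample and Vial (01A) and Portion and Analyte (01R) are not separated by a period. One possible base R sketch, assuming every ID follows the layout shown in the question, captures each field with a regular expression via strcapture. The names pattern, proto and parts are only illustrative.

```r
ids <- c("TCGA.13.1407.01A.01R.1565.13", "TCGA.24.2254.01A.01R.1568.13",
         "TCGA.24.0982.01A.01R.1565.13")

# One capture group per field; "01A" and "01R" are each split into
# two digits plus one capital letter (an assumption about the ID layout).
pattern <- paste0("^([^.]+)\\.([^.]+)\\.([^.]+)\\.",
                  "([0-9]{2})([A-Z])\\.([0-9]{2})([A-Z])\\.",
                  "([^.]+)\\.([^.]+)$")

# Prototype: column names and types of the result (all character,
# so leading zeros such as "0982" are preserved).
proto <- data.frame(Project = character(), TSS = character(),
                    Participant = character(), Sample = character(),
                    Vial = character(), Portion = character(),
                    Analyte = character(), Plate = character(),
                    Center = character(), stringsAsFactors = FALSE)

parts <- strcapture(pattern, ids, proto)
parts <- cbind(SAMPLE_ID = ids, parts, stringsAsFactors = FALSE)
print(parts)
```

IDs that do not match the pattern come back as rows of NA, which makes malformed entries easy to spot.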
> write.table(as.matrix(dat$SAMPLE_ID |> gsub('\\.', '', x=_)), tmp <- tempfile(),
+ col.names=FALSE, row.names=FALSE, quote=FALSE)
>
> read.fwf(tmp, widths=c(4, 2, 4, 2, 1, 2, 1, 4, 2), header=FALSE,
+ col.names=c("Project", "TSS", "Participant", "Sample", "Vial",
+ "Portion", "Analyte", "Plate", "Center"))
Project TSS Participant Sample Vial Portion Analyte Plate Center
1 TCGA 13 1407 1 A 1 R 1565 13
2 TCGA 24 2254 1 A 1 R 1568 13
3 TCGA 24 982 1 A 1 R 1565 13
4 TCGA 24 1847 1 A 1 R 1566 13
5 TCGA 24 2289 1 A 1 R 1568 13
6 TCGA 31 1959 1 A 1 R 1568 13
You can probably read your file directly and skip the write.table step.
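If the IDs are already in memory, a temporary file may not be needed at all, since read.fwf also accepts a connection. A minimal sketch under that assumption (the names ids, flat, con and res are illustrative):

```r
ids <- c("TCGA.13.1407.01A.01R.1565.13", "TCGA.24.0982.01A.01R.1565.13")

# Drop the periods so that every field starts at a fixed offset.
flat <- gsub(".", "", ids, fixed = TRUE)

# Feed the strings to read.fwf through a text connection instead of a file.
con <- textConnection(flat)
res <- read.fwf(con, widths = c(4, 2, 4, 2, 1, 2, 1, 4, 2), header = FALSE,
                col.names = c("Project", "TSS", "Participant", "Sample", "Vial",
                              "Portion", "Analyte", "Plate", "Center"))
close(con)

# Digit-only columns are converted to numbers here, so leading zeros are lost.
print(res)
```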
If you rely on character columns and leading zeros, you can add some cleanup.
> read.fwf(tmp, widths=c(4, 2, 4, 2, 1, 2, 1, 4, 2), header=FALSE,
+ col.names=c("Project", "TSS", "Participant", "Sample", "Vial",
+ "Portion", "Analyte", "Plate", "Center")) |>
+ transform(TSS=sprintf('%02d', TSS),
+ Participant=sprintf('%04d', Participant),
+ Sample=sprintf('%02d', Sample),
+ Portion=sprintf('%02d', Portion),
+ Plate=as.character(Plate),
+ Center=as.character(Center))
Project TSS Participant Sample Vial Portion Analyte Plate Center
1 TCGA 13 1407 01 A 01 R 1565 13
2 TCGA 24 2254 01 A 01 R 1568 13
3 TCGA 24 0982 01 A 01 R 1565 13
4 TCGA 24 1847 01 A 01 R 1566 13
5 TCGA 24 2289 01 A 01 R 1568 13
6 TCGA 31 1959 01 A 01 R 1568 13
> unlink(tmp) ## unlink if no longer needed
Data:
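As a possible alternative to the sprintf cleanup, read.fwf forwards extra arguments to read.table, so colClasses = "character" should keep every field as text and preserve the leading zeros. A sketch, assuming the same widths as above:

```r
dat <- data.frame(SAMPLE_ID = c("TCGA.13.1407.01A.01R.1565.13",
                                "TCGA.24.0982.01A.01R.1565.13"),
                  stringsAsFactors = FALSE)

# Write the period-free IDs to a temporary file, one per line.
tmp <- tempfile()
writeLines(gsub(".", "", dat$SAMPLE_ID, fixed = TRUE), tmp)

# colClasses is passed through to read.table; all columns stay character.
res <- read.fwf(tmp, widths = c(4, 2, 4, 2, 1, 2, 1, 4, 2), header = FALSE,
                col.names = c("Project", "TSS", "Participant", "Sample", "Vial",
                              "Portion", "Analyte", "Plate", "Center"),
                colClasses = "character")
unlink(tmp)  # remove the temporary file

str(res)
```

This avoids having to know the width of each numeric field a second time in the sprintf formats.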
> dput(dat)
structure(list(SAMPLE_ID = c("TCGA.13.1407.01A.01R.1565.13",
"TCGA.24.2254.01A.01R.1568.13", "TCGA.24.0982.01A.01R.1565.13",
"TCGA.24.1847.01A.01R.1566.13", "TCGA.24.2289.01A.01R.1568.13",
"TCGA.31.1959.01A.01R.1568.13")), class = "data.frame", row.names = c(NA,
-6L))