Fast way to select rows of a table in R?

Gor*_*man 7 sql row r sqldf data.table

I'm looking for a fast way to extract a large number of rows from a much larger table. The top of my table looks like this:

> head(dbsnp)

      snp      gene distance
rs5   rs5     KRIT1        1
rs6   rs6   CYP51A1        1
rs7   rs7 LOC401387        1
rs8   rs8      CDK6        1
rs9   rs9      CDK6        1
rs10 rs10      CDK6        1

Dimensions:

> dim(dbsnp)
[1] 11934948        3

I want to select the rows whose row names appear in a list:

> head(features)
[1] "rs1367830" "rs5915027" "rs2060113" "rs1594503" "rs1116848" "rs1835693"

> length(features)
[1] 915635

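Not surprisingly, the straightforward way of doing this, temptable = dbsnp[features,], takes quite a while.

I've been looking into doing this through the sqldf package in R, which I thought might be faster. Unfortunately, I can't figure out how to select rows with certain row names in SQL.

Thanks.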

Jus*_*tin 10

A data.table solution:

library(data.table)
dbsnp <- structure(list(snp = c("rs5", "rs6", "rs7", "rs8", "rs9", "rs10"
), gene = c("KRIT1", "CYP51A1", "LOC401387", "CDK6", "CDK6", 
"CDK6"), distance = c(1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("snp", 
"gene", "distance"), class = "data.frame", row.names = c("rs5", 
"rs6", "rs7", "rs8", "rs9", "rs10"))

DT <- data.table(dbsnp, key='snp')
features <- c('rs5', 'rs7', 'rs9')
DT[features]

   snp      gene distance
1: rs5     KRIT1        1
2: rs7 LOC401387        1
3: rs9      CDK6        1
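One caveat with a keyed lookup like DT[features]: with the default nomatch = NA, any ID in features that is missing from the table comes back as a row of NAs. Below is a minimal sketch of dropping such misses with nomatch = 0L. The toy table and the ID rs999 are made up for illustration and are not from the question's data.

```r
library(data.table)

# Toy stand-in for the real table (illustrative values only)
DT <- data.table(
  snp      = c("rs5", "rs6", "rs7"),
  gene     = c("KRIT1", "CYP51A1", "LOC401387"),
  distance = 1L,
  key      = "snp"
)

# "rs999" is a hypothetical ID that does not exist in DT
features <- c("rs5", "rs999", "rs7")

# Default behaviour: one row per requested ID, NA-filled when there is no match
with_misses <- DT[features]

# nomatch = 0L keeps only the IDs that were actually found
hits <- DT[features, nomatch = 0L]
print(hits)
```

Comparing nrow(hits) with length(features) is a quick way to see how many requested IDs were absent from the table.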


shh*_*its 5

With sqldf you need to pass row.names = TRUE; the row names can then be queried as the row_names column:

library(sqldf)

## input

test<-read.table(header=TRUE,text="      snp      gene distance
rs5   rs5     KRIT1        1
rs6   rs6   CYP51A1        1
rs7   rs7 LOC401387        1
rs8   rs8      CDK6        1
rs9   rs9      CDK6        1
rs10 rs10      CDK6        1
")
features<-c("rs5","rs7","rs10")

## calculate

inVar <- toString(shQuote(features, type = "csh")) # 'rs5', 'rs7', 'rs10'

fn$sqldf("SELECT * FROM test t
          WHERE t.row_names IN ($inVar)"
           , row.names = TRUE)

## result
#      snp      gene distance
#rs5   rs5     KRIT1        1
#rs7   rs7 LOC401387        1
#rs10 rs10      CDK6        1

Update: alternatively, if fet is a data frame whose features column holds the items to look up:

fet <- data.frame(features)
sqldf("SELECT t.* FROM test t
          WHERE t.row_names IN (SELECT features FROM fet)"
           , row.names = TRUE)

Also, if the data are large enough, an index can speed things up. See the sqldf home page for this and other details.
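Because the snp column in the question duplicates the row names, the same lookup can also be written as a plain join on that column, which avoids the row.names handling altogether. This is a sketch under that assumption, not a benchmarked recommendation; the data frame name snps is made up here, and fet mirrors the update above.

```r
library(sqldf)

# Toy copy of the table from the question, without relying on row names
snps <- data.frame(
  snp      = c("rs5", "rs6", "rs7", "rs8", "rs9", "rs10"),
  gene     = c("KRIT1", "CYP51A1", "LOC401387", "CDK6", "CDK6", "CDK6"),
  distance = 1L,
  stringsAsFactors = FALSE
)

# IDs to look up, held in a one-column data frame
fet <- data.frame(features = c("rs5", "rs7", "rs10"), stringsAsFactors = FALSE)

# Inner join on the ID column; row order is not guaranteed without ORDER BY
res <- sqldf("SELECT t.* FROM snps t JOIN fet f ON t.snp = f.features")
print(res)
```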


42-*_*42- 4

The way most people would try this first is:

dbsnp[ rownames(dbsnp) %in% features, ]  # which is probably slower than your code

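Since you say this takes a long time, I suspect you have exceeded your RAM capacity and started using virtual memory. You should shut down your system, restart with R as the only running application, and see whether you can avoid "going virtual".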
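For reference, the %in% form above and a match()-based lookup return the same rows but in a different order, and object.size() gives a rough idea of how much memory the table occupies. This is a small sketch on toy data; the ID rs404 is a made-up miss, and none of it is timed against the full 12-million-row table.

```r
# Toy table with row names equal to the snp column, as in the question
dbsnp <- data.frame(
  snp      = c("rs5", "rs6", "rs7", "rs8", "rs9", "rs10"),
  gene     = c("KRIT1", "CYP51A1", "LOC401387", "CDK6", "CDK6", "CDK6"),
  distance = 1L,
  stringsAsFactors = FALSE
)
rownames(dbsnp) <- dbsnp$snp

# "rs404" is a hypothetical ID that is not in the table
features <- c("rs9", "rs5", "rs404")

# %in% keeps the rows in table order and silently ignores missing IDs
by_table_order <- dbsnp[rownames(dbsnp) %in% features, ]

# match() returns positions in the order of features; drop the NA for the miss
idx <- match(features, rownames(dbsnp))
by_feature_order <- dbsnp[idx[!is.na(idx)], ]

# Rough in-memory size of the table, useful when RAM is the suspect
print(format(object.size(dbsnp), units = "MB"))
```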