Merge in a loop in R

Question

I am using a for loop to merge multiple files with another file:

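library(data.table)

# full.names = TRUE returns paths that fread() can open from any working directory
files <- list.files("path", pattern = "\\.TXT$", ignore.case = TRUE, full.names = TRUE)

for (i in seq_along(files)) {
  data <- fread(files[i], header = TRUE)

  # Merge
  mydata <- merge(mydata, data, by = "ID", all.x = TRUE)

  rm(data)
}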

"mydata" looks like this (simplified):

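"data" looks like this (about 600 files, 100 GB in total). This is an example of 2 (separate) files. Combining everything into a single file is not possible (too large):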

When I run my code, I get the following dataset:

ID  x1  x2  x3.x    x3.y
1   2   8   8       NA
2   5   5   4       NA
3   4   4   NA      4
4   6   5   NA      5
5   5   8   NA      1

What I would like to get is:

ID  x1  x2  x3
1   2   8   8
2   5   5   4
3   4   4   4
4   6   5   5
5   5   8   1

ID is unique (it is never repeated across the 600 files).

Any ideas on how to achieve this as efficiently as possible would be much appreciated.

Answer 1

Jav*_*Jav 5

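This is better suited as a comment, but I cannot comment yet.

Wouldn't rbind work better than merge? That seems to be what you are trying to achieve.

Set the fill argument to TRUE to handle differing numbers of columns: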

asd <- data.table(x1 = c(1, 2), x2 = c(4, 5))
a <- data.table(x2 = 5)
rbind(asd, a, fill = TRUE)

   x1 x2
1:  1  4
2:  2  5
3: NA  5

Do this for the data files, then merge the result into mydata by ID.
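As a minimal sketch of that idea, the example below stacks two small tables and merges once. The contents of `mydata`, `file1`, and `file2` are hypothetical stand-ins chosen to resemble the question's tables, since the real inputs are not shown.

```r
library(data.table)

# Hypothetical stand-in for 'mydata' (the real columns are not shown in the question)
mydata <- data.table(ID = 1:5, x1 = c(2, 5, 4, 6, 5), x2 = c(8, 5, 4, 5, 8))

# Two hypothetical 'data' files, each covering different IDs
file1 <- data.table(ID = 1:2, x3 = c(8, 4))
file2 <- data.table(ID = 3:5, x3 = c(4, 5, 1))

# Stack the files first, so x3 stays a single column ...
stacked <- rbind(file1, file2, fill = TRUE)

# ... then merge once, instead of once per file
result <- merge(mydata, stacked, by = "ID", all.x = TRUE)
print(result)
```

Because the merge happens only once, there is no second x3 column for merge to disambiguate with the .x and .y suffixes.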

Update in response to the comments

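library(data.table)
library(plyr)

files <- list.files("path", pattern = "\\.TXT$", ignore.case = TRUE, full.names = TRUE)

# Read a single file into a data.table
ff <- function(input) {
  fread(input)
}

a <- lapply(files, ff)
# Stack all tables into one data frame; ldply fills missing columns with NA
binded.data <- ldply(a, function(x) rbind(x, fill = TRUE))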

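This creates a function that reads a file and passes it to lapply, so you get a list containing all the data files, each in its own data frame.

ldply from plyr then row-binds all of those data frames into a single data frame.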
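As a possible alternative to ldply (not what the answer itself uses), data.table's rbindlist can stack the list directly and returns a data.table. The list `a` below is a hypothetical stand-in for the result of `lapply(files, ff)`, and the extra column `x4` is invented to show the fill behaviour.

```r
library(data.table)

# Hypothetical list standing in for the result of lapply(files, ff)
a <- list(
  data.table(ID = 1:2, x3 = c(8, 4)),
  data.table(ID = 3:5, x3 = c(4, 5, 1), x4 = c(7, 7, 7))
)

# rbindlist() stacks the list and fills columns missing from a table with NA
binded.data <- rbindlist(a, fill = TRUE)

# Key by ID for later joins (modifies binded.data by reference)
setkey(binded.data, ID)
print(binded.data)
```

This avoids the round trip through a plain data frame, which may matter with inputs of this size.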

Leave mydata untouched.

binded.data <- data.table(binded.data, key = "ID")

Depending on what your mydata looks like, you will need a different merge command. See: https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html
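To illustrate how the choice of merge changes the result, here is a small sketch comparing a left join and an inner join. Both tables are hypothetical; ID 9 and ID 4 are included only to show unmatched rows on each side.

```r
library(data.table)

# Hypothetical tables; ID 4 has no match in binded.data, ID 9 has none in mydata
mydata <- data.table(ID = 1:4, x1 = c(2, 5, 4, 6))
binded.data <- data.table(ID = c(1L, 2L, 3L, 9L), x3 = c(8, 4, 4, 0))

# Left join: keep every row of mydata, x3 is NA where no match exists
left <- merge(mydata, binded.data, by = "ID", all.x = TRUE)

# Inner join: keep only IDs present in both tables
inner <- merge(mydata, binded.data, by = "ID")

# The same left join written with data.table's X[Y] syntax
left2 <- binded.data[mydata, on = "ID"]
```

For the output asked for in the question, the left join looks like the closest fit, since it keeps every ID of mydata.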

Update 2

files <- list.files("path", pattern = "\\.TXT$", ignore.case = TRUE, full.names = TRUE)

ff <- function(input) {
  data <- fread(input)
  # Keep only the rows of 'data' whose ID matches an ID in 'mydata'
  data[ID %in% mydata[, ID]]
}

a <- lapply(files, ff)
library(plyr)
binded.data <- ldply(a, function(x) rbind(x, fill = TRUE))

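Update 3

You can add cat to see which file the function is currently reading, so you can tell at which file you run out of memory. That will give you an idea of how many files you can read in one go.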

ff <- function(input) {
  # Print the name of the file currently being read
  cat(input, "\n")
  data <- fread(input)
  # Keep only the rows of 'data' whose ID matches an ID in 'mydata'
  data[ID %in% mydata[, ID]]
}

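Once you know roughly how many files fit in memory, one way to use that number is to read the files in batches. The sketch below is only an illustration under assumptions: the temporary files, the `batch_size` value, and the contents of `mydata` are all made up so that the example runs on its own.

```r
library(data.table)

# Hypothetical setup: write three small files to a temporary directory
tmp_dir <- tempfile("batches")
dir.create(tmp_dir)
fwrite(data.table(ID = 1:2, x3 = c(8, 4)), file.path(tmp_dir, "a.TXT"))
fwrite(data.table(ID = 3:4, x3 = c(4, 5)), file.path(tmp_dir, "b.TXT"))
fwrite(data.table(ID = 5:6, x3 = c(1, 9)), file.path(tmp_dir, "c.TXT"))

mydata <- data.table(ID = 1:5, x1 = c(2, 5, 4, 6, 5))
files <- list.files(tmp_dir, pattern = "\\.TXT$", ignore.case = TRUE, full.names = TRUE)

# Assumption: tune batch_size to whatever fits in memory on your machine
batch_size <- 2
batches <- split(files, ceiling(seq_along(files) / batch_size))

# Read one batch at a time, keeping only rows whose ID occurs in mydata
kept <- lapply(batches, function(batch) {
  rbindlist(lapply(batch, function(f) {
    cat(f, "\n")  # show which file is being read
    fread(f)[ID %in% mydata$ID]
  }), fill = TRUE)
})

# Combine the filtered batches into one table
binded.data <- rbindlist(kept, fill = TRUE)
```

Filtering each file before stacking keeps only the rows that can match mydata, so the combined table should stay much smaller than the raw files.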