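I'm a beginner R user. I have two huge data frames, and I want to add a new column called vaccine to hkdata.2. Its values should be looked up in another data frame, adherence, using two key columns of hkdata.2 (hhID and member). Can anyone help me?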
hkdata.2
hhID member T0 delta X_hh X_fm ILI age
1 1 7 0 0 0 0 44
1 2 7 0 0 0 0 36
2 1 8 0 1 0 0 39
2 2 8 0 1 0 0 39
adherence
hhID member mask soap vaccine
1 0 1 0 1
1 1 1 1 1
1 2 0 0 1
2 0 1 0 0
2 1 0 0 0
2 2 1 0 1
So in the end I would get something like this, with a new column called vaccine added to hkdata.2:
hkdata.2
hhID member T0 delta X_hh X_fm ILI age vaccine
1 1 7 0 0 0 0 44 1
1 2 7 0 0 0 0 36 1
2 1 8 0 1 0 0 39 0
2 2 8 0 1 0 0 39 1
Update: this uses the on= syntax from v1.9.6. See the edit history for the older code.
require(data.table) # v1.9.6+
setDT(hkdata.2)[setDT(adherence), vaccine := i.vaccine, on=c("hhID", "member")]
# hhID member T0 delta X_hh X_fm ILI age vaccine
# 1: 1 1 7 0 0 0 0 44 1
# 2: 1 2 7 0 0 0 0 36 1
# 3: 2 1 8 0 1 0 0 39 0
# 4: 2 2 8 0 1 0 0 39 1
setDT converts a data.frame to a data.table by reference.
The join is performed on the columns specified by on=.
Note that this join is both a) fast and b) memory efficient.
a) Fast, because these are binary-search-based joins and no copy is made here at all. The vaccine column is added directly to your hkdata.2 data.table.
b) Memory efficient, because only the column you need, vaccine, is used in the join, not the other columns (this is especially nice on really large datasets).
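To make the update join easier to try out, here is a minimal sketch that rebuilds the two small tables from the question and runs the same call on them. It assumes data.table v1.9.6 or later is installed; the values are copied from the question, and nothing here is tuned for large data.

```r
library(data.table)  # assumption: data.table v1.9.6+ is installed

# Sample data copied from the question
hkdata.2 <- data.frame(
  hhID   = c(1, 1, 2, 2),
  member = c(1, 2, 1, 2),
  T0     = c(7, 7, 8, 8),
  delta  = 0,
  X_hh   = c(0, 0, 1, 1),
  X_fm   = 0,
  ILI    = 0,
  age    = c(44, 36, 39, 39)
)
adherence <- data.frame(
  hhID    = c(1, 1, 1, 2, 2, 2),
  member  = c(0, 1, 2, 0, 1, 2),
  mask    = c(1, 1, 0, 1, 0, 1),
  soap    = c(0, 1, 0, 0, 0, 0),
  vaccine = c(1, 1, 1, 0, 0, 1)
)

# Update join: for each row of adherence, find the rows of hkdata.2 with the
# same (hhID, member) and assign adherence's vaccine value to them in place.
# The i. prefix refers to columns of the table inside the brackets (adherence).
setDT(hkdata.2)[setDT(adherence), vaccine := i.vaccine, on = c("hhID", "member")]

print(hkdata.2)
```

Rows of adherence with member 0 have no counterpart in hkdata.2, so they are simply ignored. Because setDT and := both work by reference, hkdata.2 itself is modified and no assignment of the result is needed.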
Here is a benchmark, assuming 100,000 hhIDs and 200 members per hhID:
require(data.table) # v1.9.6
require(dplyr) # 0.4.3.9000
set.seed(98192L)
N = 40e6 # 40 million rows
hkdata.2 = data.frame(hhID = rep(1:1e5, each=200),
member = 1:200,
T0 = sample(10),
delta = sample(0:1),
X_hh = sample(0:1),
X_fm = sample(0:1),
ILI = sample(0:1),
age = sample(30:100, N/2, TRUE))
# let's go with 100,000 hhIDs and 400 members here:
adherence = data.frame(hhID = rep(1:1e5, each=400),
member = 1:400,
mask = sample(0:1),
soap = sample(0:1),
vaccine = sample(0:1))
## dplyr timing
system.time(ans1 <- left_join(hkdata.2, select(adherence, -soap, -mask)))
# user system elapsed
# 16.977 2.163 19.605
## data.table timing
system.time(setDT(hkdata.2)[setDT(adherence), vaccine := i.vaccine, on=c("hhID", "member")])
# user system elapsed
# 1.186 0.233 1.427
dplyr had a peak memory usage of 4.7GB and took 19.6 seconds to complete, whereas data.table took 1.4 seconds with a peak memory usage of 2.2GB.
Summary:
data.table is ~14x faster and ~2.1x more memory efficient here.
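As a sanity check on small data, the same lookup can also be written in base R with merge(). This is only a sketch for comparison and was not part of the benchmark. The names hk, adh and merged are made up for illustration, and hk keeps only a few of the question's columns. Unlike the by-reference join, merge() builds a new data frame, so it should not be expected to match the memory figures above.

```r
# Hypothetical small tables mirroring the question's data (names are illustrative)
hk <- data.frame(
  hhID   = c(1, 1, 2, 2),
  member = c(1, 2, 1, 2),
  age    = c(44, 36, 39, 39)
)
adh <- data.frame(
  hhID    = c(1, 1, 1, 2, 2, 2),
  member  = c(0, 1, 2, 0, 1, 2),
  mask    = c(1, 1, 0, 1, 0, 1),
  soap    = c(0, 1, 0, 0, 0, 0),
  vaccine = c(1, 1, 1, 0, 0, 1)
)

# Left join on the two key columns, carrying over only the vaccine column.
# all.x = TRUE keeps every row of hk even when adh has no match (vaccine is NA then).
merged <- merge(
  hk,
  adh[, c("hhID", "member", "vaccine")],
  by    = c("hhID", "member"),
  all.x = TRUE
)

# merge() may reorder rows, so sort explicitly before comparing results
merged <- merged[order(merged$hhID, merged$member), ]
print(merged)
```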