Tom*_*Tom 3 merge r data.table fuzzyjoin
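I have two data sets that I want to merge. From this link: “Fuzzy” and non-fuzzy, many-to-1 merge with data.table, I learned that when there is no direct match I can merge these data.tables on the closest available year, as follows: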
library(data.table)
dfA <- fread("
A B C D E F G Z iso year matchcode
1 0 1 1 1 0 1 0 NLD 2010 NLD2010
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2010 AUS2010
4 1 0 1 0 0 1 0 AUS 2006 AUS2006
5 0 1 0 1 0 1 1 USA 2008 USA2008
6 0 0 1 0 0 0 1 USA 2010 USA2010
7 0 1 0 1 0 0 0 USA 2012 USA2012
8 1 0 1 0 0 1 0 BLG 2008 BLG2008
9 0 1 0 1 1 0 1 BEL 2008 BEL2008
10 1 0 1 0 0 1 0 BEL 2010 BEL2010
11 0 1 1 1 0 1 0 NLD 2010 NLD2010
12 1 0 0 0 1 0 1 NLD 2014 NLD2014
13 0 0 0 1 1 0 0 AUS 2010 AUS2010
14 1 0 1 0 0 1 0 AUS 2006 AUS2006
15 0 1 0 1 0 1 1 USA 2008 USA2008
16 0 0 1 0 0 0 1 USA 2010 USA2010
17 0 1 0 1 0 0 0 USA 2012 USA2012
18 1 0 1 0 0 1 0 BLG 2008 BLG2008
19 0 1 0 1 1 0 1 BEL 2008 BEL2008
20 1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE)
dfB <- fread("
A B C D H I J K iso year matchcode
1 0 1 1 1 0 1 0 NLD 2009 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2018
3 0 0 0 1 1 0 0 AUS 2011 AUS2011
4 1 0 1 0 0 1 0 AUS 2007 AUS2007
5 0 1 0 1 0 1 1 USA 2007 USA2007
6 0 0 1 0 0 0 1 USA 2010 USA2010
7 0 1 0 1 0 0 0 USA 2013 USA2013
8 1 0 1 0 0 1 0 BLG 2007 BLG2007
9 0 1 0 1 1 0 1 BEL 2009 BEL2009
10 1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE)
#change the name of the matchcode-column
setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))
#store the column order so it can be restored at the end
namesA <- as.character( names( dfA ) )
namesB <- as.character( setdiff( names(dfB), names(dfA) ) )
colorder <- c(namesA, namesB)
#create columns to join on
dfA[, `:=`(iso.join = isoA, year.join = yearA)]
dfB[, `:=`(iso.join = isoB, year.join = yearB)]
#perform left join
result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]
#drop columns that are not needed
result[, grep("^i\\.", names(result)) := NULL ]
result[, grep("join$", names(result)) := NULL ]
#set column order
setcolorder(result, colorder)
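I have two questions about this.
1) EDIT: This question was the result of a typo.
2) NLD 2014 in dfA is matched with NLD 2018 in dfB. What should I do if I think a four-year difference is too large and I want to limit it to two years?
In other words: what should I do when I want to limit the number of years allowed between dfA and dfB?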
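You have two options:
1) Use roll = 2 or roll = -2, which requires the closest match to be within 2 years, but in one direction only. A positive value carries an earlier dfB year forward; a negative value carries a later dfB year backward.
2) Make it an explicit non-equi join.
#perform left join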
result <- dfB[dfA, on = c("iso.join", "year.join"), roll = 2 ]
# or
result <- dfB[dfA, on = c("iso.join", "year.join"), roll = -2 ]
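To make the direction of a limited roll concrete, here is a minimal sketch on toy data. The tables `x` and `i` and the `label` column are hypothetical and not part of the question's data; they only illustrate which side of the window each sign of `roll` looks at.

```r
library(data.table)

# Hypothetical toy data: x plays the role of dfB, i the role of dfA
x <- data.table(iso = "NLD", year = c(2009L, 2018L), label = c("x2009", "x2018"))
i <- data.table(iso = "NLD", year = c(2010L, 2014L, 2017L))

# roll = 2: an earlier x year is carried forward by at most 2 years
fwd <- x[i, on = c("iso", "year"), roll = 2]

# roll = -2: a later x year is carried backward by at most 2 years
bwd <- x[i, on = c("iso", "year"), roll = -2]

fwd$label  # 2010 picks up x2009; 2014 and 2017 have no earlier x year within 2 years
bwd$label  # only 2017 picks up x2018; 2010 and 2014 have no later x year within 2 years
```

Neither call looks in both directions, which is why a two-sided limit needs a different approach.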
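The non-equi join requires additional work from you, because it does not take a roll = 'nearest' argument: a row of dfA can match several rows of dfB inside the window, so you need to either use mult = 'first' or apply a filter in a subsequent step.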
dfA[, `:=`(min_year.join = yearA - 2,
           max_year.join = yearA + 2)]
result <- dfB[dfA,
              on = .(iso.join,
                     year.join <= max_year.join,
                     year.join >= min_year.join)
              #, mult = 'first'
              ]
#drop columns that are not needed
result[, grep("^i\\.", names(result)) := NULL ]
result[, grep("join", names(result)) := NULL ] #removed $
#set column order
setcolorder(result, colorder)
result
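As an alternative to the non-equi join, one could keep the original roll = "nearest" join and blank out matches that are too far away afterwards. This is a different technique from the one shown above, sketched here on hypothetical toy tables; `max_gap` and `gap` are illustrative names, not part of the question's code.

```r
library(data.table)

# Hypothetical toy data: x plays the role of dfB, i the role of dfA
x <- data.table(iso = "NLD", yearB = c(2009L, 2018L), H = c(1L, 0L))
i <- data.table(iso = "NLD", yearA = c(2010L, 2014L))

# Helper column to join on, so yearA and yearB both survive the join
x[, year.join := yearB]
i[, year.join := yearA]

# Nearest-year match in either direction, with no distance limit yet
res <- x[i, on = c("iso", "year.join"), roll = "nearest"]

# Distance between the requested year and the matched year
max_gap <- 2L
res[, gap := abs(yearA - yearB)]

# Treat matches further away than max_gap as "no match"
res[gap > max_gap, c("yearB", "H") := list(NA_integer_, NA_integer_)]

# Drop the helper columns
res[, c("year.join", "gap") := NULL]
res
```

Here 2010 keeps its match to 2009, while 2014 is first matched to 2018 (4 years away) and then reset to NA. The columns taken from the dfB side have to be listed by hand in the NA assignment, which is the main cost of this approach.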