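I am trying to do a complex non-equi join between two tables. I was inspired by a presentation at the recent useR! 2016 conference (https://channel9.msdn.com/events/useR-international-R-User-conference/useR2016/Efficient-in-memory-non-equi-joins-using-datatable), which led me to believe this would be a suitable task for data.table. My table 1 looks like: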
library(data.table)
sp <- c("SAB","SAB","SAB","SAB","EPN","EPN","BOP","BOP","BOP","BOP","BOP","PET","PET","PET")
dbh <- c(10,12,16,22,12,16,10,12,14,20,26,12,16,18)
dt1 <- data.table(sp,dbh)
dt1
sp dbh
1: SAB 10
2: SAB 12
3: SAB 16
4: SAB 22
5: EPN 12
6: EPN 16
7: BOP 10
8: BOP 12
9: BOP 14
10: BOP 20
11: BOP 26
12: PET 12
13: PET 16
14: PET 18
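This is a list of trees with their dbh. My second table (below) is a general table that gives, for each tree species, a range of dbh values used to classify trees into size classes: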
gr_sp <- c("RES","RES","RES","RES","RES","RES", "DEC", "DEC", "DEC", "DEC", "DEC", "DEC")
sp <- c("SAB","SAB", "SAB", "EPN", "EPN", "EPN", "BOP", "BOP", "BOP", "PET", "PET", …

I have two large datasets, df1 and df2. The first dataset, df1, contains the columns 'ID' and 'actual.date'.
df1 <- data.frame(ID=c(1,1,1,2,3,4,4), actual.date=c('10/01/1997','2/01/1998','5/01/2002','7/01/1999','9/01/2005','5/01/2006','2/03/2003'));
dcis <- grep('date$',names(df1));
df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y');
df1;
ID actual.date
1 1 1997-10-01
2 1 1998-02-01
3 1 2002-05-01
4 2 1999-07-01
5 3 2005-09-01
6 4 2006-05-01
7 4 2003-02-03
The second dataset, df2, contains two date fields, 'before.date' and 'after.date', which represent the start date and end date respectively:
df2 <- data.frame(ID=c(1,1,1,2,3,4,4,4), before.date=c('10/1/1996','1/1/1998','1/1/2000','1/1/2001','1/1/2001','1/1/2001','10/1/2004','10/3/2004'), after.date=c('12/1/1996','9/30/2003','12/31/2004','3/31/2006','9/30/2006','9/30/2005','12/30/2004','11/28/2004') );
dcis <- grep('date$',names(df2));
df2[dcis] <- lapply(df2[dcis],as.Date,'%m/%d/%Y');
df2;
ID before.date after.date
1 1 1996-10-01 1996-12-01
2 1 1998-01-01 2003-09-30
3 1 2000-01-01 2004-12-31
4 2 2001-01-01 2006-03-31
5 3 2001-01-01 2006-09-30
6 4 2001-01-01 2005-09-30
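7 4 2004-10-01 2004-12-30 …

While answering this question last night, I spent an hour trying to find a solution that does not grow a data.frame inside a for loop, without any success, so I am curious whether there is a better way to approach this problem.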
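The general case of the problem boils down to the following: there are two data.frames; an entry in one data.frame can have 0 or more matching entries in the other; and matching involves multiple columns in the data.frames. For a concrete example, I will use data similar to the linked question: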
genes <- data.frame(gene = letters[1:5],
chromosome = c(2,1,2,1,3),
start = c(100, 100, 500, 350, 321),
end = c(200, 200, 600, 400, 567))
markers <- data.frame(marker = 1:10,
chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
position = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))
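One loop-free way to approach this kind of problem is to build every candidate pair with merge() and then filter with a vectorized condition. The sketch below assumes, purely for illustration, that a marker matches a gene when both are on the same chromosome and the marker position lies within the gene's start and end (inclusive). That criterion is an assumption, and the question's actual matching function may differ.

```r
# Same example data as in the question, repeated so the sketch is self-contained
genes <- data.frame(gene = letters[1:5],
                    chromosome = c(2, 1, 2, 1, 3),
                    start = c(100, 100, 500, 350, 321),
                    end = c(200, 200, 600, 400, 567),
                    stringsAsFactors = FALSE)
markers <- data.frame(marker = 1:10,
                      chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                      position = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))

# Step 1: pair every gene with every marker on the same chromosome
candidates <- merge(genes, markers, by = "chromosome")

# Step 2: keep only pairs that satisfy the (assumed) containment criterion
inside <- candidates$position >= candidates$start &
  candidates$position <= candidates$end
matches <- candidates[inside, c("gene", "marker", "chromosome", "position")]

# Gene b pairs with markers 1, 5 and 9; gene e pairs with marker 6
print(matches)
```

The intermediate candidates table can get large when many entries share a key, so this trades memory for the absence of an explicit loop.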
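Our complex matching function:
# matching criteria, applies to a single entry from each data.frame …

I have a vector of numeric elements, and a data frame with two columns that define the start and end points of intervals. Each row in the data frame is one interval. I want to find out which interval each element of the vector belongs to.
Here is some sample data: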
# Find which interval that each element of the vector belongs in
library(tidyverse)
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
intervals <- tribble(~phase, ~start, ~end,
"a", 0, 0.5,
"b", 1, 1.9,
"c", 2, 2.5)
The same sample data for people who object to the tidyverse:
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
intervals <- structure(list(phase = c("a", "b", "c"),
start = c(0, 1, 2),
end = c(0.5, 1.9, 2.5)),
.Names = c("phase", "start", "end"),
row.names = c(NA, -3L),
class = "data.frame")
Here is one way to do it:
library(intrval)
phases_for_elements …
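
For comparison, here is a minimal base R sketch that does not rely on intrval. It assumes the intervals are closed on both ends and do not overlap, and it returns NA for elements that fall in no interval. The names interval_index and phases are hypothetical, introduced only for this illustration.

```r
# Same sample data as above, repeated so the sketch is self-contained
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
intervals <- data.frame(phase = c("a", "b", "c"),
                        start = c(0, 1, 2),
                        end = c(0.5, 1.9, 2.5),
                        stringsAsFactors = FALSE)

# For each element, find the row of the first interval that contains it
# (assumption: closed intervals, start <= x <= end); NA if none does
interval_index <- vapply(elements, function(x) {
  hit <- which(x >= intervals$start & x <= intervals$end)
  if (length(hit) == 0L) NA_integer_ else hit[1L]
}, integer(1))

# Look up the phase label; an NA index yields an NA label
phases <- intervals$phase[interval_index]
print(phases)
# "a" "a" "a" NA  "b" "b" "c"
```

The element 0.9 falls in the gap between intervals "a" and "b", which is why it maps to NA.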