Eco*_*bie 5 r igraph data.table
I have a data frame that looks like this:
Name Start_Date End_Date
A 2015-01-01 2019-12-29
A 2017-03-25 NA
A 2019-10-17 NA
A 2012-04-16 2015-01-09
A 2002-06-01 2006-02-01
A 2005-12-24 NA
B 2018-01-23 NA
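I want to create a column in which two observations are classified into the same group if they have the same Name and the Start_Date of one observation is within ±1 year of the End_Date of the other.
Desired output: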
Name Start_Date End_Date Wanted
A 2015-01-01 2019-12-29 1
A 2017-03-25 NA NA
A 2019-10-17 NA 1
A 2012-04-16 2015-01-09 1
A 2002-06-01 2006-02-01 2
A 2005-12-24 NA 2
B 2018-01-23 NA NA
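I am looking for a data.table solution, but anything that solves my problem will do.
Addition: row-by-row explanation
So rows 1, 3, and 4 are in the same group. Rows 5 and 6 are in the same group. Rows 2 and 7 have no group.
Edit: I have updated my code so that Wanted has a consistent category when an observation does not match any other observation.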
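Here is a data.table solution, as you preferred:
I'd prefer a data.table solution, but any solution is much appreciated!
While dplyr and fuzzyjoin may appear more elegant, they may also prove less efficient for sufficiently large datasets.
Credit goes to ThomasIsCoding for beating me to the punch on this other question, with an answer that applies igraph to index the networks in a graph. Here, the networks are the separate "chains" (groups, given in Wanted) composed of "links" (data.frame rows), which are connected by their "closeness" (between their Start_Dates and End_Dates). Such an approach seems necessary to model the transitive relationship requested here
I am trying to create chains of "close" links, so that I can map the movement of A over time.
Care is also taken to preserve the symmetry of that relationship (see Further Reading).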
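To make the idea of indexing "chains" concrete, here is a minimal sketch that uses igraph on its own. The edge list is an assumption: it is my reading of the three pairings that the sample data produces (row 1 with row 4, row 3 with row 1, row 6 with row 5), typed by hand rather than computed by a JOIN. The names links, link_graph, chain_index, and same_chain are hypothetical.

```r
library(igraph)

# Hand-typed pairings (assumed for illustration): each row of 'links' says that
# the Start_Date of 'from' is within one year of the End_Date of 'to'.
links <- data.frame(from = c(1, 3, 6), to = c(4, 1, 5))

# Build an undirected graph, so the direction of each pairing is ignored.
link_graph <- igraph::graph_from_data_frame(links, directed = FALSE)

# Label each connected component ("chain") with an index, keyed by row ID.
comps <- igraph::components(link_graph)
chain_index <- setNames(as.vector(comps$membership),
                        igraph::V(link_graph)$name)

# Helper: do two row IDs belong to the same chain?
same_chain <- function(a, b) {
  unname(chain_index[as.character(a)] == chain_index[as.character(b)])
}

same_chain(3, 4)  # rows 3 and 4 are linked only through row 1
same_chain(1, 6)  # rows 1 and 6 share no path
```

Rows 3 and 4 never pair with each other directly, yet they land in one component because both are linked to row 1. That is the transitive behaviour the grouping relies on.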
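Per that same request
So ideally, I would like to flag cases where one observation's start date (2016-01-01) is "fuzzy grouped" with two different end dates (2015-01-02 and 2016-12-31), and vice versa.
and your further clarification
...I would want another column indicating [flag].
I have also included a Flag column, to flag every row whose Start_Date is matched by the End_Dates of at least flag_at other rows, or vice versa.
Using your sample data.frame, reproduced here as my_data_frame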
# Generate dataset as data.frame.
my_data_frame <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "B"),
Start_Date = structure(c(16436, 17250, 18186, 15446, 11839, 13141, 17554),
class = "Date"),
End_Date = structure(c(18259, NA, NA, 16444, 13180, NA, NA),
class = "Date")),
row.names = c(NA, -7L),
class = "data.frame")
we apply data.table and igraph (among other packages) as follows:
library(tidyverse)
library(data.table)
library(lubridate)
library(igraph)
# ...
# Code to generate your data.frame 'my_data_frame'.
# ...
# Treat dataset as a data.table.
my_data_table <- my_data_frame %>% data.table::as.data.table()
# Define the tolerance threshold as a (lubridate) "period": 1 year.
tolerance <- lubridate::years(1)
# Set the minimum number of matches for a row to be flagged: 2.
flag_at <- 2
#####################################
# BEGIN: Start Indexing the Groups. #
#####################################
# Begin indexing the "chain" (group) to which each "link" (row) belongs:
output <- my_data_table %>%
########################################################
# STEP 1: Link the Rows That Are "Close" to Each Other #
########################################################
# Prepare data.table for JOIN, by adding appropriate helper columns.
.[, `:=`(# Uniquely identify each row (by row number).
ID = .I,
# Boundary columns for tolerance threshold.
End_Low = End_Date - tolerance,
End_High = End_Date + tolerance)] %>%
# JOIN rows to each other, to obtain pairings.
.[my_data_table,
# Clearly describe the relation R: x R y whenever the 'Start_Date' of x is
# close enough to (within the boundary columns for) the 'End_Date' of y.
.(x.ID = i.ID, x.Name = i.Name, x.Start_Date = i.Start_Date, x.End_Date = i.End_Date,
y.End_Low = x.End_Low, y.End_High = x.End_High, y.ID = x.ID, y.Name = x.Name),
# JOIN criteria:
on = .(# Only pair rows having the same name.
Name,
# Only pair rows whose start and end dates are within the tolerance
# threshold of each other.
End_Low <= Start_Date,
End_High >= Start_Date),
# Make it an OUTER JOIN, to include those rows without a match.
nomatch = NA] %>%
# Prepare pairings for network analysis.
.[# Ensure no row is reflexively paired with itself.
# NOTE: This keeps the graph clean by trimming extraneous loops, and it
# prevents an "orphan" row from contributing to its own tally of matches.
!(x.ID == y.ID) %in% TRUE,
# Simplify the dataset to only the pairings (by ID) of linked rows.
.(from = x.ID, to = y.ID)]
#############################
# PAUSE: Count the Matches. #
#############################
# Count how many times each row has its 'End_Date' matched by a 'Start_Date'.
my_data_table$End_Matched <- output %>%
# Include again the missing IDs for y that were never matched by the JOIN.
.[my_data_table[, .(ID)], on = .(to = ID)] %>%
# For each row y, count every other row x where x R y.
.[, .(Matches = sum(!is.na(from))), by = to] %>%
# Extract the count column.
.$Matches
# Count how many times each row has its 'Start_Date' matched by an 'End_Date'.
my_data_table$Start_Matched <- output %>%
# For each row x, count every other row y where x R y.
.[, .(Matches = sum(!is.na(to))), by = from] %>%
# Extract the count column.
.$Matches
#########################################
# RESUME: Continue Indexing the Groups. #
#########################################
# Resume indexing:
output <- output %>%
# Ignore nonmatches (NAs) which are annoying to process into a graph.
.[from != to, ] %>%
###############################################################
# STEP 2: Index the Separate "Chains" Formed By Those "Links" #
###############################################################
# Convert pairings (by ID) of linked rows into an undirected graph.
igraph::graph_from_data_frame(directed = FALSE) %>%
# Find all groups (subgraphs) of transitively linked IDs.
igraph::components() %>%
# Pair each ID with its group index.
igraph::membership() %>%
# Tabulate those pairings...
utils::stack() %>% utils::type.convert(as.is = TRUE) %>%
# ...in a properly named data.table.
data.table::as.data.table() %>% .[, .(ID = ind, Group_Index = values)] %>%
#####################################################
# STEP 3: Match the Original Rows to their "Chains" #
#####################################################
# LEFT JOIN (on ID) to match each original row to its group index (if any).
.[my_data_table, on = .(ID)] %>%
# Transform output into final form.
.[# Sort into original order.
order(ID),
.(# Select existing columns.
Name, Start_Date, End_Date,
# Rename column having the group indices.
Wanted = Group_Index,
# Calculate column(s) to flag rows with sufficient matches.
Flag = (Start_Matched >= flag_at) | (End_Matched >= flag_at))]
# View results.
output
The resulting output is the following data.table:
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 1 FALSE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 1 FALSE
4: A 2012-04-16 2015-01-09 1 FALSE
5: A 2002-06-01 2006-02-01 2 FALSE
6: A 2005-12-24 <NA> 2 FALSE
7: B 2018-01-23 <NA> NA FALSE
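Keep in mind that the Flags are all FALSE only because your data lacks any Start_Date that matches (at least) two End_Dates, as well as any End_Date that matches (at least) two Start_Dates.
Hypothetically, if we lowered flag_at to 1, then output would Flag every row having even a single match (in either direction):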
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 1 TRUE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 1 TRUE
4: A 2012-04-16 2015-01-09 1 TRUE
5: A 2002-06-01 2006-02-01 2 TRUE
6: A 2005-12-24 <NA> 2 TRUE
7: B 2018-01-23 <NA> NA FALSE
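As a rough cross-check of the two Flag tables, the match counts can be sketched in base R without any JOIN. This is only an illustration under stated assumptions: it covers the six rows named A (so the Name condition is skipped), and it approximates "one year" as 365 days rather than a lubridate period. The variable names are hypothetical.

```r
# Dates for the six "A" rows of the sample data.
start_date <- as.Date(c("2015-01-01", "2017-03-25", "2019-10-17",
                        "2012-04-16", "2002-06-01", "2005-12-24"))
end_date   <- as.Date(c("2019-12-29", NA, NA,
                        "2015-01-09", "2006-02-01", NA))

# is_close[i, j] is TRUE when the Start_Date of row i lies within 365 days
# (an approximation of one year) of the End_Date of row j.
is_close <- outer(seq_along(start_date), seq_along(end_date),
                  function(i, j) {
                    abs(as.numeric(start_date[i] - end_date[j])) <= 365
                  })
is_close[is.na(is_close)] <- FALSE  # a missing End_Date matches nothing
diag(is_close) <- FALSE             # a row never matches itself

# Tally matches in both directions.
start_matched <- rowSums(is_close)  # End_Dates matched by each Start_Date
end_matched   <- colSums(is_close)  # Start_Dates matching each End_Date

flag_at_2 <- (start_matched >= 2) | (end_matched >= 2)
flag_at_1 <- (start_matched >= 1) | (end_matched >= 1)
```

With this data, flag_at_2 is FALSE for every row, while flag_at_1 is TRUE for every row except row 2, in line with the two tables shown for the A rows.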
Because certain data.table operations modify by reference (or "in place"), the value of my_data_table changes throughout the workflow. After Step 1, my_data_table becomes
Name Start_Date End_Date ID End_Low End_High
1: A 2015-01-01 2019-12-29 1 2018-12-29 2020-12-29
2: A 2017-03-25 <NA> 2 <NA> <NA>
3: A 2019-10-17 <NA> 3 <NA> <NA>
4: A 2012-04-16 2015-01-09 4 2014-01-09 2016-01-09
5: A 2002-06-01 2006-02-01 5 2005-02-01 2007-02-01
6: A 2005-12-24 <NA> 6 <NA> <NA>
7: B 2018-01-23 <NA> 7 <NA> <NA>
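This departs from the structure of my_data_frame, from which it was originally copied.
Since dplyr (among other packages) assigns by value rather than by reference, a dplyr solution would sidestep this issue entirely.
As it is, however, you must take care when modifying the workflow, because the version of my_data_table that was available before Step 1 cannot be recovered afterwards.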
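If the earlier state matters, one option is to take an explicit snapshot with data.table::copy() before any := call. The sketch below uses a toy table, and the names original, alias, and snapshot are hypothetical. It only illustrates the reference semantics and is not a change to the workflow above.

```r
library(data.table)

# A toy table standing in for 'my_data_table'.
original <- data.table::data.table(ID = 1:3)

# Plain assignment does not copy: 'alias' points at the same table.
alias <- original

# copy() makes an independent snapshot that later := calls cannot touch.
snapshot <- data.table::copy(original)

# Adding a column by reference through 'alias' also changes 'original'.
alias[, Extra := ID * 2L]

names(original)  # now includes "Extra"
names(snapshot)  # still just "ID"
```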
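Although JOINing data.tables is explicitly directional, with a "right side" and a "left side", this model manages to preserve the symmetry of the relationship you described here
if ... [either one's] 'Start_Date' is within +- 1 year of another observation's 'End_Date', then they are classified as the same group.
through the use of an undirected graph.
When the JOIN relates the 1st row x (having a Start_Date of 2015-01-01) to the 4th row y (having an End_Date of 2015-01-09), we gather that the Start_Date of x is "close enough" (within 1 year) to the End_Date of y. So we say mathematically that x R y, or
"x is in the same group as y".
However, the converse y R x will not necessarily appear in the JOINed data, because the Start_Date of y might not land so conveniently near the End_Date of x. That is, the JOINed data will not necessarily indicate that
"y is in the same group as x".
In the latter case, a strictly directed graph ("digraph") would not capture the common membership in the same group. You can observe this jarring difference by setting directed = TRUE in the first line of Step 2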
igraph::graph_from_data_frame(directed = TRUE) %>%
and by setting mode = "strong" in the next line
igraph::components(mode = "strong") %>%
to yield these disassociated results:
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 4 FALSE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 3 FALSE
4: A 2012-04-16 2015-01-09 5 FALSE
5: A 2002-06-01 2006-02-01 2 FALSE
6: A 2005-12-24 <NA> 1 FALSE
7: B 2018-01-23 <NA> NA FALSE
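By contrast, the rows can be properly grouped by using an undirected graph (directed = FALSE), or by using a more lenient criterion (mode = "weak"). Either of these approaches will effectively model y R x whenever x R y exists in the JOINed data.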
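The difference between the three settings can be seen by counting components on a small hand-typed edge list. The pairings below are an assumption: they are my reading of what the sample data produces (1 to 4, 3 to 1, 6 to 5), and the variable names are hypothetical.

```r
library(igraph)

# Hand-typed directed pairings (assumed for illustration): 'from' has a
# Start_Date within one year of the End_Date of 'to'.
links <- data.frame(from = c(1, 3, 6), to = c(4, 1, 5))

directed_graph   <- igraph::graph_from_data_frame(links, directed = TRUE)
undirected_graph <- igraph::graph_from_data_frame(links, directed = FALSE)

# Strong connectivity needs a path in BOTH directions, so every row is alone.
n_strong <- igraph::components(directed_graph, mode = "strong")$no

# Weak connectivity ignores direction, as does an undirected graph.
n_weak       <- igraph::components(directed_graph, mode = "weak")$no
n_undirected <- igraph::components(undirected_graph)$no

c(strong = n_strong, weak = n_weak, undirected = n_undirected)
```

Under these assumptions the strong mode splits the five linked rows into five singleton components, while the weak mode and the undirected graph both recover the two intended groups.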
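This symmetric property is especially important when modeling the behavior you describe here:
...one observation's start date (2016-01-01) is "fuzzy grouped" with two different end dates (2015-01-02 and 2016-12-31)...
In this situation, you want the model to recognize that any two rows y and z must be in the same group (y R z), so long as their End_Dates are matched by the Start_Date of some other row x: x R y and x R z.
So suppose we know that x R y and x R z. Because our model preserves symmetry, we can infer from x R y that y R x too. Since we now know that y R x and x R z, transitivity implies that y R z. Thus, our model recognizes that y R z whenever x R y and x R z! Similar logic suffices for "vice versa".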
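The same inference can be spelled out in base R with a small logical matrix. This is a sketch of the reasoning only, not of how igraph computes components. The row labels "1", "4", and "8" are illustrative stand-ins for x, y, and z, and all names are hypothetical.

```r
# R[a, b] is TRUE when the Start_Date of row a is close to the End_Date of b.
ids <- c("1", "4", "8")
R <- matrix(FALSE, nrow = 3, ncol = 3, dimnames = list(ids, ids))
R["1", "4"] <- TRUE  # x R y
R["1", "8"] <- TRUE  # x R z

# Symmetry: treat each pairing as holding in both directions.
S <- R | t(R)

# Transitivity: keep adding two-step connections until nothing changes.
closure <- S
repeat {
  two_step <- ((closure * 1) %*% (closure * 1)) > 0
  step <- closure | two_step
  if (all(step == closure)) break
  closure <- step
}

R["4", "8"]        # FALSE: y and z were never paired directly
closure["4", "8"]  # TRUE: y R z follows from symmetry plus transitivity
```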
We can verify this outcome by using
my_data_frame <- my_data_frame %>%
rbind(list(Name = "A",
Start_Date = as.Date("2010-01-01"),
End_Date = as.Date("2015-01-05")))
to append an 8th row to my_data_frame, prior to the workflow:
Name Start_Date End_Date
1 A 2015-01-01 2019-12-29
4 A 2012-04-16 2015-01-09
8 A 2010-01-01 2015-01-05
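This 8th row serves as our z, where the 1st row is x and the 4th row is y, as before. Indeed, output properly classifies y and z as belonging to the same group 1: y R z.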
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 1 TRUE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 1 FALSE
4: A 2012-04-16 2015-01-09 1 FALSE
5: A 2002-06-01 2006-02-01 2 FALSE
6: A 2005-12-24 <NA> 2 FALSE
7: B 2018-01-23 <NA> NA FALSE
8: A 2010-01-01 2015-01-05 1 FALSE
Likewise, output properly Flags the 1st row, whose Start_Date is now matched by two End_Dates: those in the 4th and 8th rows.