HNS*_*SKD 8 performance r data-manipulation dplyr tidyr
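I have prescription data for patients on oral DM drugs, namely DPP4 and SU, and I want to know whether a patient took the two drugs concomitantly (i.e. whether the DPP4 and SU intervals overlap within the same patient ID).
Sample data: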
ID DRUG START END
1 1 DPP4 2020-01-01 2020-01-20
2 1 DPP4 2020-03-01 2020-04-01
3 1 SU 2020-03-15 2020-04-30
4 2 SU 2020-10-01 2020-10-31
5 2 DPP4 2020-12-01 2020-12-31
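In the sample data above, for ID == 1 the patient had DPP4 and SU concomitantly from 2020-03-15 to 2020-04-01. For ID == 2, the patient took the two drugs at different intervals. I thought of splitting the data in two, one part for DPP4 and another for SU, then doing a full join and comparing each DPP4 interval with each SU interval. That may be fine for small data, but if a patient has 5 rows of DPP4 and another 5 rows of SU, that is 25 comparisons, which may not be efficient. Add to that 10,000+ patients.
I am not sure how to do this.
New data:
I would like a new df that looks like this, or anything tidy.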
ID DRUG START END
1 1 DPP4-SU 2020-03-15 2020-04-01
2 2 <NA> <NA> <NA>
Data code:
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), DRUG = c("DPP4", "DPP4",
"SU", "SU", "DPP4"), START = structure(c(18262, 18322, 18336,
18536, 18597), class = "Date"), END = structure(c(18281, 18353,
18382, 18566, 18627), class = "Date")), class = "data.frame", row.names = c(NA,
-5L))
df_new <- structure(list(ID = 1:2, DRUG = c("DPP4-SU", NA), START = structure(c(18336,
NA), class = "Date"), END = structure(c(18353, NA), class = "Date")), class = "data.frame", row.names = c(NA,
-2L))
Edit: From the sample data I gave, it looks as if there can be only one intersecting interval, but there could be more. So I think this is better data to illustrate:
structure(list(ID = c(3, 3, 3, 3, 3, 3, 3), DRUG = c("DPP4",
"DPP4", "SU", "SU", "DPP4", "DPP4", "DPP4"), START = structure(c(17004,
17383, 17383, 17418, 17437, 17649, 17676), class = c("IDate",
"Date")), END = structure(c(17039, 17405, 17405, 17521, 17625,
17669, 17711), class = c("IDate", "Date")), duration = c(35L,
22L, 22L, 103L, 188L, 20L, 35L), INDEX = c(1L, 0L, 0L, 0L, 0L,
0L, 0L)), row.names = c(NA, -7L), class = c("tbl_df", "tbl",
"data.frame"))
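Updated solution
I have made substantial modifications based on the newly provided data set. This time I first create an interval for each START and END pair and extract the intersecting period between consecutive intervals. As Martin did in his answer, we can use lubridate::int_start and lubridate::int_end to extract the START and END dates of each interval: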
library(dplyr)
library(lubridate)
library(purrr)
library(tidyr)
df %>%
group_by(ID) %>%
arrange(START, END) %>%
mutate(int = interval(START, END),
is_over = c(NA, map2(int[-n()], int[-1],
~ intersect(.x, .y)))) %>%
unnest(cols = c(is_over)) %>%
select(-int) %>%
filter(!is.na(is_over) | !is.na(lead(is_over))) %>%
select(!c(START, END)) %>%
mutate(grp = cumsum(is.na(is_over))) %>%
group_by(grp) %>%
summarise(ID = first(ID),
DRUG = paste0(DRUG, collapse = "-"),
is_over = na.omit(is_over)) %>%
mutate(START = int_start(is_over),
END = int_end(is_over)) %>%
select(!is_over)
# A tibble: 1 x 5
grp ID DRUG START END
<int> <int> <chr> <dttm> <dttm>
1 1 1 DPP4-SU 2020-03-15 00:00:00 2020-04-01 00:00:00
For the second data set:
# A tibble: 2 x 5
grp ID DRUG START END
<int> <dbl> <chr> <dttm> <dttm>
1 1 3 DPP4-SU 2017-08-05 00:00:00 2017-08-27 00:00:00
2 2 3 SU-DPP4 2017-09-28 00:00:00 2017-12-21 00:00:00
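The building block used above is the intersection of two lubridate intervals. As a minimal sketch, assuming only the two overlapping prescriptions of ID == 1 and the two non-overlapping prescriptions of ID == 2 from the question (the names `dpp4`, `su`, `both` and `none` are illustrative), this is how a single pair behaves:

```r
library(lubridate)

# Two prescriptions of ID == 1 taken from the sample data
dpp4 <- interval(as.Date("2020-03-01"), as.Date("2020-04-01"))
su   <- interval(as.Date("2020-03-15"), as.Date("2020-04-30"))

# intersect() on two Interval objects returns their common period
both <- lubridate::intersect(dpp4, su)
as_date(int_start(both))  # 2020-03-15
as_date(int_end(both))    # 2020-04-01

# Two prescriptions of ID == 2: they do not overlap, so the result is NA
none <- lubridate::intersect(
  interval(as.Date("2020-10-01"), as.Date("2020-10-31")),
  interval(as.Date("2020-12-01"), as.Date("2020-12-31"))
)
is.na(int_end(none))
```

Note that the pipeline above applies this only to each interval and the one that follows it after sorting, so it may miss an overlap between two prescriptions that are not adjacent in that order.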
This is more complicated than @AnoushiravanR's answer, but as an alternative you could try
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
full_join(x = ., y = ., by = "ID") %>%
# filter(DRUG.x != DRUG.y | START.x != START.y | END.x != END.y) %>%
filter(DRUG.x != DRUG.y) %>%
group_by(ID, intersection = intersect(interval(START.x, END.x), interval(START.y, END.y))) %>%
drop_na(intersection) %>%
filter(START.x == first(START.x)) %>%
summarise(DRUG = paste(DRUG.x, DRUG.y, sep = "-"),
START = as_date(int_start(intersection)),
END = as_date(int_end(intersection)),
.groups = "drop") %>%
select(-intersection)
This returns
# A tibble: 1 x 4
ID DRUG START END
<int> <chr> <date> <date>
1 1 DPP4-SU 2020-03-15 2020-04-01
Edit: I changed the filter condition. The previous one was flawed.
With the updated df
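The pairwise comparison the question worries about can also be written without interval objects. The following is a minimal base R sketch, not the code of this answer: it pairs every DPP4 row with every SU row of the same patient and keeps the pairs whose common period is non-empty. The names `pairs` and `overlaps` are illustrative, and treating intervals that merely touch on one day as overlapping (`<=`) is an assumption.

```r
# Sample data from the question
df <- data.frame(
  ID    = c(1L, 1L, 1L, 2L, 2L),
  DRUG  = c("DPP4", "DPP4", "SU", "SU", "DPP4"),
  START = as.Date(c("2020-01-01", "2020-03-01", "2020-03-15",
                    "2020-10-01", "2020-12-01")),
  END   = as.Date(c("2020-01-20", "2020-04-01", "2020-04-30",
                    "2020-10-31", "2020-12-31"))
)

# Pair every DPP4 row with every SU row of the same patient
pairs <- merge(
  df[df$DRUG == "DPP4", ],
  df[df$DRUG == "SU", ],
  by = "ID", suffixes = c(".dpp4", ".su")
)

# The common period runs from the later start to the earlier end
pairs$START <- pmax(pairs$START.dpp4, pairs$START.su)
pairs$END   <- pmin(pairs$END.dpp4, pairs$END.su)

# Keep only pairs whose common period is non-empty (assumption: <=)
overlaps <- pairs[pairs$START <= pairs$END, c("ID", "START", "END")]
overlaps$DRUG <- "DPP4-SU"
overlaps  # one row: ID 1, 2020-03-15 to 2020-04-01
```

Because the join is restricted to DPP4 against SU within one ID, a patient with 5 rows of each drug produces 25 vectorised comparisons rather than 25 separate calls, which is usually cheap even for 10,000+ patients.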
df <- structure(list(ID = c(3, 3, 3, 3, 3, 3, 3), DRUG = c(
"DPP4",
"DPP4", "SU", "SU", "DPP4", "DPP4", "DPP4"
), START = structure(c(
17004,
17383, 17383, 17418, 17437, 17649, 17676
), class = c(
"IDate",
"Date"
)), END = structure(c(
17039, 17405, 17405, 17521, 17625,
17669, 17711
), class = c("IDate", "Date")), duration = c(
35L,
22L, 22L, 103L, 188L, 20L, 35L
), INDEX = c(
1L, 0L, 0L, 0L, 0L,
0L, 0L
)), row.names = c(NA, -7L), class = c(
"tbl_df", "tbl",
"data.frame"
))
we obtain
> dfnew
ID DRUG start end
3.3 3 DPP4-SU 2017-08-05 2017-08-27
3.7 3 SU-DPP4 2017-09-28 2017-12-21
A base R option (not as fancy as the answers by @Anoushiravan R or @Martin Gal)
f <- function(d) {
d <- d[with(d, order(START, END)), ]
idx <- subset(
data.frame(which((u <- with(d, outer(START, END, `<`))) & t(u), arr.ind = TRUE)),
row > col
)
if (nrow(idx) == 0) {
return(data.frame(ID = unique(d$ID), DRUG = NA, start = NA, end = NA))
}
with(
d,
do.call(rbind,
apply(
idx,
1,
FUN = function(v) {
data.frame(
ID = ID[v["row"]],
DRUG = paste0(DRUG[sort(unlist(v))], collapse = "-"),
start = START[v["row"]],
end = END[v["col"]]
)
}
))
)
}
dfnew <- do.call(rbind, Map(f, split(df, ~ID)))
which gives
> dfnew
ID DRUG start end
1 1 DPP4-SU 2020-03-15 2020-04-01
2 2 <NA> <NA> <NA>
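The core of `f` is the overlap matrix built with `outer`. As a minimal sketch, assuming only the three prescriptions of ID == 1 (already sorted by START), the test can be inspected on its own:

```r
# The three prescriptions of ID == 1 from the question, sorted by START
START <- as.Date(c("2020-01-01", "2020-03-01", "2020-03-15"))
END   <- as.Date(c("2020-01-20", "2020-04-01", "2020-04-30"))

# u[i, j] is TRUE when interval i starts before interval j ends
u <- outer(START, END, `<`)

# Intervals i and j overlap when each one starts before the other ends
overlap <- u & t(u)

# Keep each overlapping pair once (row > col), which also drops the diagonal
idx <- which(overlap, arr.ind = TRUE)
idx <- idx[idx[, "row"] > idx[, "col"], , drop = FALSE]
idx  # a single pair: row 3 (SU) overlaps col 2 (the second DPP4 row)
```

One caveat, as far as I can tell from reading `f`: it reports `END[v["col"]]` as the end of the overlap, which is correct when the earlier-starting interval also ends first. If an earlier interval fully contains a later one, `min(END[v["row"]], END[v["col"]])` would be the safer choice.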