Emm*_*man 5 r set-intersection dplyr
I want to compute the overlap coefficient between sets. My data is a 2-column table, for example:
df_example <-
  tibble::tribble(~my_group, ~cities,
                  "foo", "london",
                  "foo", "paris",
                  "foo", "rome",
                  "foo", "tokyo",
                  "foo", "oslo",
                  "bar", "paris",
                  "bar", "nyc",
                  "bar", "rome",
                  "bar", "munich",
                  "bar", "warsaw",
                  "bar", "sf",
                  "baz", "milano",
                  "baz", "oslo",
                  "baz", "sf",
                  "baz", "paris")

In df_example, I have 3 sets (foo, bar, and baz), and the members of each set are listed in the cities column.
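I would like to end up with a table that intersects every possible pair of sets and also reports the size of the smaller set in each pair. From those two numbers I can compute the overlap coefficient for each pair of sets.

(overlap coefficient = number of common members / size of the smaller set)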
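As a quick illustration of the formula, here is a minimal base R sketch for a single pair. The vectors foo and bar are copied by hand from df_example, and the names common and overlap_coeff are chosen only for this example.

```r
# Two of the sets from df_example, written out as plain character vectors
foo <- c("london", "paris", "rome", "tokyo", "oslo")
bar <- c("paris", "nyc", "rome", "munich", "warsaw", "sf")

# Members the two sets share
common <- intersect(foo, bar)

# Overlap coefficient: shared members divided by the size of the smaller set
overlap_coeff <- length(common) / min(length(foo), length(bar))

common         # "paris" "rome"
overlap_coeff  # 0.4
```

Here foo is the smaller set with 5 members, so the coefficient is 2 / 5.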
Desired output
## # A tibble: 3 × 4
##   combination n_instersected_members size_of_smaller_set overlap_coeff
##   <chr>                        <dbl>               <dbl>         <dbl>
## 1 foo*bar                          2                   5          0.4
## 2 foo*baz                          3                   4          0.75
## 3 bar*baz                          2                   4          0.5

Is there a reasonably simple way to do this with dplyr verbs? I tried
df_example |>
  group_by(my_group) |>
  summarise(intersected = dplyr::intersect(cities))

but this obviously doesn't work, because dplyr::intersect() needs two vectors. Is there a way to get the desired output with a dplyr-style approach?
Here is a base R option using combn
do.call(
rbind,
combn(
with(
df_example,
split(cities, my_group)
),
2,
\(x)
transform(
data.frame(
combo = paste0(names(x), collapse = "-"),
nrIntersect = sum(x[[1]] %in% x[[2]]),
szSmallSet = min(lengths(x))
),
olCoeff = nrIntersect / szSmallSet
),
simplify = FALSE
)
)
This gives
combo nrIntersect szSmallSet olCoeff
1 bar-baz 2 4 0.5
2 bar-foo 2 5 0.4
3 baz-foo 2 4 0.5
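One caveat, offered as a hedged aside: sum(x[[1]] %in% x[[2]]) counts every matching element of the first vector, so it equals the size of the intersection only if no city is repeated within a group (true for df_example). If duplicates are possible, a set-based variant along these lines may be safer. The helper name overlap_coeff and the vectors a and b are hypothetical and not part of the answer above.

```r
# Hypothetical helper (not from the answer above): overlap coefficient
# computed on de-duplicated sets
overlap_coeff <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / min(length(a), length(b))
}

# A repeated "paris" inflates the %in% count but not the set-based one
a <- c("paris", "paris", "rome")
b <- c("paris", "oslo")

sum(a %in% b)        # 2
overlap_coeff(a, b)  # 0.5
```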