Related questions and solutions (0)

Fast way to group variables based on direct and indirect similarities in multiple columns

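I have a relatively large data set (1,750,000 rows, 5 columns) containing records with unique ID values (first column), each described by four criteria (the other four columns). A small example: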

# example
library(data.table)
dt <- data.table(id=c("a1","b3","c7","d5","e3","f4","g2","h1","i9","j6"), 
                 s1=c("a","b","c","l","l","v","v","v",NA,NA), 
                 s2=c("d","d","e","k","k","o","o","o",NA,NA),
                 s3=c("f","g","f","n","n","s","r","u","w","z"),
                 s4=c("h","i","j","m","m","t","t","t",NA,NA))

which looks like this:

   id   s1   s2 s3   s4
 1: a1    a    d  f    h
 2: b3    b    d  g    i
 3: c7    c    e  f    j
 4: d5    l    k  n    m
 5: e3    l    k  n    m
 6: f4    v    o  s    t
 7: g2    v    o  r    t
 8: h1    v    o  u    t
 9: i9 <NA> <NA>  w <NA>
10: j6 <NA> <NA>  z <NA>

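My ultimate goal is to find all records that share the same character in any of the description columns (ignoring NAs) and group them under a new ID, so that I can easily identify duplicated records. These IDs are constructed by concatenating the IDs of each row. …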
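One possible way to approach this kind of grouping is to treat it as a connected-components problem: rows that share a value are linked, and links chain together, which is what makes the similarity "indirect". The sketch below is not taken from the question or its answers. It uses base R only (a `data.frame` instead of `data.table`, so it runs without extra packages), the helper name `group_by_shared_values` is hypothetical, values are assumed to be compared within the same column, and the `|` separator for the concatenated IDs is an assumption because the excerpt does not show the expected format.

```r
# Base R sketch (assumptions: per-column matching, "|" as ID separator).
df <- data.frame(
  id = c("a1","b3","c7","d5","e3","f4","g2","h1","i9","j6"),
  s1 = c("a","b","c","l","l","v","v","v",NA,NA),
  s2 = c("d","d","e","k","k","o","o","o",NA,NA),
  s3 = c("f","g","f","n","n","s","r","u","w","z"),
  s4 = c("h","i","j","m","m","t","t","t",NA,NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: propagate the smallest row label through shared values
# until nothing changes, so directly and indirectly linked rows end up together.
group_by_shared_values <- function(data, cols) {
  label <- seq_len(nrow(data))          # every row starts in its own group
  repeat {
    old <- label
    for (col in cols) {
      v  <- data[[col]]
      ok <- !is.na(v)                   # NA never links two rows
      if (!any(ok)) next
      # all rows sharing a value in this column take the smallest label among them
      label[ok] <- ave(label[ok], v[ok], FUN = min)
    }
    if (identical(label, old)) break    # converged
  }
  match(label, unique(label))           # renumber groups as 1, 2, 3, ...
}

grp <- group_by_shared_values(df, c("s1", "s2", "s3", "s4"))

# Build one concatenated ID per group and map it back to every row.
ids <- vapply(split(df$id, grp), paste, character(1), collapse = "|")
df$new_id <- unname(ids[as.character(grp)])
print(df[, c("id", "new_id")])
```

Each pass is vectorized per column, and the loop repeats only until the labels stop changing, so the number of passes depends on how long the chains of indirect links are rather than on the number of rows.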

Tags: optimization, loops, r, grepl, data.table

Score: 13 | Answers: 2 | Views: 238

Make a group_indices based on several columns

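I would like to generate an index that groups observations based on two columns. However, I want each group to consist of observations that share at least one value in common, in either column. I can see how to build groups from observations that share both values, but not from observations that share just one of them.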

For example, with the data frame:

dt <- data.frame(id=1:10,
             G1 = c("A","A","B","B","C","C","C","D","E","F"),
             G2 = c("Z","X","X","Y","W","V","U","s","T","T"))

I would like to get a column like:

1,1,1,1,2,2,2,3,4,4

I tried using group_indices from dplyr, but have not managed to get it working.
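As a rough illustration of why a plain `group_indices` call is not enough here, the grouping asked for is transitive: row 1 and row 3 share no value, yet both are linked through row 2. A union-find over the distinct values of both columns is one way to sketch this in base R. The function name `group_two_columns` and the `G1:`/`G2:` node prefixes are hypothetical, and the sketch assumes neither column contains `NA`.

```r
dt <- data.frame(id = 1:10,
                 G1 = c("A","A","B","B","C","C","C","D","E","F"),
                 G2 = c("Z","X","X","Y","W","V","U","s","T","T"))

# Hypothetical helper: union-find over the values of two columns.
# Assumes no NA values; NAs would otherwise be treated as one shared value.
group_two_columns <- function(a, b) {
  # Each distinct value is a node; prefixes keep the two columns apart
  # even if they happen to contain the same string.
  na <- paste0("G1:", a)
  nb <- paste0("G2:", b)
  nodes  <- unique(c(na, nb))
  parent <- seq_along(nodes)            # each node starts as its own root

  find <- function(i) {                 # walk up to the root of node i
    while (parent[i] != i) i <- parent[i]
    i
  }

  ia <- match(na, nodes)
  ib <- match(nb, nodes)
  for (k in seq_along(ia)) {            # each row links its two values
    ra <- find(ia[k])
    rb <- find(ib[k])
    if (ra != rb) parent[max(ra, rb)] <- min(ra, rb)
  }

  roots <- vapply(ia, find, integer(1)) # component root for every row
  match(roots, unique(roots))           # renumber in order of appearance
}

dt$group <- group_two_columns(dt$G1, dt$G2)
print(dt$group)
```

For the example data this reproduces the requested column `1,1,1,1,2,2,2,3,4,4`. The row loop is simple rather than fast; for large inputs a graph library's connected-components routine would likely be a better fit.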

Tags: r, dplyr

Score: 12 | Answers: 1 | Views: 566
