Filtering out duplicated/non-unique rows in data.table

Dav*_*agh 68 r duplicates data.table

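I have a data.table with about 2.5 million rows and two columns. I want to remove any rows that are duplicated across both columns. Previously, for a data.frame, I would have done this: df -> unique(df[,c('V1', 'V2')]) but this doesn't work with data.table. I have tried unique(df[,c(V1,V2), with=FALSE]) but it still seems to operate only on the key of the data.table and not on the whole row.

Any suggestions?

Cheers, Davy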

>dt
      V1   V2
[1,]  A    B
[2,]  A    C
[3,]  A    D
[4,]  A    B
[5,]  B    A
[6,]  C    D
[7,]  C    D
[8,]  E    F
[9,]  G    G
[10,] A    B

In the data.table above, where V2 is the table key, only rows 4, 7, and 10 should be removed.

> dput(dt)
structure(list(V1 = c("B", "A", "A", "A", "A", "A", "C", "C", 
"E", "G"), V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", 
"G")), .Names = c("V1", "V2"), row.names = c(NA, -10L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x7fb4c4804578>, sorted = "V2")

And*_*rie 84

Before v1.9.8

From ?unique.data.table it is clear that calling unique on a data.table only operates on the key. This means you have to reset the key to all columns before calling unique.

library(data.table)
dt <- data.table(
  V1=LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
  V2=LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)

Calling unique with one column as the key:

setkey(dt, "V2")
unique(dt)
     V1 V2
[1,]  B  A
[2,]  A  B
[3,]  A  C
[4,]  A  D
[5,]  E  F
[6,]  G  G

For v1.9.8+

From ?unique.data.table: by default, all columns are used (which is consistent with ?unique.data.frame).

unique(dt)
   V1 V2
1:  A  B
2:  A  C
3:  A  D
4:  B  A
5:  C  D
6:  E  F
7:  G  G

Or use the by argument to get the unique combinations of specific columns (as keys were used for previously):

unique(dt, by = "V2")
   V1 V2
1:  A  B
2:  A  C
3:  A  D
4:  B  A
5:  E  F
6:  G  G

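  • @Andrie This solution no longer works, as @PeterPan pointed out. `data.table` no longer considers the key in `unique()`. You now have to use the option `unique(,by = c(keys))`. (13 upvotes)
  • Just to let you know, altabq is correct: the column names in the key have to be quoted. So you want unique(dt, by = c("V1", "V2")) for your answer. (3 upvotes)
  • This only works if the key is not set. I will edit the question above to illustrate this. Sorry (2 upvotes)
  • As akrun answered here: http://stackoverflow.com/questions/40949023/r-somehow-unique-is-not-working-for-my-data-table the first version now needs a by = option to work (2 upvotes)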
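
The comments above point at passing `by` explicitly. A minimal sketch of that idea follows. It assumes a data.table version whose `unique()` accepts a `by` argument, and the names `dedup_all` and `dedup_key` are illustrative only. Naming the columns in `by` should give the same result whether or not a key is set.

```r
library(data.table)

# Same example table as in the answer above
dt <- data.table(
  V1 = LETTERS[c(1, 1, 1, 1, 2, 3, 3, 5, 7, 1)],
  V2 = LETTERS[c(2, 3, 4, 2, 1, 4, 4, 6, 7, 2)]
)
setkey(dt, V2)  # keyed on V2, as in the question

# Deduplicate on every column by naming them all explicitly
dedup_all <- unique(dt, by = names(dt))

# Deduplicate on the key columns only (what older versions did by default)
dedup_key <- unique(dt, by = key(dt))

nrow(dedup_all)  # 7
nrow(dedup_key)  # 6
```

Spelling out `by` avoids relying on a default that changed between releases.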

dnl*_*rky 7

Using your example data.table...

> dt<-data.table(V1 = c("B", "A", "A", "A", "A", "A", "C", "C", "E", "G"), V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", "G"))
> setkey(dt,V2)

Consider the following tests:

> haskey(dt) # obviously dt has a key, since we just set it
[1] TRUE

> haskey(dt[,list(V1,V2)]) # ... but this is treated like a "new" table, and does not have a key
[1] FALSE

> haskey(dt[,.SD]) # note that this still has a key
[1] TRUE

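So you can list the columns of your table and then call unique() on the result, without having to set the key to all columns or remove it (by setting it to NULL) as required by the solution from @Andrie (and edited by @MatthewDowle). The solutions suggested by @Pop and @Rahul did not work for me.

See Try 3 below, which is very similar to your initial attempt. Your example was not clear, so I'm not sure why it did not work. Also, you posted the question a few months ago, so maybe data.table has been updated since?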

> unique(dt) # Try 1: wrong answer (missing V1=C and V2=D)
   V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  E  F
6:  G  G

> dt[!duplicated(dt)] # Try 2: wrong answer (missing V1=C and V2=D)
   V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  E  F
6:  G  G

> unique(dt[,list(V1,V2)]) # Try 3: correct answer; does not require modifying key
   V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  C  D
6:  E  F
7:  G  G

> setkey(dt,NULL)
> unique(dt) # Try 4: correct answer; requires key to be removed
   V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  C  D
6:  E  F
7:  G  G

  • Perhaps a new `unique(...,use.key = FALSE)` argument would help; now filed as [FR#2483](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2483&group_id=240&atid=978). (3 upvotes)
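
As a companion to Try 2 above, here is a hedged sketch of the same filter with the columns named explicitly. It assumes a data.table version in which `duplicated()` accepts a `by` argument, and the variable name `is_dup` is illustrative.

```r
library(data.table)

dt <- data.table(
  V1 = c("B", "A", "A", "A", "A", "A", "C", "C", "E", "G"),
  V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", "G")
)
setkey(dt, V2)

# Flag rows that repeat an earlier (V1, V2) pair, independent of the key
is_dup <- duplicated(dt, by = c("V1", "V2"))

dt[!is_dup]  # keeps the first occurrence of each (V1, V2) pair
sum(is_dup)  # number of rows that would be dropped
```

With this table the filter keeps 7 rows and drops 3, matching Try 3 and Try 4 without touching the key.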