Finding rows with a given difference between values in a column

Ina*_*Ina 5 r dataframe data.table

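For a data.table (or data.frame) in R, I would like to find all rows whose value in the "value" column lies a given distance, "distance", from another value in a row with the same key. So, given the following: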

distance <- 22
   key value
   A     1
   B     1
   C     1
   D     1
   A     4
   B     4
   A    23
   B    23
   B    26
   B    26
   C    30

I would like to annotate the original table with a count of how many rows exist with the same key and a value exactly 22 greater:

  key value count
  A     1     1
  B     1     1
  C     1     0
  D     1     0
  A     4     0
  B     4     2
  A    23     0
  B    23     0
  B    26     0
  B    26     0
  C    30     0

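I really don't know where to start with this kind of self-referential manipulation of data in R. My initial attempts involved creating a second table and trying to match against it, but that seems like an odd and clumsy approach.

Note: I am using the data.table package, but I would be happy to work with a data.frame in this case if that makes things easier.

For reproducibility: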

require(data.table)
source <- data.table(data.frame(key=c("A","B","C","D","A","B","A","B","B","B", "C"),value=c(1,1,1,1,4,4,23,23,26,26,30)))
result <- data.table(data.frame(key=c("A","B","C","D","A","B","A","B","B","B","C"),value=c(1,1,1,1,4,4,23,23,26,26,30),count=c(1,1,0,0,0,2,0,0,0,0,0)))

Jos*_*ien 5

Here is a data.table-based solution. I'd be interested to learn what improvements (if any) could be made to it.

# Your code
library(data.table)
source <- 
data.table(data.frame(key = c("A","B","C","D","A","B","A","B","B","B", "C"),
                      value = c(1,1,1,1,4,4,23,23,26,26,30)))

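The odd-looking data.table(data.frame(... is there because data.table() has an argument called key. Wrapping the data in data.frame() is one way to create a data.table with a column named "key". Capitalizing the column names to avoid the argument-name clash allows the more standard syntax: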

source <- data.table(Key = c("A","B","C","D","A","B","A","B","B","B","C"),
                     Value = c(1,1,1,1,4,4,23,23,26,26,30))

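Next, to avoid needing as.integer() later, we change the type of the Value column from numeric to integer now. Remember that 1 is numeric in R; it is 1L that is integer. It is usually more efficient to store integer data as integer rather than as numeric. The next line is easier than typing L many times in the code above.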

source[,Value:=as.integer(Value)]   # change type from `numeric` to `integer`
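
As a small illustration of the numeric versus integer distinction described above, the following base R sketch may help. The names x and y are hypothetical and not part of the original answer.

```r
# A bare literal such as 1 is stored as a double ("numeric")
class(1)     # "numeric"

# The L suffix makes the literal an integer
class(1L)    # "integer"

# Converting a whole vector at once avoids typing L on every element
x <- c(1, 4, 23)      # hypothetical example values, stored as numeric
y <- as.integer(x)    # same values, stored as integer
is.integer(y)         # TRUE
```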

Now to continue:

distance <- 22L
setkey(source, Key, Value)

# Heart of the solution (following a few explanatory comments):
#  "J()"   : shorthand for 'data.table()'
#  ".N"    : returns the number of rows that matched a line (see ?data.table)
#  "[[3]]" : as with simple data.frames, extracts the vector in column 3

source[,count:=source[J(Key,Value+distance),.N][[3]]]
source
      Key Value count
 [1,]   A     1     1
 [2,]   A     4     0
 [3,]   A    23     0
 [4,]   B     1     1
 [5,]   B     4     2
 [6,]   B    23     0
 [7,]   B    26     0
 [8,]   B    26     0
 [9,]   C     1     0
[10,]   C    30     0
[11,]   D     1     0
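
The call above relies on the join evaluating .N once per looked-up row, which older data.table releases did implicitly. In more recent releases (1.9.4 onward, as far as I know) that per-row grouping has to be requested with by = .EACHI. A minimal sketch of the same keyed join under that assumption follows; dt, lookup and counts are hypothetical names used here to avoid masking base::source().

```r
library(data.table)

# `dt` stands in for the `source` table from the answer (hypothetical name)
dt <- data.table(Key   = c("A","B","C","D","A","B","A","B","B","B","C"),
                 Value = c(1L,1L,1L,1L,4L,4L,23L,23L,26L,26L,30L))
distance <- 22L

setkey(dt, Key, Value)   # sorts dt by Key, then Value

# One lookup row per row of dt: same Key, Value shifted by `distance`
lookup <- dt[, .(Key, Value = Value + distance)]

# Keyed join: by = .EACHI evaluates .N once per lookup row,
# giving 0 where no row of dt matches
counts <- dt[lookup, .N, by = .EACHI]$N

dt[, count := counts]    # add the column by reference
dt
```

Extracting the counts with $N rather than [[3]] avoids depending on the column position in the join result.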

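Note that := changes source directly by reference, and that is all there is to it. But setkey() also changes the order of the original data. If the original order needs to be preserved, then: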

source <- data.table(Key = c("A","B","C","D","A","B","A","B","B","B","C"),
                     Value = c(1,1,1,1,4,4,23,23,26,26,30))
source[,Value:=as.integer(Value)]   
source[,count:=setkey(copy(source))[source[,list(Key,Value+distance)],.N][[3]]]

      Key Value count
 [1,]   A     1     1
 [2,]   B     1     1
 [3,]   C     1     0
 [4,]   D     1     0
 [5,]   A     4     0
 [6,]   B     4     2
 [7,]   A    23     0
 [8,]   B    23     0
 [9,]   B    26     0
[10,]   B    26     0
[11,]   C    30     0
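
In more recent data.table versions, an ad hoc join with the on= argument may be a simpler way to keep the original row order, since it needs neither setkey() nor copy(). This is only a sketch under that assumption, not the method used above; dt, lookup and counts are hypothetical names.

```r
library(data.table)

# `dt` stands in for the `source` table from the answer (hypothetical name)
dt <- data.table(Key   = c("A","B","C","D","A","B","A","B","B","B","C"),
                 Value = c(1L,1L,1L,1L,4L,4L,23L,23L,26L,26L,30L))
distance <- 22L

# One lookup row per row of dt, in the original row order
lookup <- dt[, .(Key, Value = Value + distance)]

# Join on both columns without setting a key; the result has one row
# per lookup row, in lookup order, with N = 0 where nothing matches
counts <- dt[lookup, on = c("Key", "Value"), .N, by = .EACHI]$N

dt[, count := counts]    # rows of dt are never reordered
dt
```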

  • Sure. I've just added some comments to the code that begin to unpack the compact syntax of the data.table call. (3 upvotes)