Ina*_*Ina 5 r dataframe data.table
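For a data.table (or data.frame) in R, I want to find all rows whose value in the "value" column is a given distance "distance" away from another value in a row with the same key. So, given the following: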
distance <- 22
key value
A 1
B 1
C 1
D 1
A 4
B 4
A 23
B 23
B 26
B 26
C 30
I would like to annotate the original table with a count, for each row, of how many rows have the same key and a value exactly 22 greater:
key value count
A 1 1
B 1 1
C 1 0
D 1 0
A 4 0
B 4 2
A 23 0
B 23 0
B 26 0
B 26 0
C 30 0
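I really don't know where to start with this kind of self-referential approach to manipulating data in R. My initial attempts involved creating a second table and trying to match against it, but that seems like a strange and poor way to do it.
Note: I am using the data.table package, but I'm happy to work with a data.frame in this case if that makes things easier.
Reproducible example: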
require(data.table)
source <- data.table(data.frame(key=c("A","B","C","D","A","B","A","B","B","B", "C"),value=c(1,1,1,1,4,4,23,23,26,26,30)))
result <- data.table(data.frame(key=c("A","B","C","D","A","B","A","B","B","B","C"),value=c(1,1,1,1,4,4,23,23,26,26,30),count=c(1,1,0,0,0,2,0,0,0,0,0)))
Here is a data.table-based solution. I would be interested to learn what improvements (if any) could be made to it.
# Your code
library(data.table)
source <-
  data.table(data.frame(key   = c("A","B","C","D","A","B","A","B","B","B","C"),
                        value = c(1,1,1,1,4,4,23,23,26,26,30)))
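The odd-looking data.table(data.frame(... is there because data.table() has an argument called key. Wrapping a data.frame is one way to create a data.table with a column named "key". Capitalising the column names avoids the clash with the argument name and allows the more standard syntax: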
source <- data.table(Key   = c("A","B","C","D","A","B","A","B","B","B","C"),
                     Value = c(1,1,1,1,4,4,23,23,26,26,30))
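As a small illustration of why the name clashes, here is a sketch of what the key argument of data.table() does when it is used as intended. The table dt and its contents are made up for this example and are not part of the original data.

```r
library(data.table)

# Illustrative table; `dt` is a name chosen for this sketch only.
dt <- data.table(Key   = c("B", "A", "C"),
                 Value = c(2, 1, 3),
                 key   = "Key")   # `key` is an argument here, not a column

names(dt)   # only the two data columns, no column called "key"
key(dt)     # the column the table is keyed by
dt$Key      # setting a key also sorts the rows by that column
```

Because key = is consumed as an argument, a column with that exact name has to be created some other way, which is what the data.frame() wrapper achieves.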
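Next, to avoid needing as.integer() later on, we change the type of the Value column from numeric to integer. Remember that 1 is numeric in R, while 1L is integer. It is usually more efficient to store integer data as integer rather than as numeric. The next line is easier than typing L many times above.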
source[,Value:=as.integer(Value)] # change type from `numeric` to `integer`
Now continue:
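The numeric/integer distinction can be checked directly in base R. This is a minimal sketch; the vector names are invented for the illustration.

```r
x_num <- c(1, 4, 23)       # literals without L are stored as double ("numeric")
x_int <- c(1L, 4L, 23L)    # the L suffix makes them integer

class(x_num)               # "numeric"
class(x_int)               # "integer"

# as.integer() converts an existing vector in one step
same_after_conversion <- identical(as.integer(x_num), x_int)

# an integer takes 4 bytes per element, a double takes 8
big_num <- numeric(1e6)
big_int <- integer(1e6)
integer_is_smaller <- object.size(big_int) < object.size(big_num)
```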
distance <- 22L
setkey(source, Key, Value)
# Heart of the solution (following a few explanatory comments):
# "J()"         : shorthand for 'data.table()'
# ".N"          : returns the number of rows that matched a line (see ?data.table)
# "by = .EACHI" : evaluates .N once per row of i; needed from data.table 1.9.4
#                 onwards (older versions grouped this way implicitly)
# "[[3]]"       : as with simple data.frames, extracts the vector in column 3
source[, count := source[J(Key, Value + distance), .N, by = .EACHI][[3]]]
source
Key Value count
[1,] A 1 1
[2,] A 4 0
[3,] A 23 0
[4,] B 1 1
[5,] B 4 2
[6,] B 23 0
[7,] B 26 0
[8,] B 26 0
[9,] C 1 0
[10,] C 30 0
[11,] D 1 0
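Note that := modifies source directly, by reference, and that is all there is to it. However, setkey() has also changed the order of the original data. If the original order needs to be preserved, then: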
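To see what the inner join hands back before [[3]] picks out the counts, here is a sketch on a smaller table. It assumes a data.table version that supports by = .EACHI (1.9.4 or later); dt, lookup and hits are names invented for this illustration.

```r
library(data.table)

# Small illustrative table, keyed the same way as in the solution
dt <- data.table(Key   = c("A", "A", "B", "B", "B"),
                 Value = c(1L, 23L, 4L, 26L, 26L))
setkey(dt, Key, Value)
distance <- 22L

# One lookup row per original row: same Key, Value shifted by `distance`
lookup <- dt[, .(Key, Value = Value + distance)]

# Join the lookup rows against the keyed table, counting matches per lookup row
hits <- dt[lookup, .N, by = .EACHI]
hits          # columns: Key, Value (the shifted value), N
hits[[3]]     # the N column, in the same order as the rows of `dt`
```

Rows of lookup that find no partner get N = 0, and the two identical rows (B, 26) are counted twice for (B, 4), which is why that row ends up with a count of 2.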
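source <- data.table(Key   = c("A","B","C","D","A","B","A","B","B","B","C"),
                     Value = c(1,1,1,1,4,4,23,23,26,26,30))
source[, Value := as.integer(Value)]
# Join against a keyed copy so that source itself is never reordered.
# by = .EACHI is needed from data.table 1.9.4 onwards to get one count per row.
source[, count := setkey(copy(source))[source[, list(Key, Value + distance)], .N, by = .EACHI][[3]]]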
Key Value count
[1,] A 1 1
[2,] B 1 1
[3,] C 1 0
[4,] D 1 0
[5,] A 4 0
[6,] B 4 2
[7,] A 23 0
[8,] B 23 0
[9,] B 26 0
[10,] B 26 0
[11,] C 30 0
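A possible alternative, not part of the answer above, is to skip setkey() and copy() altogether and name the join columns with the on= argument. This sketch assumes data.table 1.9.6 or later, where on= is available; dt, lookup, counts and expected are names chosen for the illustration, and expected is simply the count column from the question.

```r
library(data.table)

dt <- data.table(Key   = c("A","B","C","D","A","B","A","B","B","B","C"),
                 Value = c(1L,1L,1L,1L,4L,4L,23L,23L,26L,26L,30L))
distance <- 22L

# One lookup row per original row: same Key, Value shifted by `distance`
lookup <- dt[, .(Key, Value = Value + distance)]

# Ad hoc join on Key and Value; no key is set, so `dt` keeps its row order.
# by = .EACHI returns one count per lookup row, 0 where nothing matches.
counts <- dt[lookup, .N, by = .EACHI, on = c("Key", "Value")]$N
dt[, count := counts]

# Compare with the counts the question asks for
expected <- c(1L, 1L, 0L, 0L, 0L, 2L, 0L, 0L, 0L, 0L, 0L)
identical(dt$count, expected)
```

Since nothing is sorted, the counts line up with the original row order directly.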