Apply a function across multiple columns

SJD*_*JDS 1 r data.table

Below is a small portion of the long-format data I am working with:

dput(dt)
structure(list(id = 1:15, pnum = c(4298390L, 4298390L, 4298390L, 
    4298558L, 4298558L, 4298559L, 4298559L, 4299026L, 4299026L, 4299026L, 
    4299026L, 4300436L, 4300436L, 4303566L, 4303566L), invid = c(15L, 
    101L, 102L, 103L, 104L, 103L, 104L, 106L, 107L, 108L, 109L, 87L, 
    111L, 2L, 60L), fid = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 
    4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L), .Label = c("CORN", "DowCor", 
    "KIM", "Texas"), class = "factor"), dom_kn = c(1L, 0L, 0L, 0L, 
    1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L), prim_kn = c(1L, 
    0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), pat_kn = c(1L, 
    0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), net_kn = c(1L, 
    0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L), age_kn = c(1L, 
    0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), legclaims = c(5L, 
    0L, 0L, 2L, 5L, 2L, 5L, 0L, 0L, 0L, 0L, 5L, 0L, 5L, 2L), n_inv = c(3L, 
    3L, 3L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L)), .Names = c("id", 
    "pnum", "invid", "fid", "dom_kn", "prim_kn", "pat_kn", "net_kn", 
    "age_kn", "legclaims", "n_inv"), class = "data.frame", row.names = c(NA, 
    -15L))

I want to apply a modified greater-than comparison across 5 different columns.

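Within each pnum (patent) there are multiple invid (inventors). For the columns dom_kn, prim_kn, pat_kn, net_kn, and age_kn, I want to compare each row's values with the values of every other row that has the same pnum. The comparison is a simple >, and if a value is indeed greater than the other, one "point" should be awarded.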

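So, for the first row (pnum == 4298390, invid == 15), you can see that the values in all five columns are 1, while the values for invid == 101 | 102 are all zero. This means that if we compare (greater than?) each value in the first row individually against each cell in the second and third rows, the total is 10 points: the value in the first row is greater in every comparison, and there are 10 comparisons. The number of comparisons is by design 5 * (n_inv - 1). The result I am looking for in row 1 is therefore 10 / 10 = 1.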

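For pnum == 4298558, the columns net_kn and age_kn have the value 1 in both rows (for invid 103 and 104), so each row should get 0.5 points in those columns (if three inventors had the value 1, each should get 0.33 points). The same goes for pnum == 4298559.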

For the next one, pnum == 4299026, all values are zero, so every comparison should yield 0 points.

So, note the distinction: there are three different binary comparisons.

1 > 0 --> assign 1
1 = 1 --> assign 1 / number of positive values in column subset
0 = 0 --> assign 0

The desired result is the data.table with an extra column, result, holding the values 1 0 0 0.2 0.8 0.2 0.8 0 0 0 0 1 0 0.8 0.2
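To make the rule concrete, here is a minimal base-R sketch of one reading of the three cases: each column carries one point per patent, and that point is split evenly among the inventors holding a 1. The helper name col_points is hypothetical (it is not part of the original post), and dividing by the number of columns is an assumption that happens to reproduce the desired values for pnum == 4298558.

```r
# Hypothetical helper (not from the original post): points earned in one
# *_kn column by each inventor of a single patent.
col_points <- function(x) {
  n_pos <- sum(x == 1)                        # inventors holding a 1
  if (n_pos == 0) return(rep(0, length(x)))   # 0 = 0 everywhere: no points
  ifelse(x == 1, 1 / n_pos, 0)                # a lone 1 gets 1; tied 1s share it
}

# Rows are invid 103 and 104 of pnum 4298558; columns are the five *_kn flags.
kn <- rbind(
  c(dom_kn = 0, prim_kn = 0, pat_kn = 0, net_kn = 1, age_kn = 1),
  c(dom_kn = 1, prim_kn = 1, pat_kn = 1, net_kn = 1, age_kn = 1)
)

points <- apply(kn, 2, col_points)   # column-wise points, same shape as kn
score  <- rowSums(points) / ncol(kn) # assumption: normalize by the 5 columns
score
```

For this patent the two scores come out as 0.2 and 0.8, matching rows 4 and 5 of the desired result.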

Any suggestions on how to compute this efficiently?

Thanks!

edd*_*ddi 6

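library(data.table)
setDT(dt)  # the dput() in the question yields a data.frame; convert it by reference

vars = grep('_kn', names(dt), value = TRUE)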

# all you need to do is simply assign the correct weight and sum the numbers up
dt[, res := 0]
for (var in vars)
  dt[, res := res + get(var) / .N, by = c('pnum', var)]

# normalize
dt[, res := res/sum(res), by = pnum]
#    id    pnum invid    fid dom_kn prim_kn pat_kn net_kn age_kn legclaims n_inv res
# 1:  1 4298390    15   CORN      1       1      1      1      1         5     3 1.0
# 2:  2 4298390   101   CORN      0       0      0      0      0         0     3 0.0
# 3:  3 4298390   102   CORN      0       0      0      0      0         0     3 0.0
# 4:  4 4298558   103 DowCor      0       0      0      1      1         2     2 0.2
# 5:  5 4298558   104 DowCor      1       1      1      1      1         5     2 0.8
# 6:  6 4298559   103 DowCor      0       0      0      1      1         2     2 0.2
# 7:  7 4298559   104 DowCor      1       1      1      1      1         5     2 0.8
# 8:  8 4299026   106  Texas      0       0      0      0      0         0     4 NaN
# 9:  9 4299026   107  Texas      0       0      0      0      0         0     4 NaN
#10: 10 4299026   108  Texas      0       0      0      0      0         0     4 NaN
#11: 11 4299026   109  Texas      0       0      0      0      0         0     4 NaN
#12: 12 4300436    87    KIM      1       1      1      1      1         5     2 1.0
#13: 13 4300436   111    KIM      0       0      0      0      0         0     2 0.0
#14: 14 4303566     2 DowCor      1       1      1      1      1         5     2 0.8
#15: 15 4303566    60 DowCor      1       0      0      1      0         2     2 0.2

Handling the NaN case above (which is arguably the correct answer) is left to the reader.
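One possible way to handle it, sketched here under the assumption that an all-zero patent should score 0 (as in the question's desired output), is to overwrite the NaN entries after normalizing. The toy table below is a reduced, made-up subset of the question's data with only two of the _kn columns, not the full dt.

```r
library(data.table)

# Toy subset: one all-zero patent (4299026) and one mixed patent (4300436).
dt <- data.table(
  pnum   = c(4299026L, 4299026L, 4300436L, 4300436L),
  dom_kn = c(0L, 0L, 1L, 0L),
  net_kn = c(0L, 0L, 1L, 0L)
)
vars <- grep("_kn", names(dt), value = TRUE)

# Same weighting and normalization as in the answer above.
dt[, res := 0]
for (var in vars)
  dt[, res := res + get(var) / .N, by = c("pnum", var)]
dt[, res := res / sum(res), by = pnum]  # 0 / 0 gives NaN for pnum 4299026

# Assumption: treat "nobody has any knowledge" as a score of 0 for everyone.
dt[is.nan(res), res := 0]
dt$res
```

If NaN should instead stay visible as "undefined", skipping the last assignment keeps the answer's original behavior.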