R data.table: average-if lookup using a join

Lau*_*_jj 0 r rmysql data.table

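All I want to do is a simple mean-if (like the AVERAGEIF function in Excel). I am using data.table for efficiency because my tables are fairly large (~1m rows).

What I am trying to look up: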

Table 1 
| individual id | date        |
-------------------------------
| 1             |  2018-01-02 |
| 1             |  2018-01-03 |
| 2             |  2018-01-02 |
| 2             |  2018-01-03 |

Table 2 
| individual id | date2       | alpha |
---------------------------------------
| 1             |  2018-01-02 |  1    |  
| 1             |  2018-01-04 |  1.5  |
| 1             |  2018-01-05 |  1    |
| 2             |  2018-01-01 |  2    |  
| 2             |  2018-01-02 |  1    |
| 2             |  2018-01-05 |  4    |

Target result

Updated table 1
| individual id | date        | mean(alpha) |
---------------------------------------------
| 1             |  2018-01-02 |  1          |
| 1             |  2018-01-03 |  1          |
| 2             |  2018-01-02 | 1.5         |
| 2             |  2018-01-03 | 1.5         |

This is simply the mean of all alpha values for the individual in table 2 whose date2 falls on or before the date in table 1. The result can be produced by the following MySQL command, but I am unable to reproduce it in R.

update table1
            set daily_alpha_avg = 
      (select avg(case when date2<date then alpha else 0 end) 
      from table2
      where table2.individual_id= table1.individual_id
      group by individual_id);

My best guess so far is:

table1[table2, on = .(individual_id, date>=date2), 
          .(x.individual_id, x.date, bb = mean(alpha)), by= .(x.date, x.individual_id)]

or

table1[, daily_alpha_avg := table2[table1, mean(alpha), on =.(individual_id, date>=date2)]]

but this isn't working. I know it's wrong; I just don't know how to fix it.

Thanks for any help

sin*_*dur 5

Using by = .EACHI you can do the following:

table2[table1, 
       on = .(`individual id`), 
       .(date = i.date, mean_alpha = mean(alpha[date2 <= i.date])),
       by = .EACHI]

#    individual id       date mean_alpha
# 1:             1 2018-01-02        1.0
# 2:             1 2018-01-03        1.0
# 3:             2 2018-01-02        1.5
# 4:             2 2018-01-03        1.5
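To make it clearer what `by = .EACHI` computes, here is a minimal sketch that rebuilds the sample data and compares the join against a plain row-by-row loop. It assumes the column is renamed to `id` (instead of `individual id`) purely for brevity, and the names `joined` and `looped` are illustrative only.

```r
library(data.table)

# Sample data from the question; `id` stands in for `individual id` (assumed rename)
table1 <- data.table(
  id   = c(1L, 1L, 2L, 2L),
  date = as.Date(c("2018-01-02", "2018-01-03", "2018-01-02", "2018-01-03"))
)
table2 <- data.table(
  id    = c(1L, 1L, 1L, 2L, 2L, 2L),
  date2 = as.Date(c("2018-01-02", "2018-01-04", "2018-01-05",
                    "2018-01-01", "2018-01-02", "2018-01-05")),
  alpha = c(1, 1.5, 1, 2, 1, 4)
)

# Join version: j is evaluated once per row of table1 (that is what .EACHI does),
# with alpha and date2 holding the table2 rows that matched on id
joined <- table2[table1,
                 on = .(id),
                 .(date = i.date, mean_alpha = mean(alpha[date2 <= i.date])),
                 by = .EACHI]

# Row-by-row version: filter table2 for each (id, date) pair and average
looped <- vapply(
  seq_len(nrow(table1)),
  function(k) {
    keep <- table2$id == table1$id[k] & table2$date2 <= table1$date[k]
    mean(table2$alpha[keep])
  },
  numeric(1)
)

print(joined)
print(all.equal(looped, joined$mean_alpha))
```

Both versions should give the same four means; the join simply avoids the explicit loop, which is where the speed-up on large tables comes from.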

Edit:

# Assign by reference as a new column
table1[, mean_alpha := table2[table1, 
                              on = .(`individual id`), 
                              mean(alpha[date2 <= i.date]),
                              by = .EACHI][["V1"]]]

Edit 2

Here is a more elegant approach suggested by Frank in the comments.

# In this solution our date columns can't be type character
table1[, date := as.Date(date)]
table2[, date2 := as.Date(date2)]

table1[, mean_alpha := table2[table1, # or equivalently .SD instead of table1
                              on = .(`individual id`, date2 <= date), 
                              mean(alpha), 
                              by = .EACHI][["V1"]]]
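One case worth checking with the non-equi join is a row of `table1` that has no `table2` record on or before its date. The sketch below uses a small made-up data set (not the data from the question) and an assumed column name `id` to show that such a row ends up with a missing value rather than being dropped.

```r
library(data.table)

# Hypothetical data: individual 1 has no date2 on or before 2018-01-01
table1 <- data.table(
  id   = c(1L, 2L, 2L),
  date = as.Date(c("2018-01-01", "2018-01-02", "2018-01-03"))
)
table2 <- data.table(
  id    = c(1L, 1L, 2L, 2L),
  date2 = as.Date(c("2018-01-02", "2018-01-04", "2018-01-01", "2018-01-02")),
  alpha = c(1, 1.5, 2, 1)
)

# Non-equi join: for each row of table1, average alpha over the table2 rows
# with the same id and date2 <= date. The unnamed j result is returned as V1.
res <- table2[table1, on = .(id, date2 <= date), mean(alpha), by = .EACHI]

# One result row per row of table1, so it can be assigned by reference
table1[, mean_alpha := res$V1]
print(table1)
```

Keeping the default `nomatch = NA` matters here: the result needs exactly one row per row of `table1` for the `:=` assignment to line up.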

Reproducible data

table1 <- fread(
  "individual id | date       
   1             |  2018-01-02
   1             |  2018-01-03
   2             |  2018-01-02
   2             |  2018-01-03", 
  sep ="|"
)
table2 <- fread(
  "individual id | date2       | alpha
   1             |  2018-01-02 |  1     
   1             |  2018-01-04 |  1.5 
   1             |  2018-01-05 |  1   
   2             |  2018-01-01 |  2     
   2             |  2018-01-02 |  1   
   2             |  2018-01-05 |  4",
  sep = "|"
)

  • If you convert the dates with `as.IDate` or `as.Date` and overwrite the columns, then ``table1[, v := table2[.SD, on = .(`individual id`, date2 <= date), mean(alpha), by = .EACHI]$V1]`` also works. Nice answer, by the way :) (3 upvotes)