data.table: look up a value and translate it

neu*_*zen 2 lookup r data.table

Like many others, I am new to R. I have a large dataset (500M+ rows) that I have loaded into a data.table, logStats, which contains data like this:

 head(logStats,15)

                   time pid     mean
 1: 2014-03-10 00:00:00 998 3.570000
 2: 2014-03-10 00:00:00  11 4.090000
 3: 2014-03-10 00:00:00 345 3.380000
 4: 2014-03-10 00:05:00 998 4.866667
 5: 2014-03-10 00:05:00  11 3.677778
 6: 2014-03-10 00:05:00 345 4.487500
 7: 2014-03-10 00:10:00 345 4.833333
 8: 2014-03-10 00:10:00 998 4.333333
 9: 2014-03-10 00:10:00  11 6.977778
10: 2014-03-10 00:15:00 345 3.900000
11: 2014-03-10 00:15:00 998 3.200000
12: 2014-03-10 00:15:00  11 6.030000
13: 2014-03-10 00:20:00 998 4.550000
14: 2014-03-10 00:20:00  11 4.030000
15: 2014-03-10 00:20:00 345 6.060000

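There is also a second, very small data.table (360 rows) with two columns that decodes the 'pid' values into friendlier names. The 'pid' values can be numeric or character.

For example: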

pidLookupTable<-data.table(pid=c(998,11,345),pidName=c("Apple","Bannana","Cinnamon"))

which produces:

   pid  pidName
1: 998    Apple
2:  11  Bannana
3: 345 Cinnamon

I would like a single expression that adds a column to the data.table logStats holding the pidName for each row's pid.

I should get something like:

                   time pid     mean pidNames
 1: 2014-03-10 00:00:00 998 3.570000    Apple
 2: 2014-03-10 00:00:00  11 4.090000   Banana
 3: 2014-03-10 00:00:00 345 3.380000 Cinnamon
 4: 2014-03-10 00:05:00 998 4.866667    Apple
 5: 2014-03-10 00:05:00  11 3.677778   Banana
 6: 2014-03-10 00:05:00 345 4.487500 Cinnamon
 7: 2014-03-10 00:10:00 345 4.833333 Cinnamon
 8: 2014-03-10 00:10:00 998 4.333333    Apple
 9: 2014-03-10 00:10:00  11 6.977778   Banana
10: 2014-03-10 00:15:00 345 3.900000 Cinnamon
11: 2014-03-10 00:15:00 998 3.200000    Apple
12: 2014-03-10 00:15:00  11 6.030000   Banana
13: 2014-03-10 00:20:00 998 4.550000    Apple
14: 2014-03-10 00:20:00  11 4.030000   Banana
15: 2014-03-10 00:20:00 345 6.060000 Cinnamon

I wrote a function:

pidNameLookup <- function(x) {
  # The lookup table's name column is called pidName in the example above
  return(pidLookupTable[pidLookupTable$pid == x, pidName])
}

and then ran:

logStats[,pidName:=pidNameLookup(pid)]

But this only translates the first 3 values and puts NA for the rest:

   logStats[1:1000]
               date     time pid value           timestamp mean  pidName
      1: 10-03-2014 00:00:12 998   5.5 2014-03-10 00:00:12 3.57    Apple
      2: 10-03-2014 00:00:17  11   2.1 2014-03-10 00:00:17 4.09  Bannana
      3: 10-03-2014 00:00:22 345   5.7 2014-03-10 00:00:22 3.38 Cinnamon
      4: 10-03-2014 00:00:47 998   1.0 2014-03-10 00:00:47 3.57       NA
      5: 10-03-2014 00:00:55  11   0.3 2014-03-10 00:00:55 4.09       NA
    ---
    996: 10-03-2014 02:49:37 345   0.7 2014-03-10 02:49:37 5.30       NA
    997: 10-03-2014 02:50:01 998   9.9 2014-03-10 02:50:01 5.30       NA
    998: 10-03-2014 02:50:08  11   7.0 2014-03-10 02:50:08 7.00       NA
    999: 10-03-2014 02:50:18 345   2.4 2014-03-10 02:50:18 2.40       NA
   1000: 10-03-2014 02:50:48 998   0.7 2014-03-10 02:50:48 5.30       NA

and gives me a warning message:

Warning message:
In pidLookupTable$pid == x 
  longer object length is not a multiple of shorter object length

The warning message and the incorrect result tell me I am doing something completely wrong.

Help!! This is driving me mental.

Sco*_*hie 7

I suggest you look at the data.table introductory vignette (vignette("datatable-intro")), since this is exactly the kind of thing data.table was built for.
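
As for the warning in the question, a likely explanation is that `==` compares two vectors position by position, recycling the shorter one, rather than looking each value up. The sketch below shows this on a small made-up `pid` vector (illustrative values, not the real data) and contrasts it with base R's `match()`, which returns the lookup row for every element. The rewritten `pidNameLookup` is only one possible vectorised version of the question's function.

```r
library(data.table)

# Lookup table from the question; the pid vector is illustrative only.
pidLookupTable <- data.table(pid = c(998, 11, 345),
                             pidName = c("Apple", "Bannana", "Cinnamon"))
pid <- c(998, 11, 345, 11, 998, 345, 11)

# `==` recycles the 3-element lookup column along the 7-element pid vector,
# so element k of pid is compared with element ((k - 1) %% 3) + 1 of the
# lookup column. R warns because 7 is not a multiple of 3; the warning is
# suppressed here so the result can be inspected.
hits <- suppressWarnings(pidLookupTable$pid == pid)

# match() gives, for every pid, the index of the lookup row that holds it
# (NA when there is no such row), which is what a lookup needs.
pidNameLookup <- function(x) {
  pidLookupTable$pidName[match(x, pidLookupTable$pid)]
}
looked_up <- pidNameLookup(pid)
```

With a function written this way, `logStats[, pidName := pidNameLookup(pid)]` would assign one name per row, although a join is the more idiomatic data.table route.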

The following keyed join will give you what you want, and should be much, much faster:

setkey(logStats, "pid")
setkey(pidLookupTable, "pid")
logStats[pidLookupTable]
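
If the goal is specifically to add the column to logStats in place, as the question asks, an update join is another option. The sketch below uses a few rows copied from the question as toy data and assumes a data.table version that supports the `on=` argument; the `i.` prefix refers to columns of the table passed as `i` (here pidLookupTable).

```r
library(data.table)

# Toy data: the first five rows of the question's example.
logStats <- data.table(
  time = c("2014-03-10 00:00:00", "2014-03-10 00:00:00", "2014-03-10 00:00:00",
           "2014-03-10 00:05:00", "2014-03-10 00:05:00"),
  pid  = c(998, 11, 345, 998, 11),
  mean = c(3.57, 4.09, 3.38, 4.866667, 3.677778)
)
pidLookupTable <- data.table(pid = c(998, 11, 345),
                             pidName = c("Apple", "Bannana", "Cinnamon"))

# Update join: match rows of logStats to pidLookupTable on pid and copy the
# matching pidName into a new column of logStats, by reference.
logStats[pidLookupTable, pidName := i.pidName, on = "pid"]

print(logStats)
```

Unlike the keyed join, this does not call setkey(), so logStats keeps its existing row order; any pid missing from the lookup table ends up with NA in pidName.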