neu*_*zen 2 lookup r data.table
Like many people, I'm new to R. I have a large dataset (500M+ rows) that I've loaded into a data.table, logStats, which contains data like this:
head(logStats,15)
time pid mean
1: 2014-03-10 00:00:00 998 3.570000
2: 2014-03-10 00:00:00 11 4.090000
3: 2014-03-10 00:00:00 345 3.380000
4: 2014-03-10 00:05:00 998 4.866667
5: 2014-03-10 00:05:00 11 3.677778
6: 2014-03-10 00:05:00 345 4.487500
7: 2014-03-10 00:10:00 345 4.833333
8: 2014-03-10 00:10:00 998 4.333333
9: 2014-03-10 00:10:00 11 6.977778
10: 2014-03-10 00:15:00 345 3.900000
11: 2014-03-10 00:15:00 998 3.200000
12: 2014-03-10 00:15:00 11 6.030000
13: 2014-03-10 00:20:00 998 4.550000
14: 2014-03-10 00:20:00 11 4.030000
15: 2014-03-10 00:20:00 345 6.060000
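There is also a second, very small data.table (360 rows) with two columns that decodes the 'pid' values into friendlier names. The 'pid' values can be numeric or character.
For example: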
pidLookupTable<-data.table(pid=c(998,11,345),pidName=c("Apple","Banana","Cinnamon"))
which produces:
pid pidName
1: 998 Apple
2: 11 Banana
3: 345 Cinnamon
I'd like a single expression that adds a column to the logStats data.table holding the pidName for each row's pid.
I should get something like:
time pid mean pidName
1: 2014-03-10 00:00:00 998 3.570000 Apple
2: 2014-03-10 00:00:00 11 4.090000 Banana
3: 2014-03-10 00:00:00 345 3.380000 Cinnamon
4: 2014-03-10 00:05:00 998 4.866667 Apple
5: 2014-03-10 00:05:00 11 3.677778 Banana
6: 2014-03-10 00:05:00 345 4.487500 Cinnamon
7: 2014-03-10 00:10:00 345 4.833333 Cinnamon
8: 2014-03-10 00:10:00 998 4.333333 Apple
9: 2014-03-10 00:10:00 11 6.977778 Banana
10: 2014-03-10 00:15:00 345 3.900000 Cinnamon
11: 2014-03-10 00:15:00 998 3.200000 Apple
12: 2014-03-10 00:15:00 11 6.030000 Banana
13: 2014-03-10 00:20:00 998 4.550000 Apple
14: 2014-03-10 00:20:00 11 4.030000 Banana
15: 2014-03-10 00:20:00 345 6.060000 Cinnamon
I wrote a function:
pidNameLookup<-function(x) {
  return(pidLookupTable[pidLookupTable$pid==x,pidName])
}
and then ran:
logStats[,pidName:=pidNameLookup(pid)]
But this only converts the first 3 values and puts NA for all the rest:
logStats[1:1000]
date time pid value timestamp mean pidName
1: 10-03-2014 00:00:12 998 5.5 2014-03-10 00:00:12 3.57 Apple
2: 10-03-2014 00:00:17 11 2.1 2014-03-10 00:00:17 4.09 Banana
3: 10-03-2014 00:00:22 345 5.7 2014-03-10 00:00:22 3.38 Cinnamon
4: 10-03-2014 00:00:47 998 1.0 2014-03-10 00:00:47 3.57 NA
5: 10-03-2014 00:00:55 11 0.3 2014-03-10 00:00:55 4.09 NA
---
996: 10-03-2014 02:49:37 345 0.7 2014-03-10 02:49:37 5.30 NA
997: 10-03-2014 02:50:01 998 9.9 2014-03-10 02:50:01 5.30 NA
998: 10-03-2014 02:50:08 11 7.0 2014-03-10 02:50:08 7.00 NA
999: 10-03-2014 02:50:18 345 2.4 2014-03-10 02:50:18 2.40 NA
1000: 10-03-2014 02:50:48 998 0.7 2014-03-10 02:50:48 5.30 NA
and gives me a warning message:
Warning message:
In pidLookupTable$pid == x
longer object length is not a multiple of shorter object length
The warning message and the incorrect results tell me I'm doing something completely wrong.
Help!! This is driving me mental.
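I suggest you have a look at the data.table introductory vignette (vignette("datatable-intro")), as this is exactly the kind of thing data.table is explicitly built for.
This will give you what you want, and should be much, much faster: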
setkey(logStats, "pid")
setkey(pidLookupTable, "pid")
logStats[pidLookupTable]
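Note that the keyed join above returns a new, joined table rather than modifying logStats, and that setkey() physically reorders logStats by pid. If you would rather add the column to logStats in place and keep its existing row order, an update join is one option. The following is only a minimal sketch: it assumes a data.table version that supports the on= argument (1.9.6 or later), and it uses a five-row stand-in built from the question's sample instead of the real 500M-row table.

```r
library(data.table)

# Small stand-in for the real logStats (values copied from the question's sample).
logStats <- data.table(
  time = as.POSIXct(c("2014-03-10 00:00:00", "2014-03-10 00:00:00",
                      "2014-03-10 00:00:00", "2014-03-10 00:05:00",
                      "2014-03-10 00:05:00"), tz = "UTC"),
  pid  = c(998, 11, 345, 998, 11),
  mean = c(3.57, 4.09, 3.38, 4.866667, 3.677778)
)

pidLookupTable <- data.table(
  pid     = c(998, 11, 345),
  pidName = c("Apple", "Banana", "Cinnamon")
)

# Update join: rows of logStats whose pid matches a row of pidLookupTable
# get that row's pidName assigned by reference. i.pidName refers to the
# column coming from the table in the i position (pidLookupTable).
logStats[pidLookupTable, on = "pid", pidName := i.pidName]

print(logStats)
```

Rows whose pid has no entry in the lookup table end up with NA in pidName. The two pid columns need to be of compatible types for the join, so if one is character and the other numeric, convert one of them first (for example with as.character()).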
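As for why the original function misbehaved: inside logStats[, pidName := pidNameLookup(pid)], the argument x is the entire pid column, not a single value. The expression pidLookupTable$pid == x therefore compares two vectors position by position, recycling the shorter one, which is where the "longer object length is not a multiple of shorter object length" warning comes from. It never looks up each pid individually. The base R sketch below illustrates the difference with made-up vectors (the names lookup_pid, lookup_name and x are just for this example) and shows match() as a vectorised lookup.

```r
lookup_pid  <- c(998, 11, 345)
lookup_name <- c("Apple", "Banana", "Cinnamon")
x <- c(998, 11, 345, 11, 998)   # pids to decode

# `==` pairs elements up by position and recycles the shorter vector, so it
# answers "is x[i] equal to the i-th (recycled) lookup pid?", not "where is
# x[i] in the lookup?". Since 5 is not a multiple of 3, R also warns here;
# the warning is suppressed only to keep the example quiet.
positional <- suppressWarnings(lookup_pid == x)
print(positional)

# match() returns, for each element of x, its position in lookup_pid,
# which can then be used to index the names.
idx <- match(x, lookup_pid)
decoded <- lookup_name[idx]
print(decoded)
```

The same idea can be written against the tables from the question as logStats[, pidName := pidLookupTable$pidName[match(pid, pidLookupTable$pid)]], although a join is probably the more idiomatic data.table route.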