I'm new to R's data.table, and I find it useful and fast. I'm trying to join two data tables:
> TotFreq
Legacy_Store_Number WeekDay Date Item_Key Distr NoSellingDays meanUnits ItemType
1: 113802 1 2013-03-24 000000000120 2.428985e-04 0 8.00 FM
2: 113802 1 2013-03-24 000000000126 1.104030e-03 0 47.50 FM
3: 113802 1 2013-03-24 000000000170 1.126004e-03 0 48.75 FM
4: 113802 1 2013-03-24 000000000180 5.143034e-04 0 19.00 FM
5: 113802 1 2013-03-24 000000000260 3.854306e-04 0 12.25 FM
160167: 113802 7 2013-03-23 978125002327 5.902655e-07 27 1.00 SM
160168: 113802 7 2013-03-23 978141970584 1.770796e-06 25 1.00 SM
160169: 113802 7 2013-03-23 978145300697 1.180531e-06 26 1.00 SM
160170: 113802 7 2013-03-23 978145552558 5.902655e-07 27 1.00 SM
160171: 113802 7 2013-03-23 978160139536 5.902655e-07 27 1.00 SM
> Count_SM_FM
Legacy_Store_Number WeekDay ItemType ObjItems
1: 113802 1 SM 12305
2: 113802 1 FM 1942
3: 113802 2 SM 11014
4: 113802 2 FM 1398
5: 113802 3 SM 10154
6: 113802 3 FM 1117
7: 113802 4 SM 10414
8: 113802 4 FM 1167
9: 113802 5 SM 10258
10: 113802 5 FM 1200
11: 113802 6 SM 11116
12: 113802 6 FM 1575
13: 113802 7 SM 13098
14: 113802 7 FM 2326
> setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)
>
> ResultJoin <- TotFreq[Count_SM_FM]
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
Join results in 320342 rows; more than 160171 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
But I have no duplicate keys in i!
Using allow.cartesian:
> ResultJoin <- TotFreq[Count_SM_FM,allow.cartesian=T]
>
> ResultJoin
Legacy_Store_Number WeekDay Date Item_Key Distr NoSellingDays meanUnits ItemType ItemType.1 ObjItems
1: 113802 1 2013-03-24 000000000120 2.428985e-04 0 8.00 FM SM 12305
2: 113802 1 2013-03-24 000000000126 1.104030e-03 0 47.50 FM SM 12305
3: 113802 1 2013-03-24 000000000170 1.126004e-03 0 48.75 FM SM 12305
4: 113802 1 2013-03-24 000000000180 5.143034e-04 0 19.00 FM SM 12305
5: 113802 1 2013-03-24 000000000260 3.854306e-04 0 12.25 FM SM 12305
---
320338: 113802 7 2013-03-23 978125002327 5.902655e-07 27 1.00 SM FM 2326
320339: 113802 7 2013-03-23 978141970584 1.770796e-06 25 1.00 SM FM 2326
320340: 113802 7 2013-03-23 978145300697 1.180531e-06 26 1.00 SM FM 2326
320341: 113802 7 2013-03-23 978145552558 5.902655e-07 27 1.00 SM FM 2326
320342: 113802 7 2013-03-23 978160139536 5.902655e-07 27 1.00 SM FM 2326
In effect, I end up with twice as many records as in the original TotFreq table. If I also set a key on Count_SM_FM, the join works:
> setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)
> setkey(Count_SM_FM,Legacy_Store_Number,WeekDay,ItemType)
> ResultJoin <- TotFreq[Count_SM_FM]
>
> ResultJoin
Legacy_Store_Number WeekDay ItemType Date Item_Key Distr NoSellingDays meanUnits ObjItems
1: 113802 1 FM 2013-03-24 000000000120 2.428985e-04 0 8.00 1942
2: 113802 1 FM 2013-03-24 000000000126 1.104030e-03 0 47.50 1942
3: 113802 1 FM 2013-03-24 000000000170 1.126004e-03 0 48.75 1942
4: 113802 1 FM 2013-03-24 000000000180 5.143034e-04 0 19.00 1942
5: 113802 1 FM 2013-03-24 000000000260 3.854306e-04 0 12.25 1942
---
160167: 113802 7 SM 2013-03-23 978125002327 5.902655e-07 27 1.00 13098
160168: 113802 7 SM 2013-03-23 978141970584 1.770796e-06 25 1.00 13098
160169: 113802 7 SM 2013-03-23 978145300697 1.180531e-06 26 1.00 13098
160170: 113802 7 SM 2013-03-23 978145552558 5.902655e-07 27 1.00 13098
160171: 113802 7 SM 2013-03-23 978160139536 5.902655e-07 27 1.00 13098
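I tried to verify this with an example, thinking the problem might be that the key variables are not the first columns of TotFreq, or that Count_SM_FM is not sorted, but I could not reproduce the error: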
> daysType <- data.table(
+ key1=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1),
+ key2=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7),
+ key3=c("b","a","a","b","a","b","a","b","a","b","a","b","a","b"),
+ var1=c(2,4,6,8,4,5,7,3,7,9,6,3,5,6)
+ )
>
>
> detailData <- data.table(
+ key1=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
+ key2=c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,6,6,6,6,6,6,7,7,7,7,7,7,7,7),
+ var2=c(10,11,12,13,15,16,17,10,11,12,13,14,15,16,10,11,12,15,16,17,10,11,12,13,14,15,16,17,10,11,13,14,15,16,17,10,11,12,13,14,15,10,11,12,13,14,15,16,17),
+ var3=c(1,2,4,6,6,7,3,6,8,9,3,5,7,8,6,7,8,6,7,2,4,6,7,8,2,3,5,7,4,7,8,3,6,4,2,5,7,3,6,7,3,4,2,4,6,4,7,2,9),
+ key3=c("a","a","a","a","b","b","b","a","a","a","a","b","b","b","a","a","a","b","b","b","a","a","a","a","b","b","b","b","a","a","a","b","b","b","b","a","a","a","a","b","b","a","a","a","a","b","b","b","b")
+ )
>
> setkey(detailData,key1,key2,key3)
> JoinResult <- detailData[daysType]
This problem is different from the one in the question "Join of two data.tables fails", because there allow.cartesian solved the problem.
What is wrong here? Why does adding a key to Count_SM_FM solve it?
Thanks!
Update, October 2014: Arun has fixed this in v1.9.5:
allow.cartesian is now ignored when i has no duplicates, #742 and #508. Thanks to @nigmastar, @user3645882 and others for reporting.
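The condition mentioned in the update ("i has no duplicates") can be checked by hand before joining. The following is only a minimal sketch on toy tables (x_dt and i_dt are hypothetical names, not from the question), and it is not the package's internal implementation:

```r
library(data.table)

# Toy tables: x_dt is keyed on `a` and has a repeated key value,
# i_dt has one row per value of `a`.
x_dt <- data.table(a = c(1, 1), b = 1:2, key = "a")
i_dt <- data.table(a = c(1, 2), c = 3:4)

# anyDuplicated() returns 0 when no row repeats on the `by` columns,
# otherwise the index of the first duplicated row.
dup_in_i <- anyDuplicated(i_dt, by = "a")
dup_in_x <- anyDuplicated(x_dt, by = "a")
```

Here dup_in_i is 0 while dup_in_x is not, so any growth in the join result comes from the left-hand table rather than from i.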
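Previous answer...
First, let's address the allow.cartesian part. The error message should probably be changed to point out that even when there are no duplicates in i, you can still get a large data.table if there are duplicates on the left-hand side. Here is a simple example: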
library(data.table)
dt1 = data.table(a = c(1,1), b = 1:2, key = 'a')
dt2 = data.table(a = c(1,2), c = 3:4)
dt1[dt2] # before the v1.9.5 fix this gives an error: the join results in 3 rows, more than max(nrow(dt1), nrow(dt2)) = 2, as seen below
dt1[dt2, allow.cartesian = TRUE]
# a b c
#1: 1 1 3
#2: 1 2 3
#3: 2 NA 4
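To see where the 3 rows come from, the size of the join result can be estimated before running it. This is a rough sketch under the assumption of a single join column a and the default nomatch = NA behaviour (unmatched rows of i are kept as one NA row); counts, matches and expected_rows are illustrative names:

```r
library(data.table)

dt1 <- data.table(a = c(1, 1), b = 1:2, key = "a")
dt2 <- data.table(a = c(1, 2), c = 3:4)

# Number of rows in dt1 for each key value.
counts <- dt1[, .N, by = a]

# For every row of dt2, how many rows of dt1 it will match (NA = no match).
matches <- counts$N[match(dt2$a, counts$a)]

# An unmatched row of i still contributes one (NA-filled) row to the result.
expected_rows <- sum(ifelse(is.na(matches), 1L, matches))

actual_rows <- nrow(dt1[dt2, allow.cartesian = TRUE])
```

If expected_rows is larger than max(nrow(dt1), nrow(dt2)), older versions raise the error quoted in the question even though dt2 itself has no duplicates.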
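As for setting keys: no, you don't need to set the key of i; it will simply assume that the first few columns are the key. Looking at your first join result, you can see that it has not joined on ItemType, and that you are using an old data.table version (I'm on 1.9.3). So my guess is that either you didn't actually set the key correctly and it doesn't include ItemType, or there was some bug in the old version that has since been fixed.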
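One way to rule out key problems is to name the join columns explicitly. The sketch below assumes a newer data.table release that supports the on= argument (1.9.6 or later, so not the version used in the question) and uses small made-up tables (detail and type_counts) that only mimic the shape of TotFreq and Count_SM_FM:

```r
library(data.table)

# Hypothetical stand-ins for TotFreq and Count_SM_FM.
detail <- data.table(
  store = 1L,
  day   = c(1L, 1L, 1L, 2L, 2L),
  type  = c("FM", "FM", "SM", "FM", "SM"),
  units = c(8, 47.5, 1, 19, 1)
)
type_counts <- data.table(
  store = 1L,
  day   = c(1L, 1L, 2L, 2L),
  type  = c("SM", "FM", "SM", "FM"),
  n     = c(12305L, 1942L, 11014L, 1398L)
)

# Join on all three columns by name; no key is needed on either table
# and the column order in either table does not matter.
res <- detail[type_counts, on = c("store", "day", "type")]
```

Because type is part of the join, the result has one row per row of detail and no extra type.1 column, which is the symptom visible in the question's first join result.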