data.table join (error in vecseq): is a key required on both X and i?

use*_*882 8 join r data.table

I'm new to R's data.table, and I find it very useful and fast. I'm trying to join 2 data tables:

> TotFreq
        Legacy_Store_Number WeekDay       Date     Item_Key        Distr NoSellingDays meanUnits ItemType
     1:              113802       1 2013-03-24 000000000120 2.428985e-04             0      8.00       FM
     2:              113802       1 2013-03-24 000000000126 1.104030e-03             0     47.50       FM
     3:              113802       1 2013-03-24 000000000170 1.126004e-03             0     48.75       FM
     4:              113802       1 2013-03-24 000000000180 5.143034e-04             0     19.00       FM
     5:              113802       1 2013-03-24 000000000260 3.854306e-04             0     12.25       FM
160167:              113802       7 2013-03-23 978125002327 5.902655e-07            27      1.00       SM
160168:              113802       7 2013-03-23 978141970584 1.770796e-06            25      1.00       SM
160169:              113802       7 2013-03-23 978145300697 1.180531e-06            26      1.00       SM
160170:              113802       7 2013-03-23 978145552558 5.902655e-07            27      1.00       SM
160171:              113802       7 2013-03-23 978160139536 5.902655e-07            27      1.00       SM

> Count_SM_FM
    Legacy_Store_Number WeekDay ItemType ObjItems
 1:              113802       1       SM    12305
 2:              113802       1       FM     1942
 3:              113802       2       SM    11014
 4:              113802       2       FM     1398
 5:              113802       3       SM    10154
 6:              113802       3       FM     1117
 7:              113802       4       SM    10414
 8:              113802       4       FM     1167
 9:              113802       5       SM    10258
10:              113802       5       FM     1200
11:              113802       6       SM    11116
12:              113802       6       FM     1575
13:              113802       7       SM    13098
14:              113802       7       FM     2326
> setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)
> 
> ResultJoin <- TotFreq[Count_SM_FM]
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x),  : 
  Join results in 320342 rows; more than 160171 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

But I don't have any duplicate keys in i!

Using:

> ResultJoin <- TotFreq[Count_SM_FM,allow.cartesian=T]
> 
> ResultJoin
        Legacy_Store_Number WeekDay       Date     Item_Key        Distr NoSellingDays meanUnits ItemType ItemType.1 ObjItems
     1:              113802       1 2013-03-24 000000000120 2.428985e-04             0      8.00       FM         SM    12305
     2:              113802       1 2013-03-24 000000000126 1.104030e-03             0     47.50       FM         SM    12305
     3:              113802       1 2013-03-24 000000000170 1.126004e-03             0     48.75       FM         SM    12305
     4:              113802       1 2013-03-24 000000000180 5.143034e-04             0     19.00       FM         SM    12305
     5:              113802       1 2013-03-24 000000000260 3.854306e-04             0     12.25       FM         SM    12305
    ---                                                                                                                      
320338:              113802       7 2013-03-23 978125002327 5.902655e-07            27      1.00       SM         FM     2326
320339:              113802       7 2013-03-23 978141970584 1.770796e-06            25      1.00       SM         FM     2326
320340:              113802       7 2013-03-23 978145300697 1.180531e-06            26      1.00       SM         FM     2326
320341:              113802       7 2013-03-23 978145552558 5.902655e-07            27      1.00       SM         FM     2326
320342:              113802       7 2013-03-23 978160139536 5.902655e-07            27      1.00       SM         FM     2326

In fact, I get twice as many records as in the original TotFreq table. If I also set a key on Count_SM_FM, the join works:

> setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)
> setkey(Count_SM_FM,Legacy_Store_Number,WeekDay,ItemType)
> ResultJoin <- TotFreq[Count_SM_FM]
> 
> ResultJoin
        Legacy_Store_Number WeekDay ItemType       Date     Item_Key        Distr NoSellingDays meanUnits ObjItems
     1:              113802       1       FM 2013-03-24 000000000120 2.428985e-04             0      8.00     1942
     2:              113802       1       FM 2013-03-24 000000000126 1.104030e-03             0     47.50     1942
     3:              113802       1       FM 2013-03-24 000000000170 1.126004e-03             0     48.75     1942
     4:              113802       1       FM 2013-03-24 000000000180 5.143034e-04             0     19.00     1942
     5:              113802       1       FM 2013-03-24 000000000260 3.854306e-04             0     12.25     1942
    ---                                                                                                           
160167:              113802       7       SM 2013-03-23 978125002327 5.902655e-07            27      1.00    13098
160168:              113802       7       SM 2013-03-23 978141970584 1.770796e-06            25      1.00    13098
160169:              113802       7       SM 2013-03-23 978145300697 1.180531e-06            26      1.00    13098
160170:              113802       7       SM 2013-03-23 978145552558 5.902655e-07            27      1.00    13098
160171:              113802       7       SM 2013-03-23 978160139536 5.902655e-07            27      1.00    13098

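I tried to verify this with an example, thinking the problem might be that the key variables are not the first columns of TotFreq, or that Count_SM_FM is not sorted, but I could not reproduce the error: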

> daysType <- data.table(
+     key1=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1),
+     key2=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7),
+     key3=c("b","a","a","b","a","b","a","b","a","b","a","b","a","b"),
+     var1=c(2,4,6,8,4,5,7,3,7,9,6,3,5,6)
+ )        
> 
> 
> detailData <- data.table(
+     key1=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
+     key2=c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,6,6,6,6,6,6,7,7,7,7,7,7,7,7),
+     var2=c(10,11,12,13,15,16,17,10,11,12,13,14,15,16,10,11,12,15,16,17,10,11,12,13,14,15,16,17,10,11,13,14,15,16,17,10,11,12,13,14,15,10,11,12,13,14,15,16,17),
+     var3=c(1,2,4,6,6,7,3,6,8,9,3,5,7,8,6,7,8,6,7,2,4,6,7,8,2,3,5,7,4,7,8,3,6,4,2,5,7,3,6,7,3,4,2,4,6,4,7,2,9),
+     key3=c("a","a","a","a","b","b","b","a","a","a","a","b","b","b","a","a","a","b","b","b","a","a","a","a","b","b","b","b","a","a","a","b","b","b","b","a","a","a","a","b","b","a","a","a","a","b","b","b","b")
+ )        
> 
> setkey(detailData,key1,key2,key3)
> JoinResult <- detailData[daysType]

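This problem is different from the one in the question "Join of two data.tables fails", because there allow.cartesian solved the problem.

What is the problem here? Why does adding a key to Count_SM_FM solve it?

Thanks!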

edd*_*ddi 4

Update, October 2014: Arun fixed it in v1.9.5:

allow.cartesian is now ignored when i has no duplicates (#742, #508). Thanks to @nigmastar, @user3645882 and others for the reports.
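To see whether an installed copy of data.table is at least as new as the version named above, the installed version can be compared against it. This is only a sketch: the threshold 1.9.5 is taken from the update note, and the variable names are illustrative.

```r
library(data.table)

# Version of data.table currently installed
installed <- packageVersion("data.table")

# TRUE when the installed version is 1.9.5 or later
# (threshold assumed from the update note above)
has_fix <- installed >= "1.9.5"
print(installed)
print(has_fix)
```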



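Previous answer...

First, let's address the allow.cartesian part. The error message should probably be changed to point out that even if there are no duplicates in i, duplicates on the left-hand side can still produce a result larger than either table. Here is a simple example: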

dt1 = data.table(a = c(1,1), b = 1:2, key = 'a')
dt2 = data.table(a = c(1,2), c = 3:4)

dt1[dt2] # this gives an error, because join results in 3 rows, as seen below

dt1[dt2, allow.cartesian = TRUE]
#   a  b c
#1: 1  1 3
#2: 1  2 3
#3: 2 NA 4
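One way to anticipate the size of such a join before running it is to count the rows per key value on the left-hand side and add them up over the rows of i. The following is a minimal sketch using the same dt1 and dt2; the helper names x_counts, per_i and expected_rows are made up for illustration.

```r
library(data.table)

dt1 <- data.table(a = c(1, 1), b = 1:2, key = "a")
dt2 <- data.table(a = c(1, 2), c = 3:4)

# Number of rows per key value on the left-hand side (x)
x_counts <- dt1[, .(n_x = .N), by = a]

# Attach those counts to each row of i; key values absent from x get NA
per_i <- merge(dt2, x_counts, by = "a", all.x = TRUE)

# An unmatched row of i still produces one NA-filled row in x[i]
per_i[is.na(n_x), n_x := 1L]

# Total number of rows the join dt1[dt2] will produce
expected_rows <- sum(per_i$n_x)
print(expected_rows)
```

Here the key value 1 matches two rows of dt1 and the key value 2 matches none, so the estimate agrees with the three rows shown in the output above.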

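Now, as far as setting the key goes: no, you don't need to set the key of i; it will just assume that the first few columns are the key. Looking at your first join result, one can see that it has not joined on ItemType and that you're using an older data.table version (I'm on 1.9.3). So my guess is that either you didn't actually set the key correctly and didn't include ItemType in it, or there was a bug in the older version that has since been fixed.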
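Both guesses can be checked or sidestepped in code. The sketch below uses small made-up tables (x, i and their columns are hypothetical, not the asker's data). It confirms the key with key(), then names the join columns with the on= argument, which as far as I know was added in data.table 1.9.6, so that nothing depends on the column order of an unkeyed i.

```r
library(data.table)

# Hypothetical detail table, keyed on three columns
x <- data.table(store = 1L,
                day   = c(1L, 1L, 2L),
                type  = c("FM", "SM", "FM"),
                units = c(8, 1, 5))
setkey(x, store, day, type)

# Confirm that the key really contains all three columns
stopifnot(identical(key(x), c("store", "day", "type")))

# Unkeyed lookup table whose columns are in a different order than x's key;
# a positional join would pair x's 'day' with i's 'type'
i <- data.table(store = 1L,
                type  = c("FM", "SM"),
                day   = c(1L, 1L),
                obj   = c(1942L, 12305L))

# Name the join columns so the join does not rely on column position in i
res <- x[i, on = c("store", "day", "type")]
print(res)
```

Naming the columns is a design choice for readability as much as correctness: the join stays right even if someone later reorders the columns of either table.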