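I am trying to understand what is going on with my calculation of the Canberra distance. I wrote my own simple canberra.distance function, but its results do not agree with those of the dist function. I added the na.rm = T option to my function so that the sum can still be computed when a denominator is zero. From ?dist I understand that they use a similar approach: Terms with zero numerator and denominator are omitted from the sum and treated as if the values were missing.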
canberra.distance <- function(a, b) {
  sum(abs(a - b) / (abs(a) + abs(b)), na.rm = TRUE)
}

a <- c(0, 1, 0, 0, 1)
b <- c(1, 0, 1, 0, 1)
canberra.distance(a, b)
# 3 (the result that I expected)
dist(rbind(a, b), method = "canberra")
# 3.75

a <- c(0, 1, 0, 0)
b <- c(1, 0, 1, 0)
canberra.distance(a, b)
# 3 (the result that I expected)
dist(rbind(a, b), method = "canberra")
# 4

a <- c(0, 1, 0)
b <- c(1, 0, 1)
canberra.distance(a, b)
# 3
dist(rbind(a, b), method = "canberra")
# 3 (now the results are the same)
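There seems to be a problem with the 0-0 and 1-1 pairs. In the first case (0-0), both the numerator and the denominator are zero, so the pair should be omitted. In the second case (1-1), the numerator is 0 but the denominator is not, so the term is 0 and the sum should not change.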
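To make the two cases concrete, the individual terms of the sum can be inspected before they are added up. This is only a small sketch using the first pair of vectors from above; `terms` is a name introduced here for illustration.

```r
a <- c(0, 1, 0, 0, 1)
b <- c(1, 0, 1, 0, 1)

# Per-position terms of the sum, before any NA/NaN handling.
# The values are 1, 1, 1, NaN, 0: the 0-0 pair gives 0/0 = NaN,
# the 1-1 pair gives 0/2 = 0.
terms <- abs(a - b) / (abs(a) + abs(b))
print(terms)

# na.rm = TRUE drops NaN as well as NA, so the sum is 3
print(sum(terms, na.rm = TRUE))
```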
What am I missing here?
Edit:
To match R's definition, the canberra.distance function can be modified as follows:
canberra.distance <- function(a, b) {
  sum(abs(a - b) / abs(a + b), na.rm = TRUE)
}
However, the results are the same as before.
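That the results do not change is expected for these inputs: with non-negative values the two denominators are identical, and they only differ when the signs are mixed. A quick check, where `x` and `y` are hypothetical mixed-sign vectors added purely for illustration:

```r
a <- c(0, 1, 0, 0, 1)
b <- c(1, 0, 1, 0, 1)

# With non-negative values both denominators coincide
same_denominator <- identical(abs(a) + abs(b), abs(a + b))
print(same_denominator)   # TRUE

# Hypothetical mixed-sign vectors, where the two forms differ
x <- c(1, -2)
y <- c(-3, 1)
print(abs(x) + abs(y))    # values 4 and 3
print(abs(x + y))         # values 2 and 1
```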
This may shed some light on the difference. As far as I can tell, this is the actual code used to compute the distance:
static double R_canberra(double *x, int nr, int nc, int i1, int i2)
{
    double dev, dist, sum, diff;
    int count, j;

    count = 0;
    dist = 0;
    for(j = 0 ; j < nc ; j++) {
        if(both_non_NA(x[i1], x[i2])) {
            sum = fabs(x[i1] + x[i2]);
            diff = fabs(x[i1] - x[i2]);
            if (sum > DBL_MIN || diff > DBL_MIN) {
                dev = diff/sum;
                if(!ISNAN(dev) ||
                   (!R_FINITE(diff) && diff == sum &&
                    /* use Inf = lim x -> oo */ (int) (dev = 1.))) {
                    dist += dev;
                    count++;
                }
            }
        }
        i1 += nr;
        i2 += nr;
    }
    if(count == 0) return NA_REAL;
    if(count != nc) dist /= ((double)count/nc);
    return dist;
}
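I think the culprit is the rescaling step at the end of the function:
if(count != nc) dist /= ((double)count/nc);
Whenever a term is skipped (such as a 0-0 pair), count is smaller than nc and the sum is scaled up by nc/count. In the first example this gives 3 / (4/5) = 3.75, and in the second 3 / (3/4) = 4. In the third example nothing is skipped, so the result stays 3. The 1-1 pair contributes 0 and is still counted, as you expected. ?dist also mentions that the sum is scaled up proportionally to the number of columns used when some columns are excluded.
By contrast, the condition
if(!ISNAN(dev) ||
   (!R_FINITE(diff) && diff == sum &&
    /* use Inf = lim x -> oo */ (int) (dev = 1.)))
only handles the special case of infinite values, which may not be documented but does not come into play for these inputs.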
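The effect of the last lines of the C function (the division by count/nc) can be mimicked in R. The following is only a rough sketch under a few assumptions: `canberra_like_dist` is a hypothetical name, the comparison against DBL_MIN is simplified to a comparison against 0, and the special branch for infinite values is left out.

```r
# Sketch of the C logic for finite inputs; not R's actual implementation.
canberra_like_dist <- function(a, b) {
  num <- abs(a - b)
  den <- abs(a + b)
  # Keep non-missing pairs whose numerator or denominator is non-zero
  keep <- !is.na(a) & !is.na(b) & (num > 0 | den > 0)
  count <- sum(keep)
  if (count == 0) return(NA_real_)
  total <- sum(num[keep] / den[keep])
  # Scale the sum up by nc / count when some terms were skipped
  total / (count / length(a))
}

canberra_like_dist(c(0, 1, 0, 0, 1), c(1, 0, 1, 0, 1))  # 3.75
canberra_like_dist(c(0, 1, 0, 0), c(1, 0, 1, 0))        # 4
canberra_like_dist(c(0, 1, 0), c(1, 0, 1))              # 3
```

With this scaling, the three examples from the question give the same values that dist reports.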