计算sum_i f(i)x(i)x(i)'快?

Mem*_*ing 9 julia

我正在尝试计算列向量的f(i) * x(i) * x(i)' 位置的总和x(i),x(i)'是转置,并且f(i)是标量.所以它是外部产品的加权总和.

在MATLAB中,通过使用可以非常快速地实现bsxfun.以下代码在我的笔记本电脑上运行260毫秒(MacBook Air 2010)

N = 1e5;
d = 100;
f = randn(N, 1);
x = randn(N, d);
% H = zeros(d, d);

tic;
H = x' * bsxfun(@times, f, x);
toc
Run Code Online (Sandbox Code Playgroud)

我一直在努力让朱莉娅做同样的工作,但我不能更快地完成它.

N = int(1e5);
d = 100;
f = randn(N);
x = randn(N, d);

function hess1(x, f)
    N, d = size(x);
    temp = zeros(N, d);
    @simd for kk = 1:N
        @inbounds temp[kk, :] = f[kk] * x[kk, :];
    end
    H = x' * temp;
end

function hess2(x, f)
    N, d = size(x);
    H2 = zeros(d,d);
    @simd for k = 1:N
        @inbounds H2 += f[k] * x[k, :]' * x[k, :];
    end
    return H2
end

function hess3(x, f)
    N, d = size(x);
    H3 = zeros(d,d);
    for k = 1:N
        for k1 = 1:d
            @simd for k2 = 1:d
                @inbounds H3[k1, k2] += x[k, k1] * x[k, k2] * f[k];
            end
        end
    end
    return H3
end
Run Code Online (Sandbox Code Playgroud)

结果是

@time H1 = hess1(x, f);
@time H2 = hess2(x, f);
@time H3 = hess3(x, f);
elapsed time: 0.776116469 seconds (262480224 bytes allocated, 26.49% gc time)
elapsed time: 30.496472345 seconds (16385442496 bytes allocated, 56.07% gc time)
elapsed time: 2.769934563 seconds (80128 bytes allocated)
Run Code Online (Sandbox Code Playgroud)

hess1就像MATLAB一样bsxfun但速度较慢,并且不hess3使用临时内存,但速度要慢得多.我最好的julia代码比MATLAB慢3倍.

如何让这个julia代码更快?

IJulia要点:http://nbviewer.ipython.org/gist/memming/669fb8e78af3338ebf6f

Julia版本:0.3.0-rc1

编辑:我在更强大的计算机上测试(3.5 Ghz Intel i7,4核,L2 256kB,L3 8 MB)

  • MATLAB R2014a无-singleCompThread:0.053秒
  • MATLAB R2014a -singleCompThread:0.080小号(@勒托利的建议)
  • 朱莉娅0.3.0-rc1
    • hess1已用时间:0.215406904秒(分配262498648字节,32.34%gc时间)
    • hess2 已用时间:10.722578699秒(分配16384080176字节,gc时间62.20%)
    • hess3 已用时间:1.065504355秒(已分配80176个字节)
    • bsxfunstyle已用时间:0.063540168秒(分配80081072字节,25.04%gc时间)(@ IainDunning的解决方案)

实际上,使用broadcast速度更快,可与MATLAB的bsxfun相媲美.

Iai*_*ing 4

您正在寻找该broadcast功能。这是讨论功能和命名的相关问题

我实现了你的版本以及一个broadcast版本,这是我发现的:

srand(1988)
N = 100_000
d = 100
f = randn(N, 1)
x = randn(N, d)

function hess1(x, f)
    N, d = size(x);
    temp = zeros(N, d);
    @simd for kk = 1:N
        @inbounds temp[kk, :] = f[kk] * x[kk, :];
    end
    H = x' * temp;
end

function bsxfunstyle(x, f)
    x' * broadcast(*,f,x)
end

# Warmup
hess1(x,f)
bsxfunstyle(x, f)

# For real
println("Hess1")
@time H1 = hess1(x, f)
println("Broadcast")
@time H2 = bsxfunstyle(x, f)

# Check solutions are identical
println(sum(abs(H1-H2)))
Run Code Online (Sandbox Code Playgroud)

带输出

Hess1
elapsed time: 0.324256216 seconds (262498648 bytes allocated, 33.95% gc time)
Broadcast
elapsed time: 0.126647594 seconds (80080696 bytes allocated, 20.22% gc time)
0.0
Run Code Online (Sandbox Code Playgroud)

  • `bsxfun` 也是多线程的。如果您想了解其影响,可以比较“matlab -singleCompThread”。您现在可以使用“SharedArray”实现多线程算法,并且更普遍的线程即将到来。 (3认同)