Gini coefficient in Julia: efficient and accurate code

Dav*_*nro 7 statistics inequality distribution julia

I am trying to implement the following formula in Julia to calculate the Gini coefficient of a wage distribution:

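$$G = 1 - \frac{\sum_{i=1}^{n} f(y_i)\,(S_{i-1} + S_i)}{S_n}$$

where

$$S_i = \sum_{j=1}^{i} f(y_j)\, y_j, \qquad S_0 = 0.$$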

Here is a simplified version of the code I am using:

# Takes an array where first column is value of wages
# (y_i in formula), and second column is probability
# of wage value (f(y_i) in formula).
function gini(wagedistarray)
    # First calculate S values in formula
    for i in 1:length(wagedistarray[:,1])
        for j in 1:i
            Swages[i]+=wagedistarray[j,2]*wagedistarray[j,1]
        end
    end

    # Now calculate value to subtract from 1 in gini formula
    Gwages = Swages[1]*wagedistarray[1,2]
    for i in 2:length(Swages)
        Gwages += wagedistarray[i,2]*(Swages[i]+Swages[i-1])
    end

    # Final step of gini calculation
    return giniwages=1-(Gwages/Swages[length(Swages)])          
end

wagedistarray=zeros(10000,2)                                 
Swages=zeros(length(wagedistarray[:,1]))                    

for i in 1:length(wagedistarray[:,1])
   wagedistarray[i,1]=1
   wagedistarray[i,2]=1/10000
end


@time result=gini(wagedistarray)

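It gives a value close to zero, which is what you would expect for a perfectly equal wage distribution. However, it takes quite a long time: 6.796 seconds.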

Any ideas for improvement?

Iai*_*ing 13

Try this:

function gini(wagedistarray)
    nrows = size(wagedistarray,1)
    Swages = zeros(nrows)
    for i in 1:nrows
        for j in 1:i
            Swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
        end
    end

    Gwages=Swages[1]*wagedistarray[1,2]
    for i in 2:nrows
        Gwages+=wagedistarray[i,2]*(Swages[i]+Swages[i-1])
    end

    return 1-(Gwages/Swages[length(Swages)])

end

wagedistarray=zeros(10000,2)
for i in 1:size(wagedistarray,1)
   wagedistarray[i,1]=1
   wagedistarray[i,2]=1/10000
end

@time result=gini(wagedistarray)
  • Time before: 5.913907256 seconds (4000481676 bytes allocated, 25.37% gc time)
  • Time after: 0.134799301 seconds (507260 bytes allocated)
  • Time after (second run): elapsed time: 0.123665107 seconds (80112 bytes allocated)

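The main problem is that Swages was a global variable (it did not live inside the function), which is not good coding practice and, more importantly, is a performance killer. The other thing I noticed was length(wagedistarray[:,1]), which makes a copy of that column and then asks for its length; this generates some extra "garbage". The second run is faster because the first call to a function includes some compilation time.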
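The two points above can be sketched on a tiny input. This is only an illustration: the 4×2 matrix `A` and the helper `mean_wage` are hypothetical names that do not appear in the original code.

```julia
# Hypothetical 4×2 wage table: column 1 = wage, column 2 = probability.
A = [1.0 0.25; 2.0 0.25; 3.0 0.25; 4.0 0.25]

# Both expressions return the number of rows, but A[:, 1] copies the column first.
nrows_copy = length(A[:, 1])   # allocates a temporary 4-element vector
nrows_size = size(A, 1)        # only reads the array's dimensions

# Keep the accumulator local to the function instead of relying on a global,
# so the compiler knows its type throughout the loop.
function mean_wage(A)
    total = 0.0                # local accumulator
    for i in 1:size(A, 1)
        total += A[i, 1] * A[i, 2]
    end
    return total
end

println(nrows_copy == nrows_size)
println(mean_wage(A))
```

If a global really is needed, passing it into the function as an argument gives the compiler the same type information as a local variable.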

You can push performance even higher by using @inbounds, i.e.

function gini(wagedistarray)
    nrows = size(wagedistarray,1)
    Swages = zeros(nrows)
    @inbounds for i in 1:nrows
        for j in 1:i
            Swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
        end
    end

    Gwages=Swages[1]*wagedistarray[1,2]
    @inbounds for i in 2:nrows
        Gwages+=wagedistarray[i,2]*(Swages[i]+Swages[i-1])
    end

    return 1-(Gwages/Swages[length(Swages)])
end

This gives me elapsed time: 0.042070662 seconds (80112 bytes allocated)

Finally, check out this version, which is actually faster than all of the above and is also, I think, the most accurate:

function gini2(wagedistarray)
    Swages = cumsum(wagedistarray[:,1].*wagedistarray[:,2])
    Gwages = Swages[1]*wagedistarray[1,2] +
                sum(wagedistarray[2:end,2] .* 
                        (Swages[2:end]+Swages[1:end-1]))
    return 1 - Gwages/Swages[end]
end

This has elapsed time: 0.00041119 seconds (721664 bytes allocated). The main benefit comes from going from an O(n^2) double loop to an O(n) cumsum.
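To see why the cumsum version computes the same partial sums as the double loop, here is a small sketch on a hypothetical four-point distribution. The names `y`, `f`, `S_quadratic`, and `S_linear` are illustrative and not taken from the code above.

```julia
# Hypothetical distribution: wages 1..4, each with probability 0.25.
y = [1.0, 2.0, 3.0, 4.0]
f = [0.25, 0.25, 0.25, 0.25]

# O(n^2): recompute every partial sum from scratch, as the double loop does.
S_quadratic = [sum(f[j] * y[j] for j in 1:i) for i in 1:length(y)]

# O(n): each partial sum reuses the previous one.
S_linear = cumsum(f .* y)

# Same final steps as gini2, written with the vectors y and f.
Gwages = S_linear[1] * f[1] +
         sum(f[2:end] .* (S_linear[2:end] + S_linear[1:end-1]))
G = 1 - Gwages / S_linear[end]

println(S_quadratic == S_linear)
println(G)
```

For this input the result should be 0.25, which agrees with the mean-absolute-difference definition of the Gini coefficient for the values 1, 2, 3, 4.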


And*_*yer 5

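IainDunning has already provided a good answer whose code is fast enough for practical purposes (the function gini2). If you enjoy performance tuning, you can get an additional speedup of about 20x by avoiding temporary arrays (gini3). See the following code, which compares the performance of the two implementations: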

using TimeIt

wagedistarray=zeros(10000,2)
for i in 1:size(wagedistarray,1)
   wagedistarray[i,1]=1
   wagedistarray[i,2]=1/10000
end

wages = wagedistarray[:,1]
wagefrequencies = wagedistarray[:,2];

# original code
function gini2(wagedistarray)
    Swages = cumsum(wagedistarray[:,1].*wagedistarray[:,2])
    Gwages = Swages[1]*wagedistarray[1,2] +
                sum(wagedistarray[2:end,2] .* 
                        (Swages[2:end]+Swages[1:end-1]))
    return 1 - Gwages/Swages[end]
end

# new code
function gini3(wages, wagefrequencies)
    Swages_previous = wages[1]*wagefrequencies[1]
    Gwages = Swages_previous*wagefrequencies[1]
    @inbounds for i = 2:length(wages)
        freq = wagefrequencies[i]
        Swages_current = Swages_previous + wages[i]*freq
        Gwages += freq * (Swages_current+Swages_previous)
        Swages_previous = Swages_current
    end
    return 1.0 - Gwages/Swages_previous
end

result=gini2(wagedistarray) # warming up JIT
println("result with gini2: $result, time:")
@timeit result=gini2(wagedistarray)

result=gini3(wages, wagefrequencies) # warming up JIT
println("result with gini3: $result, time:")
@timeit result=gini3(wages, wagefrequencies)

The output is:

result with gini2: 0.0, time:
1000 loops, best of 3: 321.57 µs per loop
result with gini3: -1.4210854715202004e-14, time:
10000 loops, best of 3: 16.24 µs per loop

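gini3 is slightly less accurate than gini2 because it sums sequentially; one would have to use a variant of pairwise summation to improve the accuracy.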
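The accuracy gap between sequential and pairwise summation can be illustrated in isolation. The sketch below is not a pairwise version of gini3; it only compares a strict left-to-right sum with Julia's built-in `sum`, which, to my knowledge, uses pairwise summation for arrays. The test vector `x` and the variable names are assumptions made for this example.

```julia
# Hypothetical test data: one million copies of 0.1 (not exactly representable).
x = fill(0.1, 10^6)

sequential = foldl(+, x)             # strict left-to-right accumulation
pairwise   = sum(x)                  # built-in sum (assumed to be pairwise for arrays)
reference  = Float64(sum(big.(x)))   # high-precision reference via BigFloat

err_sequential = abs(sequential - reference)
err_pairwise   = abs(pairwise - reference)

println("sequential error: ", err_sequential)
println("pairwise error:   ", err_pairwise)
```

The sequential error grows roughly in proportion to the number of terms, while the pairwise error grows only with its logarithm, which is why a loop such as the one in gini3 can drift slightly away from the cumsum and sum based result.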
