Ana*_*sid 3 python knn dataframe pandas julia
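I decided to translate some of the Python code from Peter Harrington's Machine Learning in Action into Julia, starting with the kNN algorithm.
After normalizing a dataset he provides, I wrote a few functions: find_kNN(), mass_kNN() (a function that finds the k nearest neighbors for multiple inputs), and a function that splits a given dataset into randomly chosen training and test sets, calls mass_kNN(), and plots the resulting accuracy over several runs.
I then compared the run times of the Julia code and the equivalent Python code. (In Julia I used Distances to compute the Euclidean distance and Gadfly for plotting, but turning the plotting off doesn't affect the time much.)
Results: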
Julia:
elapsed time: 1.175523034 seconds (455531636 bytes allocated, 47.54% gc time)
Python:
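Time elapsed: 0.9517326354980469 seconds
I'm wondering whether there is a way to speed up my Julia code, or whether it is already running about as fast as it can (that is, whether I've made any glaring mistakes as far as making the code run fast is concerned).
Thanks!
EDIT: Removing the convert() statements and passing everything as Real slowed the time down to 2.29 seconds.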
So first of all, I removed the last two lines of plot_probs. Plotting isn't really a great thing to benchmark, I think, and it's largely beyond my (or your) control; you could try PyPlot if it's a real factor. I also timed plot_probs a few times to see how much time is spent compiling it the first time:
**********elapsed time: 1.071184218 seconds (473218720 bytes allocated, 26.36% gc time)
**********elapsed time: 0.658809962 seconds (452017744 bytes allocated, 40.29% gc time)
**********elapsed time: 0.660609145 seconds (452017680 bytes allocated, 40.45% gc time)
So there is a penalty of roughly 0.4 s paid once. Moving on to the actual algorithm, I used the built-in profiler (e.g. @profile plot_probs(norm_array, 0.25, [1:3], 10, 3)), which revealed that essentially all of the time is spent here:
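[ push!(dist, euclidean(set_array[i,:][:], input_array)) for i in 1:size(set_array, 1) ]
[ d[i] = get(d, i, 0) + 1 for i in labels[sortperm(dist)][1:k] ]
Using array comprehensions like this isn't idiomatic Julia (or Python). The first one is also slow, because all of that slicing makes many copies of the data. I'm not an expert with Distances.jl, but I think you can replace it with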
dist = Distances.colwise(Euclidean(), set_array', input_array)
d = Dict{Int,Int}()
for i in labels[sortperm(dist)][1:k]
    d[i] = get(d, i, 0) + 1
end
which gave me
**********elapsed time: 0.731732444 seconds (234734112 bytes allocated, 20.90% gc time)
**********elapsed time: 0.30319397 seconds (214057552 bytes allocated, 37.84% gc time)
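To illustrate the point about slicing making copies, here is a minimal sketch written in current Julia (1.x) syntax rather than the version used in the question. The function name `knn_label` is hypothetical, and it uses `@view` from Base instead of Distances.jl so that it runs without extra packages; treat it as one possible shape for the distance-then-vote step, not the original code.

```julia
# Hypothetical helper (not from the original post), written for Julia 1.x.
function knn_label(set_array::Matrix{Float64}, labels::Vector{Int},
                   input_array::Vector{Float64}, k::Int)
    n = size(set_array, 1)
    dist = Vector{Float64}(undef, n)
    for i in 1:n
        # A view refers to row i in place; set_array[i, :] would copy it.
        row = @view set_array[i, :]
        dist[i] = sqrt(sum(abs2(a - b) for (a, b) in zip(row, input_array)))
    end

    # Count the labels of the k nearest rows with a plain loop.
    votes = Dict{Int,Int}()
    for lab in labels[sortperm(dist)[1:k]]
        votes[lab] = get(votes, lab, 0) + 1
    end

    # Return the label with the most votes (ties are broken arbitrarily).
    best_label, best_count = 0, 0
    for (lab, c) in votes
        if c > best_count
            best_label, best_count = lab, c
        end
    end
    return best_label
end

# Tiny made-up dataset: three points near the origin, two near (5, 5).
X = [0.0 0.0; 0.1 0.0; 5.0 5.0; 5.1 5.0; 0.0 0.1]
y = [1, 1, 2, 2, 1]
knn_label(X, y, [0.05, 0.05], 3)   # returns 1
```

Because Julia arrays are stored column by column, row views are still not the most cache-friendly access pattern, which is one reason storing the data transposed can help further.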
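Performing the transpose once, in mass_kNN, would squeeze out more performance, but that requires touching too many places and this post is long enough already. Trying to micro-optimize it led me to use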
dist = zeros(size(set_array, 1))
@inbounds for i in 1:size(set_array, 1)
    d = 0.0
    for j in 1:length(input_array)
        z = set_array[i,j] - input_array[j]
        d += z*z
    end
    dist[i] = sqrt(d)
end
to get
**********elapsed time: 0.646256408 seconds (158869776 bytes allocated, 15.21% gc time)
**********elapsed time: 0.245293449 seconds (138817648 bytes allocated, 35.40% gc time)
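So it takes about half the time, but it's not really worth it, and it's less flexible (e.g. if I wanted L1 distance). Other code review points (unsolicited, I know):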
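Vector{Float64} and Matrix{Float64} are much easier on the eye than Array{Float64,1} and Array{Float64,2}, and less likely to be confused.
Float64[] is more usual than Array(Float64, 0).
Int64 can be written as Int, since it doesn't need to be a 64-bit integer.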
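These review points can be checked directly at the REPL. The following is a small sketch in current Julia (1.x) syntax, which may differ from the version used in the question; the variable name `dist` is only illustrative.

```julia
# Vector and Matrix are aliases for 1- and 2-dimensional Array types,
# so either spelling names exactly the same type.
@assert Vector{Float64} === Array{Float64,1}
@assert Matrix{Float64} === Array{Float64,2}

# An empty Float64 vector via the typed literal; it can grow with push!.
dist = Float64[]
push!(dist, 1.5)

# Vector{Float64}() is another way to create an empty vector in Julia 1.x.
@assert isempty(Vector{Float64}())

# Int is the platform's native integer type:
# Int64 on 64-bit systems, Int32 on 32-bit systems.
@assert Int === (Sys.WORD_SIZE == 64 ? Int64 : Int32)
```

Writing Int instead of Int64 in function signatures keeps the code usable on both 32-bit and 64-bit builds.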