I am trying to run sklearn.decomposition.TruncatedSVD() on two different computers and to understand the performance differences.
Computer 1 (Windows 7, physical machine)
OS Name Microsoft Windows 7 Professional
System Type x64-based PC
Processor Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 3401 Mhz, 4 Core(s), 8 Logical Processor(s)
Installed Physical Memory (RAM) 8.00 GB
Total Physical Memory 7.89 GB
Computer 2 (Debian, on the Amazon cloud)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
width: 64 bits
capabilities: ldt16 vsyscall32
*-core
description: Motherboard
physical id: 0
*-memory
description: System memory
physical id: 0
size: 29GiB
*-cpu
product: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
vendor: Intel Corp.
physical id: 1
bus info: cpu@0
width: 64 bits
Computer 3 (Windows 2008 R2, on the Amazon cloud)
OS Name Microsoft Windows Server 2008 R2 Datacenter
Version 6.1.7601 Service Pack 1 Build 7601
System Type x64-based PC
Processor Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 2500 Mhz,
4 Core(s), 8 Logical Processor(s)
Installed Physical Memory (RAM) 30.0 GB
Both computers run Python 3.2 and the same sklearn, numpy and scipy versions.
I ran cProfile as follows:

import cProfile
from sklearn.decomposition import TruncatedSVD

print(vectors.shape)
# >>> (7500, 2042)
_decomp = TruncatedSVD(n_components=680, random_state=1)
global _o
_o = _decomp
cProfile.runctx('_o.fit_transform(vectors)', globals(), locals(), sort=1)
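For anyone who wants to reproduce the comparison without the original data, a self-contained sketch along these lines should work; the sparse matrix below is a synthetic stand-in for vectors, with an assumed density of 1%, since the real data is not shown here.

import cProfile

import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in for the original data: same shape, assumed 1% density.
vectors = sp.rand(7500, 2042, density=0.01, format='csr', random_state=1)

_o = TruncatedSVD(n_components=680, random_state=1)
cProfile.runctx('_o.fit_transform(vectors)', globals(), locals(), sort=1)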
Computer 1 output
>>> 833 function calls in 1.710 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.767 0.767 0.782 0.782 decomp_svd.py:15(svd)
1 0.249 0.249 0.249 0.249 {method 'enable' of '_lsprof.Profiler' objects}
1 0.183 0.183 0.183 0.183 {method 'normal' of 'mtrand.RandomState' objects}
6 0.174 0.029 0.174 0.029 {built-in method csr_matvecs}
6 0.123 0.021 0.123 0.021 {built-in method csc_matvecs}
2 0.110 0.055 0.110 0.055 decomp_qr.py:14(safecall)
1 0.035 0.035 0.035 0.035 {built-in method dot}
1 0.020 0.020 0.589 0.589 extmath.py:185(randomized_range_finder)
2 0.018 0.009 0.019 0.010 function_base.py:532(asarray_chkfinite)
24 0.014 0.001 0.014 0.001 {method 'ravel' of 'numpy.ndarray' objects}
1 0.007 0.007 0.009 0.009 twodim_base.py:427(triu)
1 0.004 0.004 1.710 1.710 extmath.py:232(randomized_svd)
Computer 2 output
>>> 858 function calls in 40.145 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
2 32.116 16.058 32.116 16.058 {built-in method dot}
1 6.148 6.148 6.156 6.156 decomp_svd.py:15(svd)
2 0.561 0.281 0.561 0.281 decomp_qr.py:14(safecall)
6 0.561 0.093 0.561 0.093 {built-in method csr_matvecs}
1 0.337 0.337 0.337 0.337 {method 'normal' of 'mtrand.RandomState' objects}
6 0.202 0.034 0.202 0.034 {built-in method csc_matvecs}
1 0.052 0.052 1.633 1.633 extmath.py:183(randomized_range_finder)
1 0.045 0.045 0.054 0.054 _methods.py:73(_var)
1 0.023 0.023 0.023 0.023 {method 'argmax' of 'numpy.ndarray' objects}
1 0.023 0.023 0.046 0.046 extmath.py:531(svd_flip)
1 0.016 0.016 40.145 40.145 <string>:1(<module>)
24 0.011 0.000 0.011 0.000 {method 'ravel' of 'numpy.ndarray' objects}
6 0.009 0.002 0.009 0.002 {method 'reduce' of 'numpy.ufunc' objects}
2 0.008 0.004 0.009 0.004 function_base.py:532(asarray_chkfinite)
Computer 3 output
>>> 858 function calls in 2.223 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.956 0.956 0.972 0.972 decomp_svd.py:15(svd)
2 0.306 0.153 0.306 0.153 {built-in method dot}
1 0.274 0.274 0.274 0.274 {method 'normal' of 'mtrand.RandomState' objects}
6 0.205 0.034 0.205 0.034 {built-in method csr_matvecs}
6 0.151 0.025 0.151 0.025 {built-in method csc_matvecs}
2 0.133 0.067 0.133 0.067 decomp_qr.py:14(safecall)
1 0.032 0.032 0.043 0.043 _methods.py:73(_var)
1 0.030 0.030 0.030 0.030 {method 'argmax' of 'numpy.ndarray' objects}
24 0.026 0.001 0.026 0.001 {method 'ravel' of 'numpy.ndarray' objects}
2 0.019 0.010 0.020 0.010 function_base.py:532(asarray_chkfinite)
1 0.019 0.019 0.773 0.773 extmath.py:183(randomized_range_finder)
1 0.019 0.019 0.049 0.049 extmath.py:531(svd_flip)
Notice the difference in {built-in method dot}: from 0.035 s per call up to 16.058 s per call, about 450 times slower!
------+---------+---------+---------+---------+---------------------------------------
ncalls| tottime | percall | cumtime | percall | filename:lineno(function) HARDWARE
------+---------+---------+---------+---------+---------------------------------------
1 | 0.035 | 0.035 | 0.035 | 0.035 | {built-in method dot} Computer 1
2 | 32.116 | 16.058 | 32.116 | 16.058 | {built-in method dot} Computer 2
2 | 0.306 | 0.153 | 0.306 | 0.153 | {built-in method dot} Computer 3
I understand there should be a performance difference, but should it be this large?
Is there a way I can debug this performance issue further?
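One way to narrow it down is to take TruncatedSVD out of the picture and time a bare np.dot on dense matrices of comparable size; if the gap shows up there too, it is purely a BLAS issue. The shapes below are assumptions chosen to roughly match the dense products inside randomized_svd, not values taken from the profile.

import time

import numpy as np

# Time a single dense matrix product; the shapes are assumed, and any
# sufficiently large product will expose a slow reference BLAS just as clearly.
rng = np.random.RandomState(1)
a = rng.rand(7500, 2042)
b = rng.rand(2042, 690)

start = time.time()
np.dot(a, b)
print('np.dot: %.3f seconds' % (time.time() - start))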
Edit
I tested a new machine, Computer 3, with hardware similar to Computer 2's but a different operating system.
The result: 0.153 s per call of {built-in method dot} is still 100 times faster than on Linux!
Edit 2
Computer 1 numpy configuration:
>>> np.__config__.show()
lapack_opt_info:
libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd', 'mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
library_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/lib/intel64']
define_macros = [('SCIPY_MKL_H', None)]
include_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/include']
blas_opt_info:
libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
library_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/lib/intel64']
define_macros = [('SCIPY_MKL_H', None)]
include_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/include']
openblas_info:
NOT AVAILABLE
lapack_mkl_info:
libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd', 'mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
library_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/lib/intel64']
define_macros = [('SCIPY_MKL_H', None)]
include_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/include']
blas_mkl_info:
libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
library_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/lib/intel64']
define_macros = [('SCIPY_MKL_H', None)]
include_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/include']
mkl_info:
libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
library_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/lib/intel64']
define_macros = [('SCIPY_MKL_H', None)]
include_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/include']
Computer 2 numpy configuration:
>>> np.__config__.show()
lapack_info:
NOT AVAILABLE
lapack_opt_info:
NOT AVAILABLE
blas_info:
libraries = ['blas']
library_dirs = ['/usr/lib']
language = f77
atlas_threads_info:
NOT AVAILABLE
atlas_blas_info:
NOT AVAILABLE
lapack_src_info:
NOT AVAILABLE
openblas_info:
NOT AVAILABLE
atlas_blas_threads_info:
NOT AVAILABLE
blas_mkl_info:
NOT AVAILABLE
blas_opt_info:
libraries = ['blas']
library_dirs = ['/usr/lib']
language = f77
define_macros = [('NO_ATLAS_INFO', 1)]
atlas_info:
NOT AVAILABLE
lapack_mkl_info:
NOT AVAILABLE
mkl_info:
NOT AVAILABLE
{built-in method dot} is the np.dot function, a NumPy wrapper around the CBLAS routines for matrix-matrix, matrix-vector and vector-vector multiplication. Your Windows machines use the heavily tuned Intel MKL version of CBLAS. The Linux machine is using the slow old reference implementation.
If you install ATLAS or OpenBLAS (both available through Linux package managers), or indeed Intel MKL, you are likely to see a massive speedup. Try sudo apt-get install libatlas-dev, check the NumPy config again to see if it picked up ATLAS, and measure again.
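As a quick sanity check (a sketch, not part of the original answer): after installing a tuned BLAS and rebuilding NumPy against it, the config should list that library and a large dense product should be noticeably faster.

import time

import numpy as np

# After rebuilding NumPy against ATLAS/OpenBLAS, this should list the tuned
# library instead of the reference blas shown in the question.
np.__config__.show()

# A 2000x2000 dense product as a rough before/after timing.
a = np.random.rand(2000, 2000)
start = time.time()
np.dot(a, a)
print('2000x2000 dot: %.2f seconds' % (time.time() - start))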
Once you have decided on the right CBLAS library, you may want to recompile scikit-learn. Most of it just uses NumPy for its linear algebra needs, but some algorithms (notably k-means) use CBLAS directly.
The operating system has nothing to do with this.