Joh*_*han 5 python numpy pandas
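I want to compute quantiles/percentiles on a pandas DataFrame, but the function is very slow. I repeated the computation with NumPy and found that pandas takes almost 10,000 times longer!
Does anyone know why this happens? Should I compute the quantiles with NumPy and then build a new DataFrame instead of using pandas?
See the code below: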
import time
import pandas as pd
import numpy as np
q = np.array([0.1,0.4,0.6,0.9])
data = np.random.randn(10000, 4)
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
time1 = time.time()
pandas_quantiles = df.quantile(q, axis=1)
time2 = time.time()
print('Pandas took %0.3f ms' % ((time2-time1)*1000.0))
time1 = time.time()
numpy_quantiles = np.percentile(data, q*100, axis=1)
time2 = time.time()
print('Numpy took %0.3f ms' % ((time2-time1)*1000.0))
print((pandas_quantiles.values == numpy_quantiles).all())
# Output:
# Pandas took 15337.531 ms
# Numpy took 1.653 ms
# True
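This is fixed in recent versions of pandas running on Python 3. On small arrays pandas takes less than twice as long as NumPy, and on larger arrays I see a difference of about 5%.
With pandas 0.24.1 and Python 3 I get the following output: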
import time
import pandas as pd
import numpy as np
q = np.array([0.1,0.4,0.6,0.9])
data = np.random.randn(10000, 4)
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
time1 = time.time()
pandas_quantiles = df.quantile(q, axis=1)
time2 = time.time()
print('Pandas took %0.3f ms' % ((time2-time1)*1000.0))
time1 = time.time()
numpy_quantiles = np.percentile(data, q*100, axis=1)
time2 = time.time()
print('Numpy took %0.3f ms' % ((time2-time1)*1000.0))
print((pandas_quantiles.values == numpy_quantiles).all())
# Output:
# Pandas took 3.415 ms
# Numpy took 2.040 ms
# True
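If you are stuck on an older pandas version, the workaround raised in the question (compute with NumPy, then build a new DataFrame) could look roughly like the sketch below. The helper name `row_quantiles` is made up for illustration and is not part of either library. It assumes pandas 0.24 or later for `to_numpy()` (on older versions `df.values` would play the same role) and relies on both libraries using linear interpolation by default.

```python
import numpy as np
import pandas as pd


def row_quantiles(df, q):
    """Hypothetical helper: row-wise quantiles computed with NumPy,
    returned in the same layout as df.quantile(q, axis=1)."""
    q = np.asarray(q)
    # np.percentile expects percentages in [0, 100], so scale the fractions.
    values = np.percentile(df.to_numpy(), q * 100, axis=1)
    # values has shape (len(q), n_rows): quantile levels become the index,
    # the original row labels become the columns.
    return pd.DataFrame(values, index=q, columns=df.index)


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10000, 4)), columns=['a', 'b', 'c', 'd'])
q = [0.1, 0.4, 0.6, 0.9]

fast = row_quantiles(df, q)
reference = df.quantile(q, axis=1)

# Compare with a tolerance rather than exact equality, since the two
# code paths may differ in the last floating-point digit.
print(np.allclose(fast.to_numpy(), reference.to_numpy()))
```

On a recent pandas the speed gap is small enough that calling `df.quantile` directly is probably the simpler choice. The wrapper mainly makes sense when an upgrade is not possible.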