NumPy equivalent of pandas nunique

use*_*440 7 python numpy pandas

Is there a NumPy equivalent of pandas' row-wise nunique? I looked at np.unique with return_counts, but it does not seem to return what I want. For example:

import numpy as np

a = np.array([[120.52971, 75.02052, 128.12627], [119.82573, 73.86636, 125.792],
       [119.16805, 73.89428, 125.38216],  [118.38071, 73.35443, 125.30198],
       [118.02871, 73.689514, 124.82088]])
uniqueColumns, occurCount = np.unique(a, axis=0, return_counts=True)  # axis=0: row-wise

Result:

>>> occurCount
array([1, 1, 1, 1, 1], dtype=int64)

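I expected all 3s rather than all 1s.

The workaround, of course, is to convert to pandas and call nunique, but that is slow, and I would like to explore a pure NumPy implementation to speed things up. I am working with large dataframes, so I am looking for speedups wherever I can find them. I am also open to other solutions that would make this faster.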

Div*_*kar 3

We can use sorting and consecutive differences -

a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
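To spell out why this works: after sorting each row, equal values end up next to each other, so every zero in the consecutive differences marks one duplicate, and the unique count is the number of columns minus the number of duplicates. A minimal sketch on a small made-up array, using a hypothetical helper name `nunique_rows` (not a NumPy function):

```python
import numpy as np

def nunique_rows(a):
    """Count distinct values in each row of a 2-D array (hypothetical helper)."""
    a_sorted = np.sort(a, axis=1)            # equal values become neighbours
    dupes = np.diff(a_sorted, axis=1) == 0   # True where a value repeats its left neighbour
    return a.shape[1] - dupes.sum(axis=1)    # columns minus duplicates

# Illustrative data, not taken from the question
a = np.array([[1.0, 1.0, 2.0],
              [3.0, 4.0, 5.0],
              [7.0, 7.0, 7.0]])
counts = nunique_rows(a)
print(counts)  # [2 3 1]
```

One difference from pandas to keep in mind: NaN minus NaN is NaN rather than 0, so each NaN in a row is counted as a distinct value here, whereas pandas nunique drops NaN by default.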

For a small performance boost, we can replace np.diff with slicing -

a_s = np.sort(a,axis=1)
out = a.shape[1]-(a_s[:,:-1] == a_s[:,1:]).sum(1)

If you want to introduce a tolerance when checking for uniqueness, we can use np.isclose -

a.shape[1]-(np.isclose(np.diff(np.sort(a,axis=1),axis=1),0)).sum(1)
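A small sketch of how the tolerance changes the count. Because the differences are compared against 0, the relative tolerance of np.isclose has no effect here and only `atol` (default 1e-8) matters. The values and the `atol=0.15` setting below are made up purely for illustration:

```python
import numpy as np

# Illustrative data: row 0 has two values differing by 1e-10,
# row 1 has values spaced 0.1 apart.
a = np.array([[1.0, 1.0 + 1e-10, 2.0],
              [3.0, 3.1, 3.2]])

d = np.diff(np.sort(a, axis=1), axis=1)   # gaps between sorted neighbours

exact = a.shape[1] - (d == 0).sum(1)                      # strict equality
tolerant = a.shape[1] - np.isclose(d, 0).sum(1)           # default atol=1e-8
loose = a.shape[1] - np.isclose(d, 0, atol=0.15).sum(1)   # assumed, wider tolerance

print(exact)     # [3 3]
print(tolerant)  # [2 3]
print(loose)     # [2 1]
```

Note that the tolerance is applied to neighbouring values only, so a chain of close values collapses into one: in the second row 3.0 and 3.2 differ by more than 0.15, yet the row is counted as a single unique value with `atol=0.15`.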

Sample run -

In [51]: import pandas as pd

In [48]: a
Out[48]: 
array([[120.52971 , 120.52971 , 128.12627 ],
       [119.82573 ,  73.86636 , 125.792   ],
       [119.16805 ,  73.89428 , 125.38216 ],
       [118.38071 , 118.38071 , 118.38071 ],
       [118.02871 ,  73.689514, 124.82088 ]])

In [49]: pd.DataFrame(a).nunique(axis=1).values
Out[49]: array([2, 3, 3, 1, 3])

In [50]: a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
Out[50]: array([2, 3, 3, 1, 3])

Timings on a simple case with random numbers and at least 2 unique numbers per row -

In [41]: np.random.seed(0)
    ...: a = np.random.rand(10000,5)
    ...: a[:,-1] = a[:,0]

In [42]: %timeit pd.DataFrame(a).nunique(axis=1).values
    ...: %timeit a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
1.31 s ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
758 µs ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [43]: %%timeit
    ...: a_s = np.sort(a,axis=1)
    ...: out = a.shape[1]-(a_s[:,:-1] == a_s[:,1:]).sum(1)
694 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)