Counting the number of elements in a numpy ndarray

use*_*532 2 python numpy

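How do I count the number of occurrences of each data point in an ndarray?

What I want to do is run a OneHotEncoder on all the values that appear at least N times in my ndarray.

I also want to replace every value that appears fewer than N times with another element that does not appear in the array (let's call it new_value).

So, for example, I have: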

import numpy as np

# dtype=object is needed because the inner lists have different lengths
a = np.array([[[2], [2,3], [3,34]],
              [[3], [4,5], [3,34]],
              [[3], [2,3], [3,4]]], dtype=object)

With a threshold of N = 2, I want something like:

# pseudocode: count(x) stands for the number of times x occurs in a
b = [OneHotEncoder(a[:,[i]])[0] if count(a[:,[i]]) >= 2
     else OneHotEncoder(new_value) for i in range(a.shape[1])]

So, just to illustrate the replacement I want, ignoring the OneHotEncoder and using new_value = 10, my array should look like this:

a = np.array([[[10], [2,3], [3,34]],
              [[3], [10], [3,34]],
              [[3], [2,3], [10]]], dtype=object)

Dan*_*iel 6

How about something like this?

First, count how many times each unique element occurs in the array:

>>> a=np.random.randint(0,5,(3,3))
>>> a
array([[0, 1, 4],
       [0, 2, 4],
       [2, 4, 0]])
>>> ua,uind=np.unique(a,return_inverse=True)
>>> count=np.bincount(uind)
>>> ua
array([0, 1, 2, 4]) 
>>> count
array([3, 1, 2, 3]) 

From the ua and count arrays you can see that 0 appears 3 times, 1 appears once, and so on.
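On NumPy 1.9 and later, the same counts can also be obtained directly with the `return_counts=True` argument of `np.unique`, without the separate `np.bincount` step. A minimal sketch using the array shown above (the names `values`, `counts` and `occurrences` are only illustrative):

```python
import numpy as np

a = np.array([[0, 1, 4],
              [0, 2, 4],
              [2, 4, 0]])

# values: sorted unique elements; counts: how often each one occurs in a
values, counts = np.unique(a, return_counts=True)

# Pair each unique value with its count for easy lookup
occurrences = {int(v): int(c) for v, c in zip(values, counts)}
print(occurrences)  # {0: 3, 1: 1, 2: 2, 4: 3}
```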

import numpy as np

def mask_fewest(arr,thresh,replace):
    ua,uind=np.unique(arr,return_inverse=True)
    uind=uind.ravel()  #Newer NumPy versions may return uind with the shape of arr.
    count=np.bincount(uind)
    #Here ua has all of the unique elements, count will have the number of times 
    #each appears.

    #@Jamie's suggestion to make the rep_mask faster.
    #Find which elements do not appear at least `thresh` times and create a mask.
    #np.isin is the newer spelling of np.in1d.
    rep_mask = np.isin(uind, np.where(count < thresh)[0])

    arr.flat[rep_mask]=replace 
    #Replace elements based on above mask (arr is modified in place).

    return arr


>>> a=np.random.randint(2,8,(4,4))
[[6 7 7 3]
 [7 5 4 3]
 [3 5 2 3]
 [3 3 7 7]]


>>> mask_fewest(a,2,10)
[[10  7  7  3]
 [ 7  5 10  3]
 [ 3  5 10  3]
 [ 3  3  7  7]]

For the example above (let me know whether you intended a 2D array or a 3D array):

>>> a
[[[2] [2, 3] [3, 34]]
 [[3] [4, 5] [3, 34]]
 [[3] [2, 3] [3, 4]]]


>>> mask_fewest(a,2,10)
[[10 [2, 3] [3, 34]]
 [[3] 10 [3, 34]]
 [[3] [2, 3] 10]]

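  • +1 If I had any money I would bet that before long there will be an `np.count_unique` function that calls `np.bincount` on the indices returned by `np.unique` with `return_inverse=True`; it is a construct I find myself typing over and over. As a potential improvement, I am a little troubled by the 2D array you are building and collapsing to compute the mask: that kind of trick usually scales very badly. I just timed this to be much faster for large datasets, and slower for very small ones: `rep_mask = np.in1d(a, ua[count < thresh])`. (2 upvotes)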
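The variant suggested in the comment above can be sketched as follows. This is an illustration under assumptions, not the answer's original code: the function name `mask_rare` is hypothetical, it uses `np.isin` (the newer spelling of `np.in1d`) together with `return_counts=True`, and it returns a copy instead of modifying its input.

```python
import numpy as np

def mask_rare(arr, thresh, replace):
    """Return a copy of arr with values occurring fewer than thresh times set to replace."""
    # Unique values and how many times each one occurs
    values, counts = np.unique(arr, return_counts=True)
    # Values that do not reach the threshold
    rare = values[counts < thresh]
    out = arr.copy()
    # np.isin tests each element of arr for membership in rare
    out[np.isin(arr, rare)] = replace
    return out

# Same 4x4 array as in the answer's example
a = np.array([[6, 7, 7, 3],
              [7, 5, 4, 3],
              [3, 5, 2, 3],
              [3, 3, 7, 7]])
result = mask_rare(a, 2, 10)
print(result)
```

Testing membership against the short array of rare values means the inverse index from `np.unique` is not needed at all, which matches the one-line mask proposed in the comment.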