NumPy: How to convert observations to probabilities?

Question

I have a feature matrix and a corresponding array of targets, each of which is either 1 or 0:

# raw observations
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])

targets = np.array([1, 0, 1, 1, 0, 0])

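As you can see, the same feature row may appear with both a 1 and a 0 target. I need to convert my raw observation matrix into a probability matrix, where each unique feature row is mapped to the probability of observing it with target 1: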

[1 1 0] -> 0.5
[0 1 0] -> 0.67
[0 0 1] -> 0
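
As a quick check of where these numbers come from, the sketch below selects the rows equal to one pattern and averages their targets. The helper name `proba_for` is hypothetical and introduced only for illustration; it is not part of the original post.

```python
import numpy as np

features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
targets = np.array([1, 0, 1, 1, 0, 0])

def proba_for(pattern, features, targets):
    """Share of rows equal to `pattern` whose target is 1 (hypothetical helper)."""
    # Boolean mask of the rows that match the pattern in every column
    mask = (features == np.asarray(pattern)).all(axis=1)
    # The mean of 0/1 targets is the fraction of ones
    return targets[mask].mean()

print(proba_for([1, 1, 0], features, targets))  # 1 of 2 matching rows has target 1
print(proba_for([0, 1, 0], features, targets))  # 2 of 3 matching rows have target 1
print(proba_for([0, 0, 1], features, targets))  # 0 of 1 matching rows has target 1
```

This only works for patterns that occur at least once in `features`; for an unseen pattern the mask is empty and the mean is undefined.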

I have built a fairly straightforward solution:

import numpy as np

# raw observations
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])

targets = np.array([1, 0, 1, 1, 0, 0])

from collections import Counter

def convert_obs_to_proba(features, targets):
    features_ = []
    targets_ = []

    # compute unique rows (idx will point to some representative)
    b = np.ascontiguousarray(features).view(np.dtype((np.void, features.dtype.itemsize * features.shape[1])))
    _, idx = np.unique(b, return_index=True)

    idx = idx[::-1]

    zeros = Counter()
    ones = Counter()

    # collect row-wise number of one and zero targets
    for i, row in enumerate(features[:]):        
        if targets[i] == 0:
            zeros[tuple(row)] += 1
        else:
            ones[tuple(row)] += 1

    # iterate over unique features and compute probabilities
    for k in idx:
        unique_row = features[k]

        zero_count = zeros[tuple(unique_row)]
        one_count = ones[tuple(unique_row)]

        proba = float(one_count) / float(zero_count + one_count)

        features_.append(unique_row)
        targets_.append(proba)

    return np.array(features_), np.array(targets_)

features_, targets_ = convert_obs_to_proba(features, targets)

print(features_)
print(targets_)

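This function:

extracts the unique feature rows;
counts, for each unique row, how many observations have target 0 and how many have target 1;
computes the probability and builds the result.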

Can this be solved more elegantly with some advanced NumPy magic?
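
One possible NumPy-only take, offered as a minimal sketch and not as a definitive answer: it assumes a NumPy version that supports `np.unique(..., axis=0)` (1.13 or later), and the function name `convert_obs_to_proba_np` is made up for this example. Unlike the loop-based version, it returns the unique rows in lexicographically sorted order.

```python
import numpy as np

features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
targets = np.array([1, 0, 1, 1, 0, 0])

def convert_obs_to_proba_np(features, targets):
    """Vectorized sketch (hypothetical name): unique rows and P(target == 1) per row."""
    # unique_rows: sorted unique rows; inverse: for each original row,
    # the index of its unique row
    unique_rows, inverse = np.unique(features, axis=0, return_inverse=True)
    inverse = inverse.ravel()  # guard: some NumPy releases return a 2-D inverse here
    n_groups = len(unique_rows)
    # Number of observations per unique row
    totals = np.bincount(inverse, minlength=n_groups)
    # Sum of targets per unique row, i.e. the number of ones
    ones = np.bincount(inverse, weights=targets, minlength=n_groups)
    return unique_rows, ones / totals

features_, targets_ = convert_obs_to_proba_np(features, targets)
print(features_)
print(targets_)
```

Both `np.bincount` calls are a single pass over the data, so the cost is dominated by the sort inside `np.unique`.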

UPDATE: The previous code was very inefficient, O(n^2), so I converted it into something more performance-friendly. The old code:

import numpy as np

# raw observations
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])

targets = np.array([1, 0, 1, 1, 0, 0])

def convert_obs_to_proba(features, targets):
    features_ = []
    targets_ = []

    # compute unique rows (idx will point to some representative)
    b = np.ascontiguousarray(features).view(np.dtype((np.void, features.dtype.itemsize * features.shape[1])))
    _, idx = np.unique(b, return_index=True)

    idx = idx[::-1]

    # calculate ZERO class occurrences and ONE class occurrences
    for k in idx:
        unique_row = features[k]

        zeros = 0
        ones = 0

        for i, row in enumerate(features[:]):        
            if np.array_equal(row, unique_row):            
                if targets[i] == 0:
                    zeros += 1
                else:
                    ones += 1

        proba = float(ones) / float(zeros + ones)

        features_.append(unique_row)
        targets_.append(proba)

    return np.array(features_), np.array(targets_)

features_, targets_ = convert_obs_to_proba(features, targets)

print(features_)
print(targets_)

Answer 1

Joh*_*nck 5

This is easy with pandas:

import pandas as pd

df = pd.DataFrame(features)
df['targets'] = targets

Now you have:

   0  1  2  targets
0  1  1  0        1
1  1  1  0        0
2  0  1  0        1
3  0  1  0        1
4  0  1  0        0
5  0  0  1        0

Now, the fancy part:

df.groupby([0,1,2]).targets.mean()

Which gives you:

0  1  2
0  0  1    0.000000
   1  0    0.666667
1  1  0    0.500000
Name: targets, dtype: float64

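pandas does not print the 0 in the leftmost column of the 0.666667 row (repeated index levels are left blank in the display), but if you check the value there, it is indeed 0.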
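
To see that value explicitly, and to get plain arrays back in the shape the question asks for, one option is to flatten the grouped result with `reset_index()`. This is a sketch under the assumption of a reasonably recent pandas (`to_numpy()` needs 0.24 or later); the variable name `flat` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
targets = np.array([1, 0, 1, 1, 0, 0])

df = pd.DataFrame(features)
df['targets'] = targets

# reset_index() turns the group keys back into ordinary columns,
# so every row shows all three feature values
flat = df.groupby([0, 1, 2])['targets'].mean().reset_index()
print(flat)

# Split the flat table back into a feature matrix and a probability vector
features_ = flat[[0, 1, 2]].to_numpy()
targets_ = flat['targets'].to_numpy()
```

The groups come back sorted by key, so the row order differs from the order of first appearance in `features`.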
