Python: add weights associated with the values of a column

Question

I am working with a very large DataFrame. Here is a sample:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'X', 'X', 'Y'],
})
print(df)
  ID
0  A
1  A
2  A
3  X
4  X
5  Y

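Now, given the frequency of each value in the 'ID' column, I want to compute weights with the function below and add a column holding the weight associated with each value in 'ID'.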

def get_weights_inverse_num_of_samples(label_counts, power=1.):
    no_of_classes = len(label_counts)
    weights_for_samples = 1.0/np.power(np.array(label_counts), power)
    weights_for_samples = weights_for_samples/ np.sum(weights_for_samples)*no_of_classes
    return weights_for_samples

freq = df.value_counts()
print(freq)
ID
A     3
X     2
Y     1

weights = get_weights_inverse_num_of_samples(freq)
print(weights)
[0.54545455 0.81818182 1.63636364]
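
The printed weights can be checked by hand. Below is a minimal sketch that redoes the same normalization with exact fractions, assuming power=1; the variable names are illustrative and not part of the original post. Each label gets 1/count, and the result is rescaled so that the weights sum to the number of classes.

```python
from fractions import Fraction

# Counts taken from the value_counts() output above.
counts = {"A": 3, "X": 2, "Y": 1}

# Inverse frequency with power = 1: 1/3, 1/2, 1
inverse = {label: Fraction(1, n) for label, n in counts.items()}

# Sum of the inverse frequencies: 1/3 + 1/2 + 1 = 11/6
total = sum(inverse.values())

# Rescale so the weights sum to the number of classes (3 here).
weights = {label: w / total * len(counts) for label, w in inverse.items()}

for label, w in weights.items():
    print(label, w, round(float(w), 8))
# A 6/11 0.54545455
# X 9/11 0.81818182
# Y 18/11 1.63636364
```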

So I am looking for an efficient way to get a DataFrame like this, given the weights above:

   ID  sample_weight
0  A   0.54545455
1  A   0.54545455
2  A   0.54545455
3  X   0.81818182
4  X   0.81818182
5  Y   1.63636364

Answer 1

Cam*_*ell 7

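If you lean more on duck typing, you can rewrite the function so that it returns the same type it receives as input.

That way the weights stay a Series, and you do not need to reach back into `.index` explicitly before calling `.map`.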

import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'A', 'X', 'X', 'Y']})

def get_weights_inverse_num_of_samples(label_counts, power=1):
    """Using object methods here instead of coercing to numpy ndarray"""

    no_of_classes = len(label_counts)
    weights_for_samples = 1 / (label_counts ** power)
    return weights_for_samples / weights_for_samples.sum() * no_of_classes

# select the column before using `.value_counts()`
#   this saves us from ending up with a `MultiIndex` Series
freq = df['ID'].value_counts() 

weights = get_weights_inverse_num_of_samples(freq)

print(weights)
# A    0.545455
# X    0.818182
# Y    1.636364

# note that now our weights are still a `pd.Series` 
#  that we can align directly against our `"ID"` column

df['sample_weight'] = df['ID'].map(weights)

print(df)
#   ID  sample_weight
# 0  A       0.545455
# 1  A       0.545455
# 2  A       0.545455
# 3  X       0.818182
# 4  X       0.818182
# 5  Y       1.636364
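
If this is needed in more than one place, the steps above could be wrapped into a single function. This is only a sketch: the name `add_sample_weights` and its `out` parameter are made up for illustration, and it returns a new DataFrame via `assign` instead of modifying the input in place.

```python
import pandas as pd


def add_sample_weights(df, column, power=1.0, out="sample_weight"):
    """Hypothetical helper: add inverse-frequency weights for `column`."""
    counts = df[column].value_counts()         # Series indexed by label
    inverse = 1.0 / counts.astype(float) ** power
    # Rescale so the per-class weights sum to the number of classes.
    weights = inverse / inverse.sum() * len(counts)
    # Mapping with a Series looks each label up in the Series index.
    return df.assign(**{out: df[column].map(weights)})


df = pd.DataFrame({"ID": ["A", "A", "A", "X", "X", "Y"]})
result = add_sample_weights(df, "ID")
print(result)
```

Passing a different `power` (for example 0.5) softens the weighting without changing the rest of the code.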

Quack! +1 (2 upvotes)

Answer 2

moz*_*way 6

You can use map with the following values:

df['sample_weight'] = df['ID'].map(dict(zip(freq.index.get_level_values(0), weights)))

Note: value_counts called on the whole DataFrame (df.value_counts()) returns a Series with a single-level MultiIndex, hence the need for get_level_values.
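
To see the difference for yourself, a quick check along these lines may help. It is a sketch only: the index type reported for the DataFrame call may depend on the installed pandas version, so no output is shown. `get_level_values(0)` is used because it is available on both a plain Index and a MultiIndex.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["A", "A", "A", "X", "X", "Y"]})

frame_counts = df.value_counts()         # DataFrame.value_counts
series_counts = df["ID"].value_counts()  # Series.value_counts

# Inspect the index type and number of levels of each result.
print(type(frame_counts.index).__name__, frame_counts.index.nlevels)
print(type(series_counts.index).__name__, series_counts.index.nlevels)

# Either way, level 0 holds the plain labels in the same order.
labels = frame_counts.index.get_level_values(0)
print(list(labels) == list(series_counts.index))
```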

As pointed out by @ScottBoston, a better approach is to use:

freq = df['ID'].value_counts()

df['sample_weight'] = df['ID'].map(dict(zip(freq.index, weights)))

Output:

  ID  sample_weight
0  A       0.545455
1  A       0.545455
2  A       0.545455
3  X       0.818182
4  X       0.818182
5  Y       1.636364
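
The dict(zip(...)) approach works because the question's function returns a NumPy array in the same order as freq.index, so labels and weights can be paired by position. Here is a self-contained sketch of that idea, reusing the function from the question; the names `lookup`, `via_dict` and `via_series` are illustrative. It also compares the result with mapping through a labelled Series.

```python
import numpy as np
import pandas as pd


def get_weights_inverse_num_of_samples(label_counts, power=1.0):
    # Function from the question: returns a plain NumPy array.
    no_of_classes = len(label_counts)
    weights = 1.0 / np.power(np.array(label_counts), power)
    return weights / np.sum(weights) * no_of_classes


df = pd.DataFrame({"ID": ["A", "A", "A", "X", "X", "Y"]})
freq = df["ID"].value_counts()
weights = get_weights_inverse_num_of_samples(freq)  # same order as freq.index

# Pair each label with its weight by position.
lookup = dict(zip(freq.index, weights))
via_dict = df["ID"].map(lookup)

# Equivalent: give the array its labels back and map with a Series.
via_series = df["ID"].map(pd.Series(weights, index=freq.index))

print(np.allclose(via_dict, via_series))  # True
```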

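I was about to post the same solution; I spent a few extra seconds explaining why we get a MultiIndex in the first place. +1 `dict(zip(df['ID'].value_counts().index, weights))` (3 upvotes)
@ScottBoston. Thanks for your reply, now I see why. Try `import dis; dis.dis(df.value_counts); dis.dis(df['ID'].value_counts)` (3 upvotes)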