arm*_*min 6 python dataframe pandas
I am working with a very large DataFrame. Here is a sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'X', 'X', 'Y'],
})
ID
0 A
1 A
2 A
3 X
4 X
5 Y
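Now, given the frequency of each value in the "ID" column, I want to compute weights using the function below and add a column holding the weight associated with each value in "ID".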
def get_weights_inverse_num_of_samples(label_counts, power=1.):
    no_of_classes = len(label_counts)
    weights_for_samples = 1.0 / np.power(np.array(label_counts), power)
    weights_for_samples = weights_for_samples / np.sum(weights_for_samples) * no_of_classes
    return weights_for_samples
freq = df.value_counts()
print(freq)
ID
A 3
X 2
Y 1
weights = get_weights_inverse_num_of_samples(freq)
print(weights)
[0.54545455 0.81818182 1.63636364]
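The arithmetic behind those three numbers can be checked by hand. The following is a minimal sketch, assuming the counts 3, 2, 1 shown above and power=1; it uses exact fractions instead of NumPy, and the dictionary names are purely illustrative.

```python
from fractions import Fraction

counts = {"A": 3, "X": 2, "Y": 1}  # frequencies from the example above
power = 1

# Step 1: invert each count (raised to `power`).
inverse = {label: Fraction(1, n ** power) for label, n in counts.items()}

# Step 2: normalize so the inverses sum to 1, then scale by the number of classes.
total = sum(inverse.values())
weights = {label: inv / total * len(counts) for label, inv in inverse.items()}

for label, w in weights.items():
    print(label, w, float(w))
```

The exact weights come out as 6/11, 9/11 and 18/11, which match the printed array, and they sum to the number of classes (3).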
So I am looking for an efficient way to get a DataFrame like this, given the weights above:
ID sample_weight
0 A 0.54545455
1 A 0.54545455
2 A 0.54545455
3 X 0.81818182
4 X 0.81818182
5 Y 1.63636364
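If you rely a bit more on duck typing, you can rewrite your function so that it returns the same type it receives as input.
That saves you from having to reach back into .index explicitly before calling .map.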
import pandas as pd
df = pd.DataFrame({'ID': ['A', 'A', 'A', 'X', 'X', 'Y']})
def get_weights_inverse_num_of_samples(label_counts, power=1):
    """Using object methods here instead of coercing to numpy ndarray"""
    no_of_classes = len(label_counts)
    weights_for_samples = 1 / (label_counts ** power)
    return weights_for_samples / weights_for_samples.sum() * no_of_classes
# select the column before using `.value_counts()`
# this saves us from ending up with a `MultiIndex` Series
freq = df['ID'].value_counts()
weights = get_weights_inverse_num_of_samples(freq)
print(weights)
# A 0.545455
# X 0.818182
# Y 1.636364
# note that now our weights are still a `pd.Series`
# that we can align directly against our `"ID"` column
df['sample_weight'] = df['ID'].map(weights)
print(df)
# ID sample_weight
# 0 A 0.545455
# 1 A 0.545455
# 2 A 0.545455
# 3 X 0.818182
# 4 X 0.818182
# 5 Y 1.636364
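The reason a Series works as the mapper is that Series.map looks each value up by index label, not by position. A small sketch of that behaviour, using made-up weights (0.5 and 0.8 are hypothetical, not the values computed above) that are deliberately out of order and missing an entry for "Y":

```python
import pandas as pd

df = pd.DataFrame({"ID": ["A", "A", "A", "X", "X", "Y"]})

# Hypothetical weights: listed in a different order than the rows of `df`,
# and with no entry at all for "Y".
weights = pd.Series({"X": 0.8, "A": 0.5})

# `Series.map` looks each ID up in the index of `weights`.
mapped = df["ID"].map(weights)
print(mapped.tolist())
```

Rows with "A" and "X" pick up their weights regardless of the order in `weights`, and the "Y" row, which has no matching label, comes back as NaN. So it is worth checking for missing values if the counts were computed on different data than the column being mapped.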
You can use map, pairing each value with its weight:
df['sample_weight'] = df['ID'].map(dict(zip(freq.index.get_level_values(0), weights)))
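NB. Called on the whole DataFrame, value_counts returns a Series whose index is a MultiIndex with a single level, hence the need for get_level_values.
As @ScottBoston pointed out, a better approach is to call value_counts on the column instead: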
freq = df['ID'].value_counts()
df['sample_weight'] = df['ID'].map(dict(zip(freq.index, weights)))
Output:
ID sample_weight
0 A 0.545455
1 A 0.545455
2 A 0.545455
3 X 0.818182
4 X 0.818182
5 Y 1.636364
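To see why get_level_values is needed in one case and not the other, the two indexes can be compared side by side. This is only a sketch: the variable names are illustrative, and the exact index type returned by DataFrame.value_counts is what I have observed in the pandas versions I know of, so it is printed here, not assumed.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["A", "A", "A", "X", "X", "Y"]})

frame_counts = df.value_counts()         # counts unique rows of the whole frame
column_counts = df["ID"].value_counts()  # counts values of a single column

# Inspect the index types and their entries.
print(type(frame_counts.index).__name__, list(frame_counts.index))
print(type(column_counts.index).__name__, list(column_counts.index))

# Extracting level 0 from the frame-level result gives the same labels
# that the column-level result carries directly.
labels_from_frame = list(frame_counts.index.get_level_values(0))
labels_from_column = list(column_counts.index)
```

With the frame-level counts, each index entry is a tuple such as ('A',), which is why zipping the raw index with the weights would not give keys that match the strings in the "ID" column.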