Pandas: representative sampling across multiple columns

Question


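I have a dataframe representing a population, where each column describes a different quality or characteristic of a person. How can I draw a sample from that dataframe that is representative of the whole population across all of those characteristics?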

Suppose I have a dataframe representing a workforce of 650 people, built as follows:

import pandas as pd
import numpy as np
c = np.random.choice

colours = ['blue', 'yellow', 'green', 'green... no, blue']
knights = ['Bedevere', 'Galahad', 'Arthur', 'Robin', 'Lancelot']
qualities = ['wise', 'brave', 'pure', 'not quite so brave']

df = pd.DataFrame({'name_id': c(range(3000), 650, replace=False),
                   'favourite_colour': c(colours, 650),
                   'favourite_knight': c(knights, 650),
                   'favourite_quality': c(qualities, 650)})


I can draw a sample from this that reflects the distribution of a single column, like so:

# Find the distribution of a particular column using value_counts and normalize:
knight_weight = df['favourite_knight'].value_counts(normalize=True)

# Add this to my dataframe as a weights column:
df['knight_weight'] = df['favourite_knight'].apply(lambda x: knight_weight[x])

# Then sample my dataframe using the weights column I just added as the 'weights' argument:
df_sample = df.sample(140, weights=df['knight_weight'])


This returns a sample dataframe (df_sample) such that:

df_sample['favourite_knight'].value_counts(normalize=True)
is approximately equal to
df['favourite_knight'].value_counts(normalize=True)
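
One way to make "approximately equal" concrete is to measure the largest gap between the two normalised distributions. The sketch below is only an illustration: the helper name max_share_gap and the small demo frame are hypothetical, not part of the original post, and the uniform draw is there just to have a sample to compare against.

```python
import numpy as np
import pandas as pd


def max_share_gap(population, sample, column):
    """Largest absolute difference in category share between two frames."""
    pop_share = population[column].value_counts(normalize=True)
    sample_share = sample[column].value_counts(normalize=True)
    # Categories missing from the sample count as a share of 0.
    sample_share = sample_share.reindex(pop_share.index, fill_value=0.0)
    return float((pop_share - sample_share).abs().max())


# Hypothetical demo data shaped like the question's 'favourite_knight' column.
rng = np.random.default_rng(0)
knights = ['Bedevere', 'Galahad', 'Arthur', 'Robin', 'Lancelot']
demo = pd.DataFrame({'favourite_knight': rng.choice(knights, 650)})
demo_sample = demo.sample(140, random_state=0)

gap = max_share_gap(demo, demo_sample, 'favourite_knight')
print(f"largest gap in category share: {gap:.3f}")
```

A gap of 0 means the sample reproduces the population shares exactly; what counts as "close enough" is a judgment call for the data at hand.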


My question is: how can I generate a sample dataframe (df_sample) such that the same property, i.e.:

df_sample[column].value_counts(normalize=True)
is approximately equal to
df[column].value_counts(normalize=True)


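holds for every column (except 'name_id'), not just one of them? A population of 650 with a sample size of 140 is roughly the scale I am working at, so performance is not much of a concern. I am happy to accept a solution that takes a few minutes to run, since that would still be far quicker than producing the sample by hand. Thanks for any help.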

Answer 1

Pat*_*ner 6

Create a combined feature column, compute a weight for each combination, and sample using those weights:

df["combined"] = list(zip(df["favourite_colour"],
                          df["favourite_knight"],
                          df["favourite_quality"]))

combined_weight = df['combined'].value_counts(normalize=True)

df['combined_weight'] = df['combined'].apply(lambda x: combined_weight[x])

df_sample = df.sample(140, weights=df['combined_weight'])


This needs one extra step: divide each weight by the number of rows that share it, so that the weights sum to 1. See Ehsan Fathi's post.
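
A minimal sketch of that extra step, as I read it. Without the division, a combination held by n rows contributes n times its share to the total weight, so its chance of being drawn scales with the square of its share. The names combo_count, combo_share and stratified are my own, and the data is regenerated here only to keep the example self-contained. The last line shows an alternative that is not from the answer: stratified sampling with groupby(...).sample, which I am assuming is available (pandas 1.1 or later).

```python
import numpy as np
import pandas as pd

# Rebuild data shaped like the question's dataframe (seeded for repeatability).
rng = np.random.default_rng(0)
colours = ['blue', 'yellow', 'green', 'green... no, blue']
knights = ['Bedevere', 'Galahad', 'Arthur', 'Robin', 'Lancelot']
qualities = ['wise', 'brave', 'pure', 'not quite so brave']
df = pd.DataFrame({
    'name_id': rng.choice(3000, 650, replace=False),
    'favourite_colour': rng.choice(colours, 650),
    'favourite_knight': rng.choice(knights, 650),
    'favourite_quality': rng.choice(qualities, 650),
})
cols = ['favourite_colour', 'favourite_knight', 'favourite_quality']

# For every row: how many rows share its combination of the three columns.
combo_count = df.groupby(cols)['name_id'].transform('size')
# Share of the population held by that combination (the answer's weight).
combo_share = combo_count / len(df)
# The extra step: split each combination's share evenly across its rows,
# so the per-row weights sum to 1.
df['combined_weight'] = combo_share / combo_count

df_sample = df.sample(140, weights=df['combined_weight'], random_state=0)

# Alternative (not from the answer): sample the same fraction from each
# combination. Per-group rounding means the total may not be exactly 140.
stratified = df.groupby(cols).sample(frac=140 / len(df), random_state=0)
```

Note that after the division every row ends up with the same weight, 1 / len(df), so this weighted draw behaves like a plain df.sample(140). If tighter control over each combination's share is wanted, the stratified variant may be closer to the goal, at the cost of a sample size that is only approximately 140.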
