Sampling a dataframe according to a given distribution

sta*_*kit 9 python pandas graphlab sframe

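How can I sample a pandas DataFrame or a GraphLab SFrame according to a given distribution over class labels? For example, I want to sample a dataframe that has a label/class column so that every class label is drawn equally often, giving similar frequencies per class label, i.e. a uniform distribution over the class labels. Better still would be to draw the sample according to whatever class distribution we specify.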

+------+-------+-------+
| col1 | clol2 | class |
+------+-------+-------+
| 4    | 45    | A     |
+------+-------+-------+
| 5    | 66    | B     |
+------+-------+-------+
| 5    | 6     | C     |
+------+-------+-------+
| 4    | 6     | C     |
+------+-------+-------+
| 321  | 1     | A     |
+------+-------+-------+
| 32   | 432   | B     |
+------+-------+-------+
| 5    | 3     | B     |
+------+-------+-------+

Given a huge dataframe like the one above and a required frequency distribution like the one below:
+-------+--------------+
| class | nostoextract |
+-------+--------------+
| A     | 2            |
+-------+--------------+
| B     | 2            |
+-------+--------------+
| C     | 2            |
+-------+--------------+


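Rows should be extracted from the first dataframe according to the frequency distribution given in the second one, where the counts are given in the nostoextract column, yielding a sampled frame in which each class appears at most 2 times. If there are not enough rows of a class to reach the required count, that should be ignored and sampling should continue. The resulting dataframe will be used for a decision-tree-based classifier.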

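To address a commenter's question: the sampled dataframe should contain nostoextract distinct instances of the corresponding class, unless a given class does not have enough examples, in which case all available rows of that class are used.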

Tho*_*ber 5

Could you split the first dataframe into class-specific sub-dataframes and then sample from each of them as you like?

For example:

dfa = df[df['class']=='A']
dfb = df[df['class']=='B']
dfc = df[df['class']=='C']
....

Then, once dfa, dfb and dfc have been split/created/filtered, take as many rows as you need from the top (if the dataframe has no particular sort order):

 dfasamplefive = dfa[:5]

Or use the sample method described by the previous commenter to draw a random sample directly:

dfasamplefive = dfa.sample(n=5)

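If this meets your needs, all that remains is to automate the process, feeding in the number to sample for each class from the control dataframe you have, i.e. the second dataframe containing the required sample counts.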
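The automation step could be sketched roughly as below. This is only one possible way to do it, not necessarily what the answer had in mind: the helper name sample_by_counts and its parameters are hypothetical, and the example data is copied from the question. Note that sample raises an error when asked for more rows than a frame has (without replacement), so the requested count is capped at the number of rows available.

```python
import pandas as pd

# Example data copied from the question (column names as in its table).
df = pd.DataFrame({
    "col1": [4, 5, 5, 4, 321, 32, 5],
    "clol2": [45, 66, 6, 6, 1, 432, 3],
    "class": ["A", "B", "C", "C", "A", "B", "B"],
})
control = pd.DataFrame({"class": ["A", "B", "C"], "nostoextract": [2, 2, 2]})


def sample_by_counts(df, control, label_col="class",
                     count_col="nostoextract", random_state=None):
    """Draw up to the requested number of rows per class, without replacement.

    Hypothetical helper: the name and signature are illustrative only.
    """
    parts = []
    for label, wanted in zip(control[label_col], control[count_col]):
        subset = df[df[label_col] == label]
        # Cap the request: if the class is too small, take every row it has.
        n = min(int(wanted), len(subset))
        if n == 0:
            continue  # nothing to draw for this class
        parts.append(subset.sample(n=n, random_state=random_state))
    # Fall back to an empty frame with the same columns if nothing was drawn.
    return pd.concat(parts) if parts else df.iloc[:0]


sampled = sample_by_counts(df, control, random_state=0)
print(sampled)
```

Passing random_state makes the draw repeatable; leave it as None for a different sample on each run.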


swe*_*zel 4

I think this will solve your problem:

import pandas as pd

data = pd.DataFrame({'cols1':[4, 5, 5, 4, 321, 32, 5],
                     'clol2':[45, 66, 6, 6, 1, 432, 3],
                     'class':['A', 'B', 'C', 'C', 'A', 'B', 'B']})

freq = pd.DataFrame({'class':['A', 'B', 'C'],
                     'nostoextract':[2, 2, 2], })

def bootstrap(data, freq):
    freq = freq.set_index('class')

    # This function will be applied on each group of instances of the same
    # class in `data`.
    def sampleClass(classgroup):
        cls = classgroup['class'].iloc[0]
        nDesired = freq.nostoextract[cls]
        nRows = len(classgroup)

        nSamples = min(nRows, nDesired)
        return classgroup.sample(nSamples)

    samples = data.groupby('class').apply(sampleClass)

    # If you want a new index with ascending values
    # samples.index = range(len(samples))

    # If you want an index which is equal to the row in `data` where the sample
    # came from
    samples.index = samples.index.get_level_values(1)

    # If you don't change it then you'll have a multiindex with level 0
    # being the class and level 1 being the row in `data` where
    # the sample came from.

    return samples

print(bootstrap(data,freq))

Prints (for example; sample is random, so the rows drawn for class B can differ between runs):

  class  clol2  cols1
0     A     45      4
4     A      1    321
1     B     66      5
5     B    432     32
3     C      6      4
2     C      6      5

If you don't want the result ordered by class, you can permute it at the end.
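One way to do that final permutation, as a minimal sketch: call sample with frac=1, which draws every row exactly once in random order. The small frame below is just a stand-in that mirrors the class-ordered output printed above, and the random_state value is arbitrary.

```python
import pandas as pd

# Stand-in for the class-ordered result printed above.
samples = pd.DataFrame(
    {"class": ["A", "A", "B", "B", "C", "C"],
     "cols1": [4, 321, 5, 32, 4, 5]},
    index=[0, 4, 1, 5, 3, 2],
)

# frac=1 draws every row exactly once, i.e. a random permutation of the rows.
shuffled = samples.sample(frac=1, random_state=42)

# Optionally discard the original row labels after shuffling.
shuffled_reset = shuffled.reset_index(drop=True)

print(shuffled)
```

Keeping the original index (as in shuffled) preserves the link back to the rows of data; resetting it is only useful if that link is no longer needed.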