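How can I sample a pandas DataFrame or a GraphLab SFrame according to a given class/label distribution? For example, I want to sample a dataframe that has a label/class column so that every class label is drawn equally often, giving similar frequencies for each class label, i.e. a uniform distribution over the class labels. Better still, I would like to draw the sample according to whatever class distribution we want.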
+------+-------+-------+
| col1 | clol2 | class |
+------+-------+-------+
| 4    | 45    | A     |
+------+-------+-------+
| 5    | 66    | B     |
+------+-------+-------+
| 5    | 6     | C     |
+------+-------+-------+
| 4    | 6     | C     |
+------+-------+-------+
| 321  | 1     | A     |
+------+-------+-------+
| 32   | 432   | B     |
+------+-------+-------+
| 5    | 3     | B     |
+------+-------+-------+

Given a huge dataframe like the one above and the required frequency distribution like below: …
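One possible way to approach the sampling described above, sketched in pandas only (not GraphLab): draw rows from each class separately, with the number of rows per class set by the target distribution. The helper name `sample_by_distribution` and its parameters `fractions` and `n_total` are hypothetical names chosen for this illustration, and the small dataframe just reproduces the example rows. A class that has fewer rows than requested is capped at what is available, which is an assumption about the desired behavior.

```python
import pandas as pd


def sample_by_distribution(df, label_col, fractions, n_total, seed=0):
    """Sample about n_total rows so that class shares follow `fractions`.

    `fractions` maps each class label to its desired share (summing to 1).
    Hypothetical helper for illustration, not a pandas API.
    """
    parts = []
    for label, frac in fractions.items():
        group = df[df[label_col] == label]
        # Cap at the group size, since we sample without replacement.
        n = min(int(round(frac * n_total)), len(group))
        parts.append(group.sample(n=n, random_state=seed))
    # Shuffle the combined rows so the classes are not grouped together.
    return pd.concat(parts).sample(frac=1, random_state=seed)


df = pd.DataFrame({
    "col1": [4, 5, 5, 4, 321, 32, 5],
    "clol2": [45, 66, 6, 6, 1, 432, 3],
    "class": ["A", "B", "C", "C", "A", "B", "B"],
})

# Uniform target distribution: each class gets one third of 6 rows.
balanced = sample_by_distribution(
    df, "class", {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}, n_total=6
)
print(balanced["class"].value_counts())
```

Passing unequal shares, for example `{"A": 0.5, "B": 0.25, "C": 0.25}`, would request a non-uniform class distribution in the same way. Sampling with replacement (`replace=True` in `DataFrame.sample`) is an alternative when a class is too small to supply its share.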
Consider the following code:
one, two = sales.random_split(0.5, seed=0)
set_1, set_2 = one.random_split(0.5, seed=0)
set_3, set_4 = two.random_split(0.5, seed=0)
What I am trying to do in this code is randomly split the data in the sales SFrame (similar to a pandas DataFrame) into 4 roughly equal parts.
What is a Pythonic/efficient way to achieve this?
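As a rough illustration of the idea, here is a sketch for a pandas DataFrame rather than an SFrame: shuffle the row positions once and cut them into k nearly equal chunks. The helper name `random_equal_split` and the toy `sales` frame are assumptions made for this example; the SFrame-based code in the question is a different API, and its nested `random_split` calls only give approximately equal sizes because each split is random.

```python
import numpy as np
import pandas as pd


def random_equal_split(df, k, seed=0):
    """Randomly partition df into k parts whose sizes differ by at most 1.

    Hypothetical helper for illustration, not a pandas or SFrame API.
    """
    rng = np.random.default_rng(seed)
    # One random permutation of the row positions 0..len(df)-1.
    shuffled = rng.permutation(len(df))
    # np.array_split allows chunks of unequal length when k does not divide len(df).
    return [df.iloc[idx] for idx in np.array_split(shuffled, k)]


# Toy stand-in for the sales data.
sales = pd.DataFrame({"id": range(10), "amount": range(100, 110)})

set_1, set_2, set_3, set_4 = random_equal_split(sales, 4)
print([len(part) for part in (set_1, set_2, set_3, set_4)])
```

A single shuffle followed by a deterministic cut guarantees that every row lands in exactly one part and that the part sizes are as equal as possible, which repeated probabilistic splits do not guarantee.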