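Suppose I want to take a stratified sample from a Pandas data frame so that I get 5% of the rows for every value of a given column. How can I do that?
For example, in the data frame below, I would like to sample 5% of the rows associated with each value of the column Z. Is there a way to sample from groups in a data frame that is loaded in memory?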
> df
X Y Z
1 123 a
2 89 b
1 234 a
4 893 a
6 234 b
2 893 b
3 200 c
5 583 c
2 583 c
6 100 c
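More generally, what if this data frame is on disk in a huge file (e.g. an 8 GB CSV file)? Is there a way to do this sampling without having to load the entire data frame into memory?
How about using the "usecols" option to load only the "Z" column into memory? Say the file is sample.csv. If you have a bunch of columns, that should use much less memory. Then, assuming the column fits in memory, I think this will work for you.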
import numpy as np
import pandas as pd

stratfraction = 0.05
# Load only the Z column
df = pd.read_csv('sample.csv', usecols=['Z'])
# Group the row labels by value of Z
gp = df.groupby('Z')
# Number of rows to keep per group (rounded up, so every group keeps at least one row)
n_per_group = np.ceil(gp.size() * stratfraction).astype(int)
# Collect the indices of the requested sample (the first entries of each group, not a random draw)
stratsample = []
for key, idx in gp.groups.items():
    stratsample.extend(idx[:n_per_group.loc[key]])
# Build the set of file lines to skip, since read_csv doesn't have a "rows to keep" option.
# Line 0 of the file is the header, so data row i sits on file line i + 1.
keep_lines = {int(i) + 1 for i in stratsample}
rows_to_skip = set(range(1, len(df) + 1)) - keep_lines
# Load only the requested rows (no idea how well this works for a really giant list though)
df3 = pd.read_csv('sample.csv', skiprows=rows_to_skip)
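If a random draw is preferred over the first rows of each group, the same two-pass idea can be sketched as follows. This is only an illustration under some assumptions: the function name `stratified_sample_csv` and its parameters are made up here, `groupby(...).sample` needs pandas 1.1 or later, and the stratification column alone still has to fit in memory. Note that `sample(frac=...)` rounds the per-group size to the nearest integer, so very small groups may contribute no rows.

```python
import pandas as pd


def stratified_sample_csv(path, column, frac, seed=0):
    """Hypothetical helper: roughly `frac` of the rows per value of `column`."""
    # First pass: hold only the stratification column in memory.
    keys = pd.read_csv(path, usecols=[column])
    # Randomly pick row positions within each group.
    picked = keys.groupby(column).sample(frac=frac, random_state=seed).index
    # Data row i sits on file line i + 1 because line 0 is the header.
    keep_lines = {int(i) + 1 for i in picked}
    keep_lines.add(0)  # always keep the header line
    # Second pass: skip every file line that was not picked.
    return pd.read_csv(path, skiprows=lambda line: line not in keep_lines)
```

Something like `stratified_sample_csv('sample.csv', 'Z', 0.05)` would then return the sampled rows with all of their columns. Passing a callable to `skiprows` avoids building a huge list of rows to skip, at the cost of one Python call per line.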
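For the first part of the question, where the data frame is already in memory, a minimal sketch could use `groupby(...).sample`, assuming pandas 1.1 or later. The helper name `stratified_sample` is hypothetical, and the example uses `frac=0.5` only because the toy frame from the question is too small for 5% to yield any rows.

```python
import pandas as pd


def stratified_sample(frame, column, frac, seed=0):
    """Hypothetical helper: sample `frac` of the rows within each value of `column`."""
    return frame.groupby(column).sample(frac=frac, random_state=seed)


# The example frame from the question.
df = pd.DataFrame({
    "X": [1, 2, 1, 4, 6, 2, 3, 5, 2, 6],
    "Y": [123, 89, 234, 893, 234, 893, 200, 583, 583, 100],
    "Z": ["a", "b", "a", "a", "b", "b", "c", "c", "c", "c"],
})

# Half of each group, drawn at random but reproducibly.
sample = stratified_sample(df, "Z", frac=0.5)
print(sample)
```

The sampled rows keep their original index labels, so they can be traced back to the source frame.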