Shi*_*ith 2 python-3.x pandas train-test-split
I have a dataframe as follows:
import pandas as pd

df = pd.DataFrame({"Col1": ['A','B','B','A','B','B','A','B','A','A'],
                   "Col2": [-2.21,-9.59,0.16,1.29,-31.92,-24.48,15.23,34.58,24.33,-3.32],
                   "Col3": [-0.27,-0.57,0.072,-0.15,-0.21,-2.54,-1.06,1.94,1.83,0.72],
                   "y": [-1,1,-1,-1,-1,1,1,1,1,-1]})
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
5 B -24.48 -2.540 1
6 A 15.23 -1.060 1
7 B 34.58 1.940 1
8 A 24.33 1.830 1
9 A -3.32 0.720 -1
Is there a way to split the dataframe 60:40 so that, within each group of Col1, the first 60% of the rows go to the training set and the last 40% go to the test set?
Train:
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
6 A 15.23 -1.060 1
Test:
Col1 Col2 Col3 y
5 B -24.48 -2.540 1
7 B 34.58 1.940 1
8 A 24.33 1.830 1
9 A -3.32 0.720 -1
I think you need groupby here:
s = df.groupby('Col1').Col1.cumcount()  # 0-based position of each row within its group
s = s // (df.groupby('Col1').Col1.transform('count') * 0.6).astype(int)  # 0 for the first 60% of each group, >= 1 for the rest
Train=df.loc[s==0].copy()
Test=df.drop(Train.index)
Train
Out[118]:
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
6 A 15.23 -1.060 1
Test
Out[119]:
Col1 Col2 Col3 y
5 B -24.48 -2.54 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1
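The same idea can be wrapped in a small helper that compares each row's position in its group against a per-group cutoff instead of using floor division. This is only a sketch: the name split_by_group and the train_frac parameter are hypothetical and not part of the answer above, and the cutoff is truncated with astype(int) to match the answer.

```python
import pandas as pd


def split_by_group(df, group_col, train_frac=0.6):
    """Return (train, test); the first train_frac of each group's rows go to train."""
    grouped = df.groupby(group_col)[group_col]
    position = grouped.cumcount()             # 0-based position of each row within its group
    group_size = grouped.transform("count")   # size of the row's group, aligned to df's index
    n_train = (group_size * train_frac).astype(int)  # per-group cutoff, truncated toward zero
    mask = position < n_train                 # True for rows before the cutoff
    return df[mask], df[~mask]


df = pd.DataFrame({"Col1": ['A','B','B','A','B','B','A','B','A','A'],
                   "Col2": [-2.21,-9.59,0.16,1.29,-31.92,-24.48,15.23,34.58,24.33,-3.32],
                   "Col3": [-0.27,-0.57,0.072,-0.15,-0.21,-2.54,-1.06,1.94,1.83,0.72],
                   "y": [-1,1,-1,-1,-1,1,1,1,1,-1]})

train, test = split_by_group(df, "Col1")
print(train.index.tolist())  # [0, 1, 2, 3, 4, 6]
print(test.index.tolist())   # [5, 7, 8, 9]
```

Because the cutoff is computed per group, groups of different sizes each get their own 60:40 split.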
If you need to split without grouping:
thresh = int(len(df) * 0.6)
train = df.iloc[:thresh]
test = df.iloc[thresh:]
print(train)
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
5 B -24.48 -2.540 1
print(test)
Col1 Col2 Col3 y
6 A 15.23 -1.06 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1
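EDIT: If you need to split by group, create a threshold and filter with GroupBy.cumcount. Note that a single threshold only gives a 60:40 split within each group when the groups are the same size, as they are here: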
thresh = int(len(df) * 0.6 / df['Col1'].nunique())
print(thresh)
3
mask = df.groupby('Col1')['Col1'].cumcount() < thresh
train = df[mask]
test = df[~mask]
print(train)
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
6 A 15.23 -1.060 1
print(test)
Col1 Col2 Col3 y
5 B -24.48 -2.54 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1
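As an alternative to the cumcount mask, GroupBy.head with the same fixed threshold selects the first rows of each group while keeping the original row order. A minimal sketch, assuming equal-sized groups and a unique index (df.drop relies on the latter):

```python
import pandas as pd

df = pd.DataFrame({"Col1": ['A','B','B','A','B','B','A','B','A','A'],
                   "Col2": [-2.21,-9.59,0.16,1.29,-31.92,-24.48,15.23,34.58,24.33,-3.32],
                   "Col3": [-0.27,-0.57,0.072,-0.15,-0.21,-2.54,-1.06,1.94,1.83,0.72],
                   "y": [-1,1,-1,-1,-1,1,1,1,1,-1]})

# Same fixed per-group cutoff as above (3 here); assumes equal-sized groups.
thresh = int(len(df) * 0.6 / df['Col1'].nunique())

train = df.groupby('Col1').head(thresh)  # first `thresh` rows of each group, original order kept
test = df.drop(train.index)              # remaining rows; requires a unique index

print(train.index.tolist())  # [0, 1, 2, 3, 4, 6]
print(test.index.tolist())   # [5, 7, 8, 9]
```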