Shi*_*_90 12 python dataframe pandas
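I have a dataframe that records the answers of 19,717 people to a multiple-choice question about which programming languages they use. The first column is the respondent's gender, and the remaining columns are the options they selected. So if I chose Python, my response is recorded in the Python column and not in the Bash column, and vice versa.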
ID Gender Python Bash R JavaScript C++
0 Male Python nan nan JavaScript nan
1 Female nan nan R JavaScript C++
2 Prefer not to say Python Bash nan nan nan
3 Male nan nan nan nan nan
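To try out the snippets in this thread, the sample rows can be rebuilt roughly as follows. This is only a sketch: it assumes that ID is an ordinary column rather than the index, and that an unselected language is stored as NaN.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the table in the question.
# Assumption: NaN marks a language the respondent did not select.
df = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "Gender": ["Male", "Female", "Prefer not to say", "Male"],
    "Python": ["Python", np.nan, "Python", np.nan],
    "Bash": [np.nan, np.nan, "Bash", np.nan],
    "R": [np.nan, "R", np.nan, np.nan],
    "JavaScript": ["JavaScript", "JavaScript", np.nan, np.nan],
    "C++": [np.nan, "C++", np.nan, np.nan],
})
print(df)
```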
What I want is a table that returns the number of responses in each language column, broken down by Gender. So if 5000 men code in Python and 3000 women code in JS, I should get:
Gender Python Bash R JavaScript C++
Male 5000 1000 800 1500 1000
Female 4000 500 1500 3000 800
Prefer Not To Say 2000 ... ... ... 860
I tried a few options:
df.iloc[:, [*range(0, 13)]].stack().value_counts()
Male 16138
Python 12841
SQL 6532
R 4588
Female 3212
Java 2267
C++ 2256
Javascript 2174
Bash 2037
C 1672
MATLAB 1516
Other 1148
TypeScript 389
Prefer not to say 318
None 83
Prefer to self-describe 49
dtype: int64
As described above, this is not what I need. Can this be done in pandas?
Another idea is to apply join to the values along axis 1, then use get_dummies and groupby:
(df.loc[:, 'Python':]
.apply(lambda x: '|'.join(x.dropna()), axis=1)
.str.get_dummies('|')
.groupby(df['Gender']).sum())
[out]
Bash C++ JavaScript Python R
Gender
Female 0 1 1 0 1
Male 0 0 1 1 0
Prefer not to say 1 0 0 1 0
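The chain above can be split into its intermediate steps to see what each call contributes. The following is a self-contained sketch on the four sample rows from the question; the variable names (langs, joined, dummies, counts) are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample rows from the question; NaN is assumed to mean "not selected".
df = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "Gender": ["Male", "Female", "Prefer not to say", "Male"],
    "Python": ["Python", np.nan, "Python", np.nan],
    "Bash": [np.nan, np.nan, "Bash", np.nan],
    "R": [np.nan, "R", np.nan, np.nan],
    "JavaScript": ["JavaScript", "JavaScript", np.nan, np.nan],
    "C++": [np.nan, "C++", np.nan, np.nan],
})

# Step 1: keep only the language columns.
langs = df.loc[:, "Python":]

# Step 2: join each row's selected languages into one '|'-separated string.
# A respondent with no selection ends up with an empty string.
joined = langs.apply(lambda row: "|".join(row.dropna()), axis=1)

# Step 3: split on '|' into one indicator column per language.
dummies = joined.str.get_dummies("|")

# Step 4: add up the indicators within each gender.
counts = dummies.groupby(df["Gender"]).sum()
print(counts)
```

Note that get_dummies orders the language columns alphabetically, which is why the output columns differ in order from the original frame.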
You can set Gender as the index and sum:
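s = df.set_index('Gender').iloc[:, 1:]
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
s.eq(s.columns).astype(int).groupby(level=0, sort=False).sum()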
Output:
Python Bash R JavaScript C++
Gender
Male 1 0 0 1 0
Female 0 0 1 1 1
Prefer not to say 1 1 0 0 0
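Since each non-missing cell simply repeats its column name, a variant that counts non-missing cells should give the same table on the sample data. The sketch below compares the two. The notna() variant is not part of the answer above, and it assumes that any non-missing value in a language column means the language was selected.

```python
import numpy as np
import pandas as pd

# Sample rows from the question; NaN is assumed to mean "not selected".
df = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "Gender": ["Male", "Female", "Prefer not to say", "Male"],
    "Python": ["Python", np.nan, "Python", np.nan],
    "Bash": [np.nan, np.nan, "Bash", np.nan],
    "R": [np.nan, "R", np.nan, np.nan],
    "JavaScript": ["JavaScript", "JavaScript", np.nan, np.nan],
    "C++": [np.nan, "C++", np.nan, np.nan],
})

# Gender becomes the index; only the language columns remain.
langs = df.set_index("Gender").drop(columns="ID")

# Variant 1: a cell counts when it equals its own column name.
by_name = langs.eq(langs.columns).astype(int).groupby(level=0, sort=False).sum()

# Variant 2 (assumption, see above): a cell counts when it is not missing.
by_presence = langs.notna().astype(int).groupby(level=0, sort=False).sum()

print(by_name)
print((by_name.values == by_presence.values).all())
```

The name-based comparison is stricter: a cell whose spelling differs from the column header would not be counted, whereas the presence-based variant would count it.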
You can melt the frame and then use crosstab:
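import numpy as np
import pandas as pd
# Reshape to long format: one row per (respondent, language column)
df1 = pd.melt(df, id_vars=['ID', 'Gender'], var_name='Language', value_name='Choice')
# 1 where the cell repeats its column name (language selected), else 0
df1['Choice'] = np.where(df1['Choice'] == df1['Language'], 1, 0)
final = pd.crosstab(df1['Gender'], df1['Language'], values=df1['Choice'], aggfunc='sum')
print(final)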
Language Bash C++ JavaScript Python R
Gender
Female 0 1 1 0 1
Male 0 0 1 1 0
Prefer not to say 1 0 0 1 0