Stacked bar chart from groupby counts in pandas (Python)

Ace*_*.py 0 python plot numpy matplotlib pandas

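My CSV data looks like the sample below. I want to create a stacked bar chart with pandas/Python in which each bar is split into two colors for the male and female portions, with the total number of males and females taking the drug shown on top of the bar. For example, for age 20 there are 7 people in total, 6 male and 1 female, so the bar should have 7 on top and show the 6:1 split in two colors. I have managed to group people by age and plot that, but I would like the bars to show the two genders in different colors. Any help would be much appreciated. Thanks.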

Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

data = pd.read_csv('/home/MedResearch/Desktop/drug_ids_age_gender.csv')
data.values

# read_csv already returns a DataFrame. The post merged it with a `df1`
# that is never defined, so those merges are dropped here.
df = data
temp1 = df.groupby('Age').Age.count()      # number of patients per age

temp2 = df.groupby('Gender').Age.count()   # number of patients per gender (unused below)

ax = temp1.plot(kind='bar',stacked = False, colormap = 'Paired')
for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=0), (p.get_x()+p.get_width()/2.,   p.get_height()), ha='center', va='center', xytext=(2, 10), textcoords='offset points')
plt.ylabel('Numbers Of Patients Taking the drug')
plt.title('Age Distribution Of Patients Based Upon drug Intake')
plt.show()

This gives the following result:

[image: plot produced by the code above]

Diz*_*ahi 5

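This question comes up often, so I decided to write a step-by-step explanation. Note that I am not a pandas guru, so some things could probably be optimized.

I start by generating the list of ages that I will use for the x-axis: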

cvsdata = '''Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M'''

from io import StringIO  # Python 3; the original post used the Python 2 StringIO module

df = pd.read_csv(StringIO(cvsdata))
ages = df.Age.unique()

array([15, 17, 19, 20, 21, 23, 24])

Then I build a grouped DataFrame containing the count of M and F for each age:

counts = df.groupby(['Age','Gender']).count()
print(counts)

            Drug_ID
Age Gender         
15  F             1
17  M             1
19  M             2
20  F             1
    M             6
21  F             1
    M             3
23  F             3
    M             4
24  F             3
    M             2
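As an aside, the same per-age, per-gender counts can also be obtained in one step with pd.crosstab. This is a minimal sketch, not the method used in this answer; it assumes only the Age and Gender columns matter, so the two lists below re-type those columns from the question's data and leave out Drug_ID.

```python
import pandas as pd

# Age and Gender columns re-typed from the question's CSV (Drug_ID omitted for brevity).
ages_col = [15, 17, 19, 19] + [20] * 7 + [21] * 4 + [23] * 7 + [24] * 5
genders_col = list("FMMM" + "FMMMMMM" + "FMMM" + "FFFMMMM" + "FFFMM")
df = pd.DataFrame({"Age": ages_col, "Gender": genders_col})

# crosstab counts how often each (Age, Gender) pair occurs;
# combinations that never occur are reported as 0 rather than NaN.
table = pd.crosstab(df["Age"], df["Gender"])
print(table)
```

The result already has one column per gender, so it could be passed to the plotting step without the unstack() and fillna() calls.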

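From this, I can easily compute the total number of individuals in each age group:

# sum(level=0) was removed in pandas 2.0; grouping on the first index level is the equivalent
totals = counts.groupby(level=0).sum()
print(totals)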

     Drug_ID
Age         
15         1
17         1
19         2
20         7
21         4
23         7
24         5
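If only the totals are needed, they can also be computed straight from the raw rows. The sketch below compares the two routes on a few made-up rows (not the question's data); the variable names are illustrative only.

```python
import pandas as pd

# Made-up rows for illustration, not the question's data.
df = pd.DataFrame({
    "Age": [20, 20, 20, 21, 21, 23],
    "Gender": ["F", "M", "M", "F", "M", "M"],
})

# Route 1: count per (Age, Gender), then sum over the Gender level of the index.
pair_counts = df.groupby(["Age", "Gender"]).size()
totals_from_counts = pair_counts.groupby(level="Age").sum()

# Route 2: count the rows per Age directly.
totals_direct = df.groupby("Age").size()

print(totals_direct)
```

Both routes give a Series indexed by Age, which avoids the flatten() step needed for a single-column DataFrame.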

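To prepare for plotting, I reshape the counts DataFrame so that each gender becomes a column rather than an index level. I also drop the 'Drug_ID' column level here, because unstack() leaves a MultiIndex on the columns, and a DataFrame is much easier to work with without one.

counts = counts.unstack(level=1)
counts.columns = counts.columns.droplevel(level=0)
print(counts)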

Gender    F    M
Age             
15      1.0  NaN
17      NaN  1.0
19      NaN  2.0
20      1.0  6.0
21      1.0  3.0
23      3.0  4.0
24      3.0  2.0

That looks pretty good. As a last refinement, I replace the NaN values with 0.

counts = counts.fillna(0)
print(counts)

Gender    F    M
Age             
15      1.0  0.0
17      0.0  1.0
19      0.0  2.0
20      1.0  6.0
21      1.0  3.0
23      3.0  4.0
24      3.0  2.0
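The reshape and the NaN replacement can also be folded into one call, since unstack() accepts a fill_value argument. A small sketch on made-up rows (not the question's data), using size() so that there is no 'Drug_ID' column level to drop:

```python
import pandas as pd

# Made-up rows for illustration, not the question's data.
df = pd.DataFrame({
    "Age": [15, 17, 20, 20, 20],
    "Gender": ["F", "M", "F", "M", "M"],
})

# size() returns a Series indexed by (Age, Gender);
# unstack(fill_value=0) turns Gender into columns and fills missing pairs with 0.
counts = df.groupby(["Age", "Gender"]).size().unstack(fill_value=0)
print(counts)
```

Because the gaps are filled with 0 during the reshape, the counts stay integers instead of becoming floats.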

With this DataFrame, drawing the stacked bar chart is straightforward:

plt.bar(ages, counts['M'], bottom=None, color='blue', label='M')
plt.bar(ages, counts['F'], bottom=counts['M'], color='pink', label='F')
plt.legend()
plt.xlabel('Ages')
plt.ylabel('Count')

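To draw the total counts on top of the bars, we use the annotate() function. It cannot be done in a single call, so we loop over ages and totals together (for simplicity I take the values of totals and flatten() them; totals is a single-column DataFrame, so its values come back as a 2-D array).

for age, tot in zip(ages, totals.values.flatten()):
    # bars are centered on x by default in matplotlib 2.0 and later
    plt.annotate('N={:d}'.format(tot), xy=(age, tot), xytext=(0, 5), textcoords='offset points', ha='center', va='bottom')

The annotation is anchored at (age, tot): with the default align='center' in matplotlib 2.0 and later, age is the horizontal center of the bar, and tot is of course the full height of the stacked bar. In older matplotlib versions a bar ran from x to x+width with width=0.8, so the center was at x+0.4 and the anchor had to be (age+0.4, tot). To offset the text slightly, I shift it by a few (5) points in the y direction. Adjust to taste.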

See the documentation for bar() to adjust the bar parameters, and the documentation for annotate() to customize the annotations.

[image: plot produced by the code above]
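For comparison, pandas can draw the stacked bars itself through DataFrame.plot. The following is a hedged sketch rather than the approach used above: the helper name plot_stacked_counts is hypothetical, and the count table is typed in by hand from the fillna() output shown earlier.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_stacked_counts(counts):
    """Draw stacked bars from an Age x Gender count table and label each bar's total."""
    # One color per column, in column order (F, then M).
    ax = counts.plot(kind="bar", stacked=True, color=["pink", "blue"], rot=0)
    totals = counts.sum(axis=1)
    # pandas places the bars at integer positions 0, 1, 2, ... and uses the
    # index values (the ages) only as tick labels, so annotate at the position.
    for pos, tot in enumerate(totals):
        ax.annotate("N={:d}".format(int(tot)), xy=(pos, tot), xytext=(0, 5),
                    textcoords="offset points", ha="center", va="bottom")
    ax.set_xlabel("Ages")
    ax.set_ylabel("Count")
    return ax


# Count table copied by hand from the fillna() output above.
counts = pd.DataFrame(
    {"F": [1, 0, 0, 1, 1, 3, 3], "M": [0, 1, 2, 6, 3, 4, 2]},
    index=pd.Index([15, 17, 19, 20, 21, 23, 24], name="Age"),
)
ax = plot_stacked_counts(counts)
```

Call plt.show() afterwards to display the figure. Unlike plt.bar(ages, ...), this spaces the bars evenly regardless of the gaps between ages (for example, no empty slot for ages 16, 18 or 22), which may or may not be what you want.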