Python：使用类别和标记大小绘制散点图

Nip*_*tra 6 python matplotlib ggplot2 pandas seaborn

我正在尝试用 Python从这里选择一个 R ggplot2 图。我正在查看相关散点图，如下所示。

导入数据

import pandas as pd
midwest= pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest.csv")

Run Code Online (Sandbox Code Playgroud)

默认 Pandas 散点图

midwest.plot(kind='scatter', x='area', y='poptotal', ylim=((0, 50000)), xlim=((0., 0.1)))

Run Code Online (Sandbox Code Playgroud)

上面的代码本身不会对不同的类别进行颜色编码，而是如下所示。

Pandas Groupby + 散点图

但是，我们可以按“状态”对数据框进行分组，然后为每个组（ref）单独绘制散点图。

fig, ax = plt.subplots()
groups = midwest.groupby('state')
for name, group in groups:
    ax.plot(group.area, group.poptotal, marker='o', linestyle='', ms=10, 
label=name)
ax.legend(numpoints=1)
ax.set_ylim((0, 500000))

Run Code Online (Sandbox Code Playgroud)

虽然这确实让我们在散点图中得到了不同的类别，但它并没有让它们的大小增加popdensity.

Seaborn 配对图

 import seaborn as sns
sns.pairplot(x_vars=["area"], y_vars=["poptotal"], data=midwest, 
hue="state", size=5)
plt.gca().set_ylim((0, 50000))

Run Code Online (Sandbox Code Playgroud)

同样，这仅按类别绘制散点图。但是，我们仍然没有标记大小popdensity

Matplotlib

下面是我们如何深入到每个数据点并在 Matplotlib 中绘制绘图。

fig, ax = plt.subplots()
groups = midwest.groupby('state')
min_popdensity, max_popdensity = midwest['popdensity'].min(), midwest['popdensity'].max()
for name, group in groups:
    for data_point in group.itertuples():
        ax.plot(data_point.area, data_point.poptotal, marker='o', linestyle='', 
                   ms=1+12*((max_popdensity-data_point.popdensity)/(max_popdensity-min_popdensity)), label=name)
ax.set_ylim((0, 500000))

Run Code Online (Sandbox Code Playgroud)

这会产生一个与目标图几乎相似的图。

问题

我们如何在不做popdensity所有繁重工作的情况下根据点获得标记大小（例如单独绘制每个点）？
我们如何添加 ggplot 可视化中显示的平滑线。

附加信息

这是head中西部数据框的。

PID county  state   area    poptotal    popdensity  popwhite    popblack    popamerindian   popasian    ... percollege  percprof    poppovertyknown percpovertyknown    percbelowpoverty    percchildbelowpovert    percadultpoverty    percelderlypoverty  inmetro category
0   561 ADAMS   IL  0.052   66090   1270.961540 63917   1702    98  249 ... 19.631392   4.355859    63628   96.274777   13.151443   18.011717   11.009776   12.443812   0   AAR
1   562 ALEXANDER   IL  0.014   10626   759.000000  7054    3496    19  48  ... 11.243308   2.870315    10529   99.087145   32.244278   45.826514   27.385647   25.228976   0   LHR
2   563 BOND    IL  0.022   14991   681.409091  14477   429 35  16  ... 17.033819   4.488572    14235   94.956974   12.068844   14.036061   10.852090   12.697410   0   AAR
3   564 BOONE   IL  0.017   30806   1812.117650 29344   127 46  150 ... 17.278954   4.197800    30337   98.477569   7.209019    11.179536   5.536013    6.217047    1   ALU
4   565 BROWN   IL  0.018   5836    324.222222  5264    547 14  5   ... 14.475999   3.367680    4815    82.505140   13.520249   13.022889   11.143211   19.200000   0   AAR

Run Code Online (Sandbox Code Playgroud)

而且，这是原始帖子中使用的 ggplot2 代码。

options(scipen=999)  # turn-off scientific notation like 1e+48
library(ggplot2)
theme_set(theme_bw())  # pre-set the bw theme.
data("midwest", package = "ggplot2")


# Scatterplot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state, size=popdensity)) + 
  geom_smooth(method="loess", se=F) + 
  xlim(c(0, 0.1)) + 
  ylim(c(0, 500000)) + 
  labs(subtitle="Area Vs Population", 
       y="Population", 
       x="Area", 
       title="Scatterplot", 
       caption = "Source: midwest")

plot(gg)

Run Code Online (Sandbox Code Playgroud)

编辑

我不知道该问题是否会重新打开（标记为重复）。与此同时，这里有一个 Pandas 唯一的答案，效果相当好。

fig, ax = plt.subplots()
groups = midwest.groupby('state')
colors = ['b','g','r','y','k']
for i, (name, group) in enumerate(groups):
    group.plot(kind='scatter', x='area', y='poptotal', ylim=((0, 50000)), xlim=((0., 0.1)), s=10+group['popdensity']*0.01, label=name, ax=ax, color=colors[i])
lgd = ax.legend(numpoints=1)
for handle in lgd.legendHandles:
    handle.set_sizes([100.0])
ax.set_ylim((0, 500000))

Run Code Online (Sandbox Code Playgroud)

编辑 2

/sf/users/259532521/在评论中提到的以下答案似乎使用 seaborn 解决了问题。

sizes = [10, 40, 70, 100, 130] 
marker_size = pd.cut(4*midwest['popdensity'], [0, 20000, 40000, 60000, 80000, 1000000], labels=sizes) 
sns.lmplot('area', 'poptotal', data=midwest, hue='state', fit_reg=False, scatter_kws={'s':marker_size})
plt.ylim((0, 500000))

Run Code Online (Sandbox Code Playgroud)

归档时间：	8 年，4 月前
查看次数：	8686 次
最近记录：	8 年，4 月前