Is there a query method for pandas Series (pandas.Series.query()), or something similar?

dme*_*meu 17 python series method-chaining dataframe pandas

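The pandas.DataFrame.query() method is very handy for (pre/post-)filtering data when loading or plotting. It is especially convenient for method chaining.

I often find myself wanting to apply the same logic to a pandas.Series, for example after a method such as df.value_counts that returns a pandas.Series.

Suppose there is a huge table with the columns Player, Game, Points, and I want to plot a histogram of the players who scored 3 points more than 14 times. I first have to aggregate per player (groupby -> agg), which returns a Series of about 1000 players and their totals. With .query logic applied, it would look like this: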

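import random
import pandas as pd

df = pd.DataFrame({
    'Points': [random.choice([1,3]) for x in range(100)], 
    'Player': [random.choice(["A","B","C"]) for x in range(100)]})

# Wished-for syntax: Series has no .query() method, so this chain fails as written
(df
     .query("Points == 3")
     .Player.value_counts()
     .query("> 14")
     .hist())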

The only solution I have found forces me to make an unnecessary assignment and break the method chain:

points_series = (df
     .query("Points == 3")
     .groupby("Player").size())
points_series[points_series > 100].hist()

Method chaining together with the query method helps keep the code readable, whereas filtering by subsetting can quickly get messy.

# just to make my point :)
series_bestplayers_under_100[series_prefiltered_under_100 > 0].shape

Please help me out of this dilemma! Thanks

jez*_*ael 9

IIUC, you can add query("Points > 100"):

df = pd.DataFrame({'Points':[50,20,38,90,0, np.inf],
                   'Player':['a','a','a','s','s','s']})

print (df)
  Player     Points
0      a  50.000000
1      a  20.000000
2      a  38.000000
3      s  90.000000
4      s   0.000000
5      s        inf

points_series = df.query("Points < inf").groupby("Player").agg({"Points": "sum"})['Points']
print (points_series)     
a = points_series[points_series > 100]
print (a)     
Player
a    108.0
Name: Points, dtype: float64


points_series = (df.query("Points < inf")
                   .groupby("Player")
                   .agg({"Points": "sum"})
                   .query("Points > 100"))

print (points_series)     
        Points
Player        
a        108.0

Another solution is selection by callable:

points_series = (df.query("Points < inf")
                   .groupby("Player")
                   .agg({"Points": "sum"})['Points']
                   .loc[lambda x: x > 100])

print (points_series)     
Player
a    108.0
Name: Points, dtype: float64
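To make the callable form easier to try in isolation, here is a minimal sketch. The player labels and totals are made up for illustration and are not the data from the answer above. The function passed to .loc receives the Series as it exists at that point in the chain and returns a boolean mask.

```python
import pandas as pd

# Hypothetical per-player totals, invented for this sketch
totals = pd.Series({"a": 108.0, "s": 90.0, "t": 250.0}, name="Points")

# .loc calls the lambda with the Series itself and keeps the entries
# where the returned boolean mask is True
over_100 = totals.loc[lambda s: s > 100]
print(over_100)
```

Because the lambda only refers to its argument, no intermediate variable is needed and the expression can sit anywhere in a chain.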

EDIT, answering the edited question:

np.random.seed(1234)
df = pd.DataFrame({
    'Points': [np.random.choice([1,3]) for x in range(100)], 
    'Player': [np.random.choice(["A","B","C"]) for x in range(100)]})

print (df.query("Points == 3").Player.value_counts().loc[lambda x: x > 15])
C    19
B    16
Name: Player, dtype: int64

print (df.query("Points == 3").groupby("Player").size().loc[lambda x: x > 15])
Player
B    16
C    19
dtype: int64


Mar*_*tin 5

Why not convert the Series to a DataFrame, run the query, and then convert back?

df["Points"] = df["Points"].to_frame().query('Points > 100')["Points"]

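Here, .to_frame() converts to a DataFrame, and the trailing ["Points"] converts back to a Series.

This way the .query() method can be used consistently, whether the pandas object has one column or more.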
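The same round trip can probably be kept inside a single chain for the counting case from the question. The sketch below rests on some assumptions: the small DataFrame is made up, and the column name "n" is an arbitrary label passed to to_frame() so that the query string has something to refer to.

```python
import pandas as pd

# Small made-up data set in the shape used by the question
df = pd.DataFrame({
    "Points": [3, 3, 1, 3, 3, 1, 3],
    "Player": ["A", "A", "A", "B", "A", "B", "C"],
})

result = (
    df.query("Points == 3")
      .groupby("Player").size()   # unnamed Series indexed by Player
      .to_frame("n")              # "n" is an arbitrary column name chosen here
      .query("n > 1")             # filter with the usual query syntax
      ["n"]                       # select the column to get a Series back
)
print(result)
```

Naming the column explicitly avoids relying on whatever name the Series happens to carry, which matters for results such as groupby(...).size() that have no name.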


Ily*_*kin 5

You can use pipe instead of query:

s.pipe(lambda x: x[x>0]).pipe(lambda x: x[x<10])
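If the lambdas get repetitive, pipe also accepts a named function plus extra arguments. A minimal sketch, where between_exclusive is a hypothetical helper written for this example and not part of pandas:

```python
import pandas as pd

def between_exclusive(s, low, high):
    """Hypothetical helper: keep values strictly between low and high."""
    return s[(s > low) & (s < high)]

s = pd.Series([-5, 0, 3, 7, 10, 12])

# pipe passes the Series as the first argument, followed by 0 and 10
filtered = s.pipe(between_exclusive, 0, 10)
print(filtered.tolist())  # [3, 7]
```

This is equivalent to the two chained lambdas above, with the bounds spelled out as parameters.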