How to select rows in a DataFrame between two values in Python Pandas?

use*_*983 79 python pandas

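I am trying to modify a DataFrame df so that it contains only the rows whose values in the column closing_price are between 99 and 101, and I tried to do this with the code below.

However, I get the error

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

and I am wondering if there is a way to do this without using loops.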

df = df[(99 <= df['closing_price'] <= 101)]

Par*_*ait 108

Also consider Series.between:

df = df[df['closing_price'].between(99, 101)]

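  • @dsugasa e.g. `df = df[~df['closing_price'].between(99, 101)]` (6 upvotes)
  • The option `inclusive=True` is used by default in `between`, so you can query like this: `df = df[df['closing_price'].between(99, 101)]` (5 upvotes)
  • This is the best answer! Nice! (3 upvotes)
  • Is there a "not between" feature in pandas? I have not found one. (3 upvotes)
  • @dsugasa, use the [tilde operator](/sf/ask/3223802291/) together with `between`. (2 upvotes)
  • This should be the accepted answer. tnx (2 upvotes)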
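
The comments above on `between` and the tilde operator can be sketched as follows. This is a minimal illustration on made-up sample data, not part of the original answer. It relies on the default behaviour of `between`, which includes both endpoints, and so avoids passing `inclusive` explicitly (the values that parameter accepts have changed across pandas versions).

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"closing_price": [98, 99, 100, 101, 102]})

# Boolean mask: True where 99 <= closing_price <= 101 (both ends included by default).
in_range = df["closing_price"].between(99, 101)

inside = df[in_range]    # rows within the range
outside = df[~in_range]  # "not between": invert the mask with ~

print(inside["closing_price"].tolist())   # [99, 100, 101]
print(outside["closing_price"].tolist())  # [98, 102]
```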

Jia*_* Li 77

You should use () to group your boolean vectors to remove the ambiguity.

df = df[(df['closing_price'] >= 99) & (df['closing_price'] <= 101)]


Max*_*axU 19

There is a nicer alternative: use the query() method:

In [58]: df = pd.DataFrame({'closing_price': np.random.randint(95, 105, 10)})

In [59]: df
Out[59]:
   closing_price
0            104
1             99
2             98
3             95
4            103
5            101
6            101
7             99
8             95
9             96

In [60]: df.query('99 <= closing_price <= 101')
Out[60]:
   closing_price
1             99
5            101
6            101
7             99

Update: answering the comment:

I like the syntax here, but it fell down when I tried to combine it with an expression: df.query('(mean + 2 *sd) <= closing_price <=(mean + 2 *sd)')

In [161]: qry = "(closing_price.mean() - 2*closing_price.std())" +\
     ...:       " <= closing_price <= " + \
     ...:       "(closing_price.mean() + 2*closing_price.std())"
     ...:

In [162]: df.query(qry)
Out[162]:
   closing_price
0             97
1            101
2             97
3             95
4            100
5             99
6            100
7            101
8             99
9             95

  • @ManojKumar, `df.query('closing_price.between(99, 101, inclusive=True)', engine="python")` - but this will be slower compared to the "numexpr" engine. (2 upvotes)
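
As a variation on the update above, the bounds can be computed in Python first and then referenced inside the query string with the `@` prefix. The following is only a sketch on made-up data. The names `lower`, `upper`, `within` and `expected` are illustrative and do not come from the original answer.

```python
import pandas as pd

# Hypothetical sample data, for illustration only (150 is a deliberate outlier).
df = pd.DataFrame({"closing_price": [95, 97, 99, 100, 101, 103, 150]})

# Compute the bounds once in Python, then reference them with @ inside query().
mean = df["closing_price"].mean()
sd = df["closing_price"].std()
lower, upper = mean - 2 * sd, mean + 2 * sd

within = df.query("@lower <= closing_price <= @upper")

# The same selection written with Series.between, as a cross-check.
expected = df[df["closing_price"].between(lower, upper)]
print(within.equals(expected))
```

Keeping the arithmetic outside the query string makes the expression shorter and avoids recomputing the mean and standard deviation inside it.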

小智 11

You can also use the .between() method

emp = pd.read_csv("C:\\py\\programs\\pandas_2\\pandas\\employees.csv")

emp[emp["Salary"].between(60000, 61000)]



cra*_*WAI 8

newdf = df.query('closing_price.mean() <= closing_price <= closing_price.std()')

or

# closing_price is a column of df, so it has to be accessed through the frame
mean = df['closing_price'].mean()
std = df['closing_price'].std()

newdf = df.query('@mean <= closing_price <= @std')


nor*_*ius 6

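If one has to call pd.Series.between(l, r) repeatedly (for different bounds l and r), a lot of work is repeated unnecessarily. In this case, it is beneficial to sort the frame/series once and then use pd.Series.searchsorted(). I measured a speedup of up to 25x, see below.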

def between_indices(x, lower, upper, inclusive=True):
    """
    Returns smallest and largest index i for which holds
    lower <= x[i] <= upper, under the assumption that x is sorted.
    """
    i = x.searchsorted(lower, side="left" if inclusive else "right")
    j = x.searchsorted(upper, side="right" if inclusive else "left")
    return i, j

# Sort x once before repeated calls of between()
x = x.sort_values().reset_index(drop=True)
# x = x.sort_values(ignore_index=True) # for pandas>=1.0
ret1 = between_indices(x, lower=0.1, upper=0.9)
ret2 = between_indices(x, lower=0.2, upper=0.8)
ret3 = ...
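
To make the role of the `side` argument concrete, here is a small worked example of the same helper on a tiny, already sorted Series with a repeated value. The sample data and the variable names `i`, `j`, `k`, `m` are made up for illustration.

```python
import pandas as pd

def between_indices(x, lower, upper, inclusive=True):
    # Same helper as in the answer; assumes x is sorted in ascending order.
    i = x.searchsorted(lower, side="left" if inclusive else "right")
    j = x.searchsorted(upper, side="right" if inclusive else "left")
    return i, j

# Hypothetical, already sorted sample with a repeated value.
x = pd.Series([1, 2, 2, 3, 5])

# Closed interval [2, 3]: the slice starts at the first 2 and ends after the 3.
i, j = between_indices(x, lower=2, upper=3)
# Open interval (2, 3): no element lies strictly between 2 and 3.
k, m = between_indices(x, lower=2, upper=3, inclusive=False)

print(x.iloc[i:j].tolist())  # [2, 2, 3]
print(x.iloc[k:m].tolist())  # []
```

Note that the pair returned is a half-open slice `[i, j)` suitable for `iloc`, so `j` is one past the last matching position.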

Benchmark


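This measures repeated evaluations (n_reps=100) of pd.Series.between() and of the method based on pd.Series.searchsorted(), for different arguments lower and upper. On my MacBook Pro 2015 with Python v3.8.0 and Pandas v1.0.3, the code below produces the following output: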

# pd.Series.searchsorted()
# 5.87 ms ± 321 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# pd.Series.between(lower, upper)
# 155 ms ± 6.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Logical expressions: (x>=lower) & (x<=upper)
# 153 ms ± 3.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

import numpy as np
import pandas as pd

def between_indices(x, lower, upper, inclusive=True):
    # Assumption: x is sorted.
    i = x.searchsorted(lower, side="left" if inclusive else "right")
    j = x.searchsorted(upper, side="right" if inclusive else "left")
    return i, j

def between_fast(x, lower, upper, inclusive=True):
    """
    Equivalent to pd.Series.between() under the assumption that x is sorted.
    """
    i, j = between_indices(x, lower, upper, inclusive)
    if True:
        return x.iloc[i:j]
    else:
        # Mask creation is slow.
        mask = np.zeros_like(x, dtype=bool)
        mask[i:j] = True
        mask = pd.Series(mask, index=x.index)
        return x[mask]

def between(x, lower, upper, inclusive=True):
    mask = x.between(lower, upper, inclusive=inclusive)
    return x[mask]

def between_expr(x, lower, upper, inclusive=True):
    if inclusive:
        mask = (x>=lower) & (x<=upper)
    else:
        mask = (x>lower) & (x<upper)
    return x[mask]

def benchmark(func, x, lowers, uppers):
    for l,u in zip(lowers, uppers):
        func(x,lower=l,upper=u)

n_samples = 1000
n_reps = 100
x = pd.Series(np.random.randn(n_samples))
# Sort the Series.
# For pandas>=1.0:
# x = x.sort_values(ignore_index=True)
x = x.sort_values().reset_index(drop=True)

# Assert equivalence of different methods.
assert(between_fast(x, 0, 1, True ).equals(between(x, 0, 1, True)))
assert(between_expr(x, 0, 1, True ).equals(between(x, 0, 1, True)))
assert(between_fast(x, 0, 1, False).equals(between(x, 0, 1, False)))
assert(between_expr(x, 0, 1, False).equals(between(x, 0, 1, False)))

# Benchmark repeated evaluations of between().
uppers = np.linspace(0, 3, n_reps)
lowers = -uppers
%timeit benchmark(between_fast, x, lowers, uppers)
%timeit benchmark(between, x, lowers, uppers)
%timeit benchmark(between_expr, x, lowers, uppers)


小智 5

Instead of this

df = df[99 <= df['closing_price'] <= 101]

you should use this

df = df[(99 <= df['closing_price']) & (df['closing_price'] <= 101)]

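We have to use NumPy's bitwise logic operators |, &, ~, ^ for compound queries. The parentheses also matter because of operator precedence.

For more information, you can visit the link: Comparisons, Masks, and Boolean Logic (from Jake VanderPlas's Python Data Science Handbook).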
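
The two points above (chained comparisons fail, and parentheses matter) can be checked with a short sketch on made-up data. The flag names `chained_ok` and `unparenthesized_ok` are illustrative only.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
s = pd.Series([98, 99, 100, 101, 102], name="closing_price")

# `99 <= s <= 101` is evaluated as `(99 <= s) and (s <= 101)`.
# `and` needs a single True/False from the left-hand Series, which pandas refuses to give.
try:
    99 <= s <= 101
    chained_ok = True
except ValueError:
    chained_ok = False

# Element-wise AND with parentheses around each comparison works.
mask = (99 <= s) & (s <= 101)

# Without parentheses, & binds tighter than the comparisons, so
# `s >= 99 & s <= 101` is read as `s >= (99 & s) <= 101`: a chained comparison again.
try:
    s >= 99 & s <= 101
    unparenthesized_ok = True
except ValueError:
    unparenthesized_ok = False

print(chained_ok, unparenthesized_ok, s[mask].tolist())  # False False [99, 100, 101]
```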


spa*_*row 5

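If you are dealing with multiple values and multiple inputs, you could also set up an apply function like this. In this case, it filters a DataFrame for GPS locations that fall within certain ranges.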

def filter_values(lat,lon):
    if abs(lat - 33.77) < .01 and abs(lon - -118.16) < .01:
        return True
    elif abs(lat - 37.79) < .01 and abs(lon - -122.39) < .01:
        return True
    else:
        return False


df = df[df.apply(lambda x: filter_values(x['lat'],x['lon']),axis=1)]
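
For larger frames, the same filter can probably be written without a row-wise apply by building one boolean mask per target location and combining them. This is a sketch of that alternative, not the answer's own method. The sample coordinates and the names `near_first`, `near_second` and `filtered` are made up.

```python
import pandas as pd

# Hypothetical sample coordinates, for illustration only.
df = pd.DataFrame({
    "lat": [33.771, 37.795, 40.000, 33.770],
    "lon": [-118.162, -122.388, -100.000, -122.390],
})

# One boolean mask per target location; both lat and lon must be within 0.01.
near_first = ((df["lat"] - 33.77).abs() < 0.01) & ((df["lon"] - -118.16).abs() < 0.01)
near_second = ((df["lat"] - 37.79).abs() < 0.01) & ((df["lon"] - -122.39).abs() < 0.01)

# Keep rows that are close to either location (element-wise OR).
filtered = df[near_first | near_second]
print(filtered.index.tolist())  # [0, 1]
```

The last sample row matches the first location's latitude and the second location's longitude, so it is correctly rejected by both masks.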