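I am trying to modify a DataFrame df so that it only contains rows whose values in the column closing_price are between 99 and 101, and I am trying to do this with the code below.
However, I get the error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
I am wondering if there is a way to do this without using loops.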
df = df[(99 <= df['closing_price'] <= 101)]
Par*_*ait 108
Also consider Series.between():
df = df[df['closing_price'].between(99, 101)]
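To make the between() suggestion concrete, here is a minimal sketch using the same ten sample values that appear in the query() answer further down. It relies only on the default behaviour of Series.between(), which includes both endpoints; the variable names mask and filtered are just for illustration.

```python
import pandas as pd

# Sample data: the ten closing prices shown in the query() answer.
df = pd.DataFrame(
    {"closing_price": [104, 99, 98, 95, 103, 101, 101, 99, 95, 96]}
)

# between() returns a boolean mask; by default both bounds are inclusive.
mask = df["closing_price"].between(99, 101)

# Boolean indexing keeps only the rows where the mask is True.
filtered = df[mask]
print(filtered)
```

The rows kept are the ones at index 1, 5, 6 and 7, the same rows the query() answer returns for this data.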
Jia*_* Li 77
You should use () to group your boolean vectors to remove the ambiguity.
df = df[(df['closing_price'] >= 99) & (df['closing_price'] <= 101)]
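For anyone wondering where the error comes from: Python evaluates a chained comparison such as 99 <= s <= 101 as (99 <= s) and (s <= 101), and `and` needs a single True/False, which a Series cannot provide. The sketch below reproduces that on a small made-up Series and then applies the parenthesized & form; the names s, chained_failed and result are only for this example.

```python
import pandas as pd

s = pd.Series([98, 99, 100, 101, 102], name="closing_price")

# The chained form becomes (99 <= s) and (s <= 101). `and` calls bool()
# on the first mask, and pandas raises rather than guess what was meant.
try:
    99 <= s <= 101
    chained_failed = False
except ValueError:
    chained_failed = True

# & combines the two boolean masks element by element instead.
mask = (s >= 99) & (s <= 101)
result = s[mask].tolist()

print(chained_failed, result)
```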
Max*_*axU 19
There is a nicer alternative: use the query() method:
In [58]: df = pd.DataFrame({'closing_price': np.random.randint(95, 105, 10)})
In [59]: df
Out[59]:
closing_price
0 104
1 99
2 98
3 95
4 103
5 101
6 101
7 99
8 95
9 96
In [60]: df.query('99 <= closing_price <= 101')
Out[60]:
closing_price
1 99
5 101
6 101
7 99
UPDATE: answering the comment:
I like the syntax here but fell down when trying to combine with expression;
df.query('(mean + 2 *sd) <= closing_price <=(mean + 2 *sd)')
In [161]: qry = "(closing_price.mean() - 2*closing_price.std())" +\
...: " <= closing_price <= " + \
...: "(closing_price.mean() + 2*closing_price.std())"
...:
In [162]: df.query(qry)
Out[162]:
closing_price
0 97
1 101
2 97
3 95
4 100
5 99
6 100
7 101
8 99
9 95
小智 11
You can also use the .between() method
emp = pd.read_csv("C:\\py\\programs\\pandas_2\\pandas\\employees.csv")
emp[emp["Salary"].between(60000, 61000)]
Output: (figure omitted)
newdf = df.query('closing_price.mean() <= closing_price <= closing_price.std()')
or
mean = df['closing_price'].mean()
std = df['closing_price'].std()
newdf = df.query('@mean <= closing_price <= @std')
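A small sketch of the @ syntax, for the "mean plus or minus two standard deviations" filter asked about in the comment above. The data is made up (ten prices near 100 and one outlier at 150), and lower and upper are hypothetical local variable names; query() looks up anything prefixed with @ in the calling scope.

```python
import pandas as pd

# Made-up data: ten prices near 100 plus one obvious outlier.
df = pd.DataFrame(
    {"closing_price": [95, 97, 99, 100, 101, 99, 100, 101, 98, 100, 150]}
)

# Compute the bounds once as ordinary Python variables.
mean = df["closing_price"].mean()
sd = df["closing_price"].std()
lower = mean - 2 * sd
upper = mean + 2 * sd

# Inside query(), @name refers to a local variable, bare names to columns.
newdf = df.query("@lower <= closing_price <= @upper")
print(newdf)
```

Computing the bounds outside the query string keeps the expression short and avoids recomputing mean() and std() inside it.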
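If you have to call pd.Series.between(l, r) repeatedly (for different bounds l and r), a lot of work is repeated unnecessarily. In that case, it pays to sort the frame/series once and then use pd.Series.searchsorted(). I measured a speedup of up to 25x, see below.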
def between_indices(x, lower, upper, inclusive=True):
    """
    Returns smallest and largest index i for which holds
    lower <= x[i] <= upper, under the assumption that x is sorted.
    """
    i = x.searchsorted(lower, side="left" if inclusive else "right")
    j = x.searchsorted(upper, side="right" if inclusive else "left")
    return i, j

# Sort x once before repeated calls of between()
x = x.sort_values().reset_index(drop=True)
# x = x.sort_values(ignore_index=True) # for pandas>=1.0
ret1 = between_indices(x, lower=0.1, upper=0.9)
ret2 = between_indices(x, lower=0.2, upper=0.8)
ret3 = ...

Benchmark

Measure repeated evaluations (n_reps=100) of pd.Series.between() as well as the method based on pd.Series.searchsorted(), for different arguments lower and upper. On my MacBook Pro 2015 with Python v3.8.0 and Pandas v1.0.3, the code below produces the following output:

# pd.Series.searchsorted()
# 5.87 ms ± 321 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# pd.Series.between(lower, upper)
# 155 ms ± 6.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Logical expressions: (x>=lower) & (x<=upper)
# 153 ms ± 3.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

import numpy as np
import pandas as pd

def between_indices(x, lower, upper, inclusive=True):
    # Assumption: x is sorted.
    i = x.searchsorted(lower, side="left" if inclusive else "right")
    j = x.searchsorted(upper, side="right" if inclusive else "left")
    return i, j

def between_fast(x, lower, upper, inclusive=True):
    """
    Equivalent to pd.Series.between() under the assumption that x is sorted.
    """
    i, j = between_indices(x, lower, upper, inclusive)
    if True:
        return x.iloc[i:j]
    else:
        # Mask creation is slow.
        mask = np.zeros_like(x, dtype=bool)
        mask[i:j] = True
        mask = pd.Series(mask, index=x.index)
        return x[mask]

def between(x, lower, upper, inclusive=True):
    # Note: written for pandas 1.0.x, where inclusive is a bool.
    # pandas >= 1.3 expects inclusive="both" / "neither" here instead.
    mask = x.between(lower, upper, inclusive=inclusive)
    return x[mask]

def between_expr(x, lower, upper, inclusive=True):
    if inclusive:
        mask = (x>=lower) & (x<=upper)
    else:
        mask = (x>lower) & (x<upper)
    return x[mask]

def benchmark(func, x, lowers, uppers):
    for l,u in zip(lowers, uppers):
        func(x,lower=l,upper=u)

n_samples = 1000
n_reps = 100
x = pd.Series(np.random.randn(n_samples))
# Sort the Series.
# For pandas>=1.0:
# x = x.sort_values(ignore_index=True)
x = x.sort_values().reset_index(drop=True)

# Assert equivalence of different methods.
assert(between_fast(x, 0, 1, True ).equals(between(x, 0, 1, True)))
assert(between_expr(x, 0, 1, True ).equals(between(x, 0, 1, True)))
assert(between_fast(x, 0, 1, False).equals(between(x, 0, 1, False)))
assert(between_expr(x, 0, 1, False).equals(between(x, 0, 1, False)))

# Benchmark repeated evaluations of between().
uppers = np.linspace(0, 3, n_reps)
lowers = -uppers
%timeit benchmark(between_fast, x, lowers, uppers)
%timeit benchmark(between, x, lowers, uppers)
%timeit benchmark(between_expr, x, lowers, uppers)
小智 5
Instead of this
df = df[99 <= df['closing_price'] <= 101]
you should use this
df = df[(99 <= df['closing_price']) & (df['closing_price'] <= 101)]
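We have to use NumPy's bitwise logic operators |, &, ~, ^ for compound queries. Parentheses also matter because of operator precedence.
For more information, see: Comparisons, Masks, and Boolean Logic (from the Python Data Science Handbook by Jake VanderPlas).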
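To illustrate the precedence point: & binds more tightly than >= and <=, so leaving out the parentheses turns the expression into s >= (99 & s) <= 101, which is a chained comparison again and fails the same way. The sketch below shows that on a small made-up Series, along with ~ and | for the complementary "outside the range" filter; all variable names are just for this example.

```python
import pandas as pd

s = pd.Series([98, 99, 100, 101, 102])

# Without parentheses this parses as  s >= (99 & s) <= 101,
# a chained comparison, so pandas raises the "ambiguous" ValueError.
try:
    s >= 99 & s <= 101
    unparenthesized_ok = True
except ValueError:
    unparenthesized_ok = False

# With parentheses each comparison is evaluated first, then combined.
inside = (s >= 99) & (s <= 101)

# ~ negates a mask; | is the element-wise "or". Both give the same rows.
outside_not = s[~inside].tolist()
outside_or = s[(s < 99) | (s > 101)].tolist()

print(unparenthesized_ok, outside_not, outside_or)
```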
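If you are dealing with multiple values and multiple inputs, you could also set up an apply function like this. In this case, it filters a DataFrame for GPS locations that fall within certain ranges.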
def filter_values(lat,lon):
    if abs(lat - 33.77) < .01 and abs(lon - -118.16) < .01:
        return True
    elif abs(lat - 37.79) < .01 and abs(lon - -122.39) < .01:
        return True
    else:
        return False

df = df[df.apply(lambda x: filter_values(x['lat'],x['lon']),axis=1)]
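As a possible alternative to the row-wise apply, the same GPS filter can be written with boolean masks, which avoids calling a Python function once per row. This is only a sketch: the helper near() and its tol parameter are hypothetical names introduced here, and the four sample coordinates are made up.

```python
import pandas as pd

# Made-up sample points; only the first two lie near a target location.
df = pd.DataFrame({
    "lat": [33.771, 37.795, 40.000, 33.765],
    "lon": [-118.161, -122.391, -100.000, -118.300],
})

def near(frame, lat, lon, tol=0.01):
    """Hypothetical helper: mask of rows within tol degrees of (lat, lon)."""
    return ((frame["lat"] - lat).abs() < tol) & ((frame["lon"] - lon).abs() < tol)

# Combine one mask per target location with the element-wise "or".
mask = near(df, 33.77, -118.16) | near(df, 37.79, -122.39)
filtered = df[mask]
print(filtered)
```

On large frames this form is usually faster than apply with axis=1, though I have not benchmarked it here.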