Finding a string in multiple columns?

Question

ste*_*boc 1 pandas

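I have a DataFrame with three columns, tel1, tel2 and tel3, and I want to keep the rows that contain a specific value in one or more of those columns.

For example, I want to keep the rows where column tel1, tel2 or tel3 starts with "06".

How can I do this? Thanks.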

Answer 1

unu*_*tbu 6

Let's use this df as an example DataFrame:

In [54]: df = pd.DataFrame({'tel{}'.format(j): 
                            ['{:02d}'.format(i+j) 
                             for i in range(10)] for j in range(3)})

In [71]: df
Out[71]: 
  tel0 tel1 tel2
0   00   01   02
1   01   02   03
2   02   03   04
3   03   04   05
4   04   05   06
5   05   06   07
6   06   07   08
7   07   08   09
8   08   09   10
9   09   10   11

You can use StringMethods.startswith to find the values in df['tel0'] that start with '06':

In [72]: df['tel0'].str.startswith('06')
Out[72]: 
0    False
1    False
2    False
3    False
4    False
5    False
6     True
7    False
8    False
9    False
Name: tel0, dtype: bool

To combine two boolean Series with logical OR, use |:

In [73]: df['tel0'].str.startswith('06') | df['tel1'].str.startswith('06')
Out[73]: 
0    False
1    False
2    False
3    False
4    False
5     True
6     True
7    False
8    False
9    False
dtype: bool

Or, if you want to combine a whole list of boolean Series with logical OR, you can use reduce:

In [79]: import functools
In [80]: import numpy as np
In [80]: mask = functools.reduce(np.logical_or, [df['tel{}'.format(i)].str.startswith('06') for i in range(3)])

In [81]: mask
Out[81]: 
0    False
1    False
2    False
3    False
4     True
5     True
6     True
7    False
8    False
9    False
Name: tel0, dtype: bool

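As an alternative to reduce, the same mask can probably be built by applying the string test to each column and then asking whether any column matched in each row. This is only a sketch and not part of the original answer; the `cols` list is a name introduced here for illustration.

```python
import pandas as pd

# Same example frame as above: tel0, tel1, tel2 hold zero-padded strings.
df = pd.DataFrame({'tel{}'.format(j): ['{:02d}'.format(i + j) for i in range(10)]
                   for j in range(3)})

# Hypothetical helper name: the columns to search.
cols = ['tel0', 'tel1', 'tel2']

# Run the vectorized test column by column; the result is a boolean DataFrame.
starts = df[cols].apply(lambda col: col.str.startswith('06'))

# A row qualifies if at least one of its columns matched.
mask = starts.any(axis=1)

result = df.loc[mask]
print(result)
```

This avoids spelling out each column by hand, which may be convenient when the list of columns is long or built at run time.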

Once you have the boolean mask, you can select the associated rows using df.loc:

In [75]: df.loc[mask]
Out[75]: 
  tel0 tel1 tel2
4   04   05   06
5   05   06   07
6   06   07   08

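Note that besides startswith, there are many other vectorized str methods. You may find str.contains useful for finding rows that contain a given string. Be aware that str.contains interprets its argument as a regular expression pattern by default: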

In [85]: df['tel0'].str.contains(r'6|7')
Out[85]: 
0    False
1    False
2    False
3    False
4    False
5    False
6     True
7     True
8    False
9    False
Name: tel0, dtype: bool

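Because the pattern is treated as a regular expression, characters such as '+' have a special meaning. The sketch below uses a made-up Series of phone-like strings (not data from the question) to show two ways of matching a literal string, plus an anchored pattern that behaves like startswith.

```python
import re
import pandas as pd

# Hypothetical phone-number strings, invented for this illustration.
s = pd.Series(['+33 6 12', '06 12 34', '33+6', '0612'])

# '+' is a regex quantifier, so match it literally either by turning
# regex interpretation off or by escaping the pattern.
literal_plus = s.str.contains('+33', regex=False)
escaped_plus = s.str.contains(re.escape('+33'))

# Anchoring with '^' restricts the match to the start of the string,
# which gives the same answer as str.startswith('06') here.
anchored = s.str.contains(r'^06')

print(literal_plus.tolist())
print(anchored.tolist())
```

For a plain prefix test, startswith is likely the simpler choice; the anchored regex is mainly useful when the prefix itself needs pattern syntax.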

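If the DataFrame has NaN values in its index, you may get the error message `ValueError: cannot index with vector containing NA / NaN values`. In general, it is best for the index to hold unique, non-NaN values. To make the index unique, you can use `df = df.reset_index()`. This moves the old index into a new column (or columns, in the case of a MultiIndex). The method shown above should then work. (2 upvotes)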
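
A related situation, offered here as an assumption rather than something the comment states: when the columns themselves contain missing values, the str methods can carry those missing values into the mask, and a mask with missing entries cannot be used for boolean indexing. A minimal sketch of one way around this uses the `na` parameter of str.startswith; the small frame below is invented for the example.

```python
import pandas as pd

# Hypothetical frame with missing phone numbers (not from the question).
df = pd.DataFrame({'tel0': ['06123', None, '07123'],
                   'tel1': ['01234', '06999', None]})

# na=False treats a missing value as "no match", so each mask is strictly boolean.
mask = (df['tel0'].str.startswith('06', na=False)
        | df['tel1'].str.startswith('06', na=False))

result = df.loc[mask]
print(result)
```

Whether a missing number should count as a non-match is a design choice; `na=False` simply drops such rows from the selection unless another column matches.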

Archived:	11 years ago
Views:	3957
Last recorded:	11 years ago