pandas + dataframe - select by partial string

Question

euf*_*ria 356 python string dataframe pandas

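I have a DataFrame with 4 columns, 2 of which contain string values. I was wondering if there is a way to select rows based on a partial string match against a particular column?

In other words, a function or lambda function that would do something like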

re.search(pattern, cell_in_question)

that returns a boolean. I am familiar with the syntax df[df['A'] == "hello world"] but can't seem to find a way to do the same with a partial string match, say 'hello'.

Could someone point me in the right direction?

Answer 1

Gar*_*ett 668

Based on GitHub issue #620, it looks like you'll soon be able to do the following:

df[df['A'].str.contains("hello")]

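Update: vectorized string methods (i.e. Series.str) are available in pandas 0.8.1 and up.

Since the str.* methods treat the input pattern as a regular expression, you can use `df[df['A'].str.contains("Hello|Britain")]` (42 upvotes)
Is it possible to convert `.str.contains` to use the [`.query()` api](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html#pandas.DataFrame.query)? (5 upvotes)
How do we go about "Hello" and "Britain" if I want to find them with an "OR" condition? (3 upvotes)
@zyxue [Select rows by partial string query with pandas](/sf/ask/3145315001/) (3 upvotes)
When the order of the substrings matters or is known, you can "AND" the substrings with `df[df.A.str.contains("STR1.*STR2")]`. If the order does not matter or is unknown, use `df[df.A.str.contains("STR1") & df.A.str.contains("STR2")]` (3 upvotes)
`df[df['value'].astype(str).str.contains('1234.+')]` for filtering on columns that are not of string type. (2 upvotes)
If the column contains null values, you must also pass the flag that ignores them (if that is what you want): `df[df['A'].str.contains("hello", na=False)]` (2 upvotes)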
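
To tie the comments above together, here is a minimal sketch of the "OR" and "AND" variants. The sample data is made up for illustration; only the column name 'A' comes from the question, and na=False is passed so that the missing value does not break the boolean indexing.

```python
import pandas as pd

# Hypothetical sample data; the column name 'A' follows the question.
df = pd.DataFrame({"A": ["Hello world", "Great Britain", "Hello Britain", "goodbye", None]})

# OR: either substring may appear (regex alternation, no spaces around '|').
either = df[df["A"].str.contains("Hello|Britain", na=False)]

# AND with a known order: "Hello" followed somewhere later by "Britain".
ordered = df[df["A"].str.contains("Hello.*Britain", na=False)]

# AND with unknown order: combine two boolean masks with '&'.
both = df[df["A"].str.contains("Hello", na=False) & df["A"].str.contains("Britain", na=False)]

print(either["A"].tolist())
print(ordered["A"].tolist())
print(both["A"].tolist())
```

Note that whitespace inside the pattern is significant: "Hello | Britain" would look for "Hello " or " Britain", spaces included.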

Answer 2

sha*_*ron 167

I am using pandas 0.14.1 on macOS in an IPython notebook. I tried the line proposed above:

df[df['A'].str.contains("Hello|Britain")]

and got an error:

"cannot index with vector containing NA / NaN values"

but it worked perfectly when an "== True" condition was added, like this:

df[df['A'].str.contains("Hello|Britain")==True]

Or you can do: df[df['A'].str.contains("Hello|Britain", na=False)] (46 upvotes)
`df[df['A'].astype(str).str.contains("Hello|Britain")]` also works (11 upvotes)
Another solution is: `df[df["A"].str.contains("Hello|Britain") == True]` (2 upvotes)
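
As a rough illustration of why the "== True" trick works, the sketch below compares it with na=False on a made-up object column containing a NaN. Comparing NaN with True yields False, so both approaches end up with the same plain boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical data: an object column with one missing value.
df = pd.DataFrame({"A": pd.Series(["Hello", np.nan, "Britain", "other"], dtype=object)})

# NaN == True evaluates to False, which turns the result into a clean boolean mask.
mask_eq = df["A"].str.contains("Hello|Britain") == True

# Same outcome, stated explicitly: treat missing values as "no match".
mask_na = df["A"].str.contains("Hello|Britain", na=False)

print(df[mask_na])
```

Of the two, na=False states the intent more directly, which is probably why it gathered more votes.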

Answer 3

cs9*_*s95 57

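How do I select by partial string from a pandas DataFrame?

This post is meant for readers who want to

search for a substring in a string column (the simplest case)
search for multiple substrings (similar to isin)
match a whole word from text (e.g., "blue" should match "the sky is blue" but not "bluejay")
match multiple whole words
understand the reason behind "ValueError: cannot index with vector containing NA / NaN values"

...and would like to know more about which methods should be preferred over others.

(P.S.: I've seen a lot of questions on similar topics, and I thought it would be good to leave this here.)

Basic Substring Search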

# setup
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})
df1

      col
0     foo
1  foobar
2     bar
3     baz

str.contains can be used to perform either substring searches or regex-based searches. The search defaults to regex-based unless you explicitly disable it.

Here is an example of a regex-based search,

# find rows in `df1` which contain "foo" followed by something
df1[df1['col'].str.contains(r'foo(?!$)')]

      col
1  foobar

Sometimes a regex search is not required, so specify regex=False to disable it.

#select all rows containing "foo"
df1[df1['col'].str.contains('foo', regex=False)]
# same as df1[df1['col'].str.contains('foo')] but faster.

      col
0     foo
1  foobar

Performance-wise, regex search is slower than substring search:

df2 = pd.concat([df1] * 1000, ignore_index=True)

%timeit df2[df2['col'].str.contains('foo')]
%timeit df2[df2['col'].str.contains('foo', regex=False)]

6.31 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Avoid using regex-based search if you don't need it.
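
Besides speed, regex=False also changes what gets matched when the search string contains regex metacharacters. A small sketch with made-up data (the frame and column contents are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data containing characters that are special in regular expressions.
prices = pd.DataFrame({"col": ["a.b", "axb", "cost: $5", "cost: 5"]})

# As a regex, '.' matches any character, so "axb" matches as well.
regex_dot = prices["col"].str.contains("a.b").tolist()

# As a literal substring, only the row that really contains "a.b" matches.
literal_dot = prices["col"].str.contains("a.b", regex=False).tolist()

# '$' is an end-of-string anchor in a regex; with regex=False it is just a character.
literal_dollar = prices["col"].str.contains("$5", regex=False).tolist()

print(regex_dot, literal_dot, literal_dollar)
```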

Addressing ValueErrors
Sometimes, performing a substring search and filtering on the result will result in

ValueError: cannot index with vector containing NA / NaN values

This is usually because of mixed data or NaNs in your object column,

s = pd.Series(['foo', 'foobar', np.nan, 'bar', 'baz', 123])
s.str.contains('foo|bar')

0     True
1     True
2      NaN
3     True
4    False
5      NaN
dtype: object


s[s.str.contains('foo|bar')]
# ---------------------------------------------------------------------------
# ValueError                                Traceback (most recent call last)

String methods cannot be applied to anything that is not a string, so the result is (naturally) NaN. In this case, specify na=False to ignore non-string data,

s.str.contains('foo|bar', na=False)

0     True
1     True
2    False
3     True
4    False
5    False
dtype: bool

Multiple Substring Search

This is most easily achieved through a regex search using the regex OR pipe.

# Slightly modified example.
df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})
df4

          col
0     foo abc
1  foobar xyz
2       bar32
3      baz 45

df4[df4['col'].str.contains(r'foo|baz')]

          col
0     foo abc
1  foobar xyz
3      baz 45

You can also create a list of terms, then join them:

terms = ['foo', 'baz']
df4[df4['col'].str.contains('|'.join(terms))]

          col
0     foo abc
1  foobar xyz
3      baz 45

Sometimes, it is wise to escape your terms in case they contain characters that can be interpreted as regex metacharacters. If your terms contain any of the following characters...

. ^ $ * + ? { } [ ] \ | ( )

Then, you'll need to use re.escape to escape them:

import re
df4[df4['col'].str.contains('|'.join(map(re.escape, terms)))]

          col
0     foo abc
1  foobar xyz
3      baz 45

re.escape has the effect of escaping the special characters so that they are treated literally.

re.escape(r'.foo^')
# '\\.foo\\^'

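
The terms used so far ('foo', 'baz') contain no metacharacters, so escaping makes no visible difference there. The sketch below uses hypothetical terms that do contain them, to show the case where re.escape matters; the frame and the terms are assumptions made up for this example.

```python
import re
import pandas as pd

# Hypothetical data and search terms containing regex metacharacters.
langs = pd.DataFrame({"col": ["c++ guide", "c guide", "python (3.x)", "python 3ax"]})
terms = ["c++", "(3.x)"]

# Escape every term so '+', '(', '.', and ')' are matched literally, then join with '|'.
pattern = "|".join(map(re.escape, terms))
matches = langs[langs["col"].str.contains(pattern)]

print(pattern)
print(matches)
```

Without the escaping, the joined string would be read as regex syntax rather than literal text, so it could match the wrong rows or fail to compile, depending on the terms.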

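Matching Entire Word(s)

By default, the substring search looks for the specified substring/pattern regardless of whether it is a full word or not. To match only full words, we will need to make use of regular expressions here; in particular, our pattern will need to specify word boundaries (\b).

For example,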

df3 = pd.DataFrame({'col': ['the sky is blue', 'bluejay by the window']})
df3

                     col
0        the sky is blue
1  bluejay by the window

Now consider,

df3[df3['col'].str.contains('blue')]

                     col
0        the sky is blue
1  bluejay by the window

versus

df3[df3['col'].str.contains(r'\bblue\b')]

               col
0  the sky is blue

Multiple Whole Word Search

Similar to the above, except we add a word boundary (\b) to the joined pattern.

p = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df4[df4['col'].str.contains(p)]

       col
0  foo abc
3   baz 45

where p looks like this,

p
# '\\b(?:foo|baz)\\b'

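
The non-capturing group (?:...) in p is doing real work. The sketch below, which reuses the df4 data from above, compares the grouped pattern with an ungrouped one to show what goes wrong without it: alternation binds more loosely than \b, so each boundary would attach to only one of the terms.

```python
import re
import pandas as pd

df4 = pd.DataFrame({"col": ["foo abc", "foobar xyz", "bar32", "baz 45"]})
terms = ["foo", "baz"]
joined = "|".join(map(re.escape, terms))

# Grouped: both boundaries apply to every alternative.
grouped = r"\b(?:{})\b".format(joined)

# Ungrouped: parsed as (\bfoo) OR (baz\b), so "foobar" slips through.
ungrouped = r"\b{}\b".format(joined)

grouped_mask = df4["col"].str.contains(grouped).tolist()
ungrouped_mask = df4["col"].str.contains(ungrouped).tolist()

print(grouped_mask)
print(ungrouped_mask)
```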

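A Great Alternative: Use List Comprehensions!

Because you can! And you should! They are usually a little bit faster than string methods, because string methods are hard to vectorize and usually have loopy implementations.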

Instead of,

df1[df1['col'].str.contains('foo', regex=False)]

use the in operator inside a list comprehension,

df1[['foo' in x for x in df1['col']]]

      col
0     foo
1  foobar

Instead of,

regex_pattern = r'foo(?!$)'
df1[df1['col'].str.contains(regex_pattern)]

use re.compile (to cache your regex) + Pattern.search inside a list comprehension,

p = re.compile(regex_pattern, flags=re.IGNORECASE)
df1[[bool(p.search(x)) for x in df1['col']]]

      col
1  foobar

If "col" has NaNs, then instead of

df1[df1['col'].str.contains(regex_pattern, na=False)]

use,

def try_search(p, x):
    try:
        return bool(p.search(x))
    except TypeError:
        return False

p = re.compile(regex_pattern)
df1[[try_search(p, x) for x in df1['col']]]

      col
1  foobar

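
As a sanity check, the sketch below runs the try_search approach and str.contains with na=False over the same mixed series and confirms that they agree. The series is made up for illustration and deliberately includes a NaN and an integer.

```python
import re
import numpy as np
import pandas as pd

# Hypothetical mixed data: strings, a missing value, and an integer.
s = pd.Series(["foo", "foobar", np.nan, "bar", 123], dtype=object)
p = re.compile(r"foo(?!$)")

def try_search(p, x):
    # Pattern.search raises TypeError for non-string input; treat that as "no match".
    try:
        return bool(p.search(x))
    except TypeError:
        return False

mask_listcomp = [try_search(p, x) for x in s]
mask_contains = s.str.contains(p.pattern, na=False).tolist()

print(mask_listcomp)
print(mask_contains)
```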

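More Options for Partial String Matching: `np.char.find`, `np.vectorize`, `DataFrame.query`.

In addition to str.contains and list comprehensions, you can also use the following alternatives.

np.char.find
Supports substring searches (read: no regex) only.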

df4[np.char.find(df4['col'].values.astype(str), 'foo') > -1]

          col
0     foo abc
1  foobar xyz

np.vectorize
This is a wrapper around a loop, but with less overhead than most pandas str methods.

f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df1['col'], 'foo')
# array([ True,  True, False, False])

df1[f(df1['col'], 'foo')]

      col
0     foo
1  foobar

Regex solutions are also possible:

regex_pattern = r'foo(?!$)'
p = re.compile(regex_pattern)
f = np.vectorize(lambda x: pd.notna(x) and bool(p.search(x)))
df1[f(df1['col'])]

      col
1  foobar

DataFrame.query
Supports string methods through the python engine. This offers no visible performance benefits, but is nonetheless useful to know if you need to dynamically generate your queries.

df1.query('col.str.contains("foo")', engine='python')

      col
0     foo
1  foobar


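More information on the query and eval family of methods can be found at Dynamic Expression Evaluation in pandas using pd.eval().

Recommended Usage Precedence

(First) str.contains, for its simplicity and ease of handling NaNs and mixed data
List comprehensions, for their performance (especially if your data is purely strings)
np.vectorize
(Last) df.query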

Answer 4

Phi*_*arz 46

If anyone wonders how to perform a related problem: "Select column by partial string"

Use:

df.filter(like='hello')  # select columns which contain the word hello

And to select rows by partial string matching, pass axis=0 to filter:

# selects rows which contain the word hello in their index label
df.filter(like='hello', axis=0)

which can be further distilled to `df.filter(like='a')` (16 upvotes)
This can be distilled to: `df.loc[:, df.columns.str.contains('a')]` (6 upvotes)
@PV8 the question already exists: /sf/ask/2208598871/. But when I search Google for "pandas Select column by partial string", this thread comes up first (2 upvotes)
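
One thing worth spelling out: df.filter matches against labels (column names, or index labels with axis=0), not against the cell values. A minimal sketch with made-up labels, also showing the regex parameter:

```python
import pandas as pd

# Hypothetical frame; the labels are chosen only to illustrate the matching.
df = pd.DataFrame(
    {"hello_a": [1, 2], "b_hello": [3, 4], "other": [5, 6]},
    index=["hello_row", "plain_row"],
)

cols_like = df.filter(like="hello")          # columns whose label contains "hello"
cols_regex = df.filter(regex=r"^hello")      # columns whose label starts with "hello"
rows_like = df.filter(like="hello", axis=0)  # rows whose index label contains "hello"

print(cols_like.columns.tolist())
print(cols_regex.columns.tolist())
print(rows_like.index.tolist())
```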

Answer 5

小智 27

Quick note: if you want to do selection based on a partial string contained in the index, try the following:

df['stridx']=df.index
df[df['stridx'].str.contains("Hello|Britain")]

You can just do df[df.index.to_series().str.contains('LLChit')] (5 upvotes)
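
A sketch of a variant that skips the helper column altogether: an Index of strings has its own .str accessor, so the mask can be built from df.index directly. The frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical frame with string index labels.
df = pd.DataFrame({"value": [1, 2, 3]}, index=["Hello there", "Great Britain", "elsewhere"])

# Index.str.contains returns a boolean array that can be used for row selection.
mask = df.index.str.contains("Hello|Britain")
result = df[mask]

print(result)
```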

Answer 6

Mik*_*ike 20

Say you have the following DataFrame:

>>> df = pd.DataFrame([['hello', 'hello world'], ['abcd', 'defg']], columns=['a','b'])
>>> df
       a            b
0  hello  hello world
1   abcd         defg

You can always use the in operator in a lambda expression to create your filter.

>>> df.apply(lambda x: x['a'] in x['b'], axis=1)
0     True
1    False
dtype: bool

The trick here is to use the axis=1 option of apply to pass elements to the lambda function row by row, as opposed to column by column.
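
The same row-wise test can also be written as a list comprehension over the two columns, which avoids building a Series for every row the way apply with axis=1 does. A minimal sketch using the same data:

```python
import pandas as pd

df = pd.DataFrame([["hello", "hello world"], ["abcd", "defg"]], columns=["a", "b"])

# Pair up the values of 'a' and 'b' row by row and test substring membership.
mask = [a in b for a, b in zip(df["a"], df["b"])]
result = df[mask]

print(result)
```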

Answer 7

car*_*mom 12

Should you need to do a case-insensitive search for a string in a pandas dataframe column:

df[df['A'].str.contains("hello", case=False)]

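
For regex patterns, passing flags=re.IGNORECASE should give the same result as case=False. A small sketch on made-up data:

```python
import re
import pandas as pd

# Hypothetical data with mixed casing.
df = pd.DataFrame({"A": ["Hello world", "HELLO", "goodbye"]})

by_case = df["A"].str.contains("hello", case=False).tolist()
by_flag = df["A"].str.contains("hello", flags=re.IGNORECASE).tolist()

print(by_case, by_flag)
```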

Answer 8

dar*_*ils 9

You can try treating them as strings:

df[df['A'].astype(str).str.contains("Hello|Britain")]


Answer 9

euf*_*ria 7

Here's what I ended up doing for partial string matches. If anyone has a more efficient way of doing this, please let me know.

import re
import pandas as pd

def stringSearchColumn_DataFrame(df, colName, regex):
    newdf = pd.DataFrame()
    # Series.iteritems() was removed in pandas 2.0; items() is the replacement.
    for idx, record in df[colName].items():

        if re.search(regex, record):
            newdf = pd.concat([df[df[colName] == record], newdf], ignore_index=True)

    return newdf

Should be 2x to 3x faster if you compile the regex before the loop: regex = re.compile(regex), and then if regex.search(record) (3 upvotes)

Answer 10

Ang*_*ena 7

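Suppose we have a column named "ENTITY" in the dataframe df. We can filter df with a mask to get the whole dataframe where the rows of the "ENTITY" column do not contain "DM", as follows: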

mask = df['ENTITY'].str.contains('DM')

df = df.loc[~(mask)].copy(deep=True)

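
If the column can contain missing values, it is probably safer to build the mask with na=False before negating it, so that the mask is strictly boolean. A sketch on made-up data (note that "ADM" also contains "DM" and is therefore dropped):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({"ENTITY": pd.Series(["DM-1", "XY-2", np.nan, "ADM"], dtype=object)})

# na=False makes the mask strictly boolean, so '~' inverts it cleanly.
mask = df["ENTITY"].str.contains("DM", na=False)
kept = df.loc[~mask]

print(kept)
```

With this choice, rows whose value is missing count as "does not contain" and are kept.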

Answer 11

Kat*_*atu 5

Using contains didn't work well for my string with special characters. Find worked, though.

df[df['A'].str.find("hello") != -1]

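
The likely reason is that str.find always searches for a literal substring, whereas contains interprets its argument as a regex by default. A sketch with a made-up series, showing that contains with regex=False gives the same answer as find:

```python
import pandas as pd

# Hypothetical data; the parentheses are regex metacharacters.
s = pd.Series(["total (net)", "total net", "5 + 5"])

# str.find returns the position of the literal substring, or -1 if absent.
by_find = (s.str.find("(net)") != -1).tolist()

# The same literal search expressed with contains.
by_literal = s.str.contains("(net)", regex=False).tolist()

print(by_find, by_literal)
```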

Answer 12

Gra*_*non 5

A more generalised example, if you are looking for parts of a word or specific words in a string:

df = pd.DataFrame([('cat andhat', 1000.0), ('hat', 2000000.0), ('the small dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])

Specific parts of a sentence or word:

searchfor = '.*cat.*hat.*|.*the.*dog.*'

Create a column showing the affected rows (these can always be filtered out as necessary)

df["TrueFalse"]=df['col1'].str.contains(searchfor, regex=True)

    col1             col2           TrueFalse
0   cat andhat       1000.0         True
1   hat              2000000.0      False
2   the small dog    1000.0         True
3   fog              330000.0       False
4   pet              330000.0       False

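
Since str.contains searches anywhere in the string, the leading and trailing .* in searchfor should not be needed. The sketch below checks that a shorter pattern gives the same mask on the col1 values from this answer:

```python
import pandas as pd

df = pd.DataFrame({"col1": ["cat andhat", "hat", "the small dog", "fog", "pet"]})

# Pattern from the answer, with explicit '.*' padding on both sides.
verbose = df["col1"].str.contains(".*cat.*hat.*|.*the.*dog.*", regex=True).tolist()

# Equivalent pattern without the padding.
compact = df["col1"].str.contains("cat.*hat|the.*dog", regex=True).tolist()

print(verbose)
print(compact)
```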

Archived: 13 years, 5 months ago
Views: 491174
Last activity: 6 years ago
