How to use pandas string contains (str.contains) in a loop?

Question

sne*_*e89 2 python string dataframe pandas

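I have a pandas df with a column A that is a Series of strings. Each item in the Series (i.e., each row of the DataFrame) is just one long string of comma-separated values. I want to create a new column B that holds, for each row, the number of items from a separate list that appear in that row of column A. For example:

My list looks like this: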

list = ('dog', 'bird', 'cat')

My dataframe looks like this:

A                           B
dog, bird                   2
cat, bird                   2
dog, snake                  1
cat, bird, snake            2
dog, bird, cat, snake       3
dog, bird cat               3

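I'm trying to write a nested loop that does the following: starting at df.A[0] (i.e., the first value of df.A), check whether it contains the first value of the list (i.e., 'dog'). If df.A[0] contains it, add 1 to B. Then, staying on the same row of df.A, move on to the second value of the list (i.e., 'bird'). If df.A[0] also contains that value, add another 1 to B, and so on.

Here is the code I'm trying to use.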

for i in df['A']:
    for j in list:
        if i.str.contains(j):
            df['B'] += 1

However, I keep getting the error:

'str' object has no attribute 'str'

How do I tell pandas to look at the whole Series while still looping with the structure above? Or, what is the best way to solve this problem?

Answer 1

cs9*_*s95 5

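A few notes:

Don't iterate over a DataFrame if you can avoid it. Always look for a vectorized solution first. If there isn't one, fall back to a list comprehension.
When you iterate over a column, you iterate over its individual string items. Those are plain Python strings and have no .str attribute.
Don't name variables/objects list or anything similar (dict, tuple); doing so shadows the builtins. I've renamed your variable to substr below.

The KISS solution involves str.findall + str.len. No splitting needed.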
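
For the case where you really do have to loop, here is a minimal sketch of the list-comprehension fallback mentioned above. It rebuilds the question's data by hand (that reconstruction is my assumption) and uses the plain `in` operator, because each element of the column is an ordinary `str`, not a Series.

```python
import pandas as pd

# Sample data rebuilt from the question; `substr` replaces the name `list`.
substr = ('dog', 'bird', 'cat')
df = pd.DataFrame({'A': ['dog, bird', 'cat, bird', 'dog, snake',
                         'cat, bird, snake', 'dog, bird, cat, snake',
                         'dog, bird cat']})

# Iterating over df['A'] yields plain Python strings, which is why
# `i.str.contains(j)` raised the error in the question.
first = next(iter(df['A']))
print(type(first).__name__, hasattr(first, 'str'))  # str False

# For each row, count how many of the substrings occur in it.
df['B'] = [sum(s in row for s in substr) for row in df['A']]
print(df['B'].tolist())  # [2, 2, 1, 2, 3, 3]
```

Note that this counts each substring at most once per row, whereas a findall-based count tallies every occurrence, so the two can differ on rows with repeats such as 'dog, dog'.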

substr = ('dog', 'bird', 'cat')
df['B'] = df['A'].str.findall('|'.join(substr)).str.len()

df['B']

0    2
1    2
2    1
3    2
4    3
5    3
Name: B, dtype: int64
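
One caveat with the plain `'|'.join(substr)` pattern is that it matches anywhere inside a word and treats the substrings as regular expressions. The sketch below is an optional variant, not part of the original answer: it escapes each substring with `re.escape` and wraps the alternation in `\b` word boundaries. The `tricky` Series is made-up data for illustration.

```python
import re
import pandas as pd

substr = ('dog', 'bird', 'cat')
df = pd.DataFrame({'A': ['dog, bird', 'cat, bird', 'dog, snake',
                         'cat, bird, snake', 'dog, bird, cat, snake',
                         'dog, bird cat']})

# Escape regex metacharacters and only match whole words.
pattern = r'\b(?:' + '|'.join(map(re.escape, substr)) + r')\b'
df['B'] = df['A'].str.findall(pattern).str.len()
print(df['B'].tolist())  # [2, 2, 1, 2, 3, 3]

# Hypothetical row showing the difference between the two patterns.
tricky = pd.Series(['catfish, dogma'])
loose = tricky.str.findall('|'.join(substr)).str.len().tolist()
strict = tricky.str.findall(pattern).str.len().tolist()
print(loose, strict)  # [2] [0]
```

If you only need the number of matches, `df['A'].str.count(pattern)` should give the same counts without building the intermediate lists.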

If you have large strings and a large number of substrings, you may want to look into the Aho-Corasick algorithm.
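
To give a rough idea of what that looks like, here is a small pure-Python sketch of Aho-Corasick counting. The function names `build_automaton` and `count_matches` are made up for this illustration, and a dedicated library would likely be a better choice for real workloads. Unlike regex findall, it also counts overlapping matches.

```python
from collections import deque

def build_automaton(words):
    """Build trie transitions, failure links, and per-state match counts."""
    goto = [{}]   # goto[state][char] -> next state; state 0 is the root
    fail = [0]    # fail[state] -> longest proper suffix state
    out = [0]     # out[state] -> number of words ending at this state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(0)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state] += 1
    # Breadth-first pass: depth-1 states keep fail = 0, deeper states
    # inherit their failure link from their parent's failure chain.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Words ending at the suffix state also end here.
            out[nxt] += out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out

def count_matches(text, automaton):
    """Count all occurrences of the words in a single pass over text."""
    goto, fail, out = automaton
    state = 0
    total = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        total += out[state]
    return total

substr = ('dog', 'bird', 'cat')
rows = ['dog, bird', 'cat, bird', 'dog, snake',
        'cat, bird, snake', 'dog, bird, cat, snake', 'dog, bird cat']
automaton = build_automaton(substr)
counts = [count_matches(row, automaton) for row in rows]
print(counts)  # [2, 2, 1, 2, 3, 3]
```

The automaton is built once and each row is then scanned in a single pass, regardless of how many substrings there are. With a DataFrame you could apply it as `[count_matches(row, automaton) for row in df['A']]`.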
