A faster way to iterate over a Pandas DataFrame?

use*_*979 4 python dataframe pandas

I have a list of strings, say:

fruit_list = ["apple", "banana", "coconut"]

And I have a Pandas DataFrame, for example:

import pandas as pd

data = [['Apple farm', 10], ['Banana field', 15], ['Coconut beach', 14], ['corn field', 10]]
df = pd.DataFrame(data, columns = ['fruit_source', 'value'])

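I want to populate a new column based on a text search of the existing column "fruit_source". The value I want to fill in is whichever element of the list matches the text in that column of df. One way to write it is: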

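df["fruit"] = "fruit not found"  # default when no fruit matches

for index, row in df.iterrows():
    for fruit in fruit_list:
        # Case-insensitive substring match ("apple" vs. "Apple farm")
        if fruit in row['fruit_source'].lower():
            df.loc[index, 'fruit'] = fruit
            break  # keep the first match instead of overwriting it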

This gives the DataFrame a new column holding the fruit found in each fruit source.

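However, when this is scaled up to larger DataFrames, the iteration may cause performance problems. The reason is that every row also loops over the whole list, so the work grows with the number of rows times the number of list entries.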

Is there a more efficient way to do this?

AKX*_*AKX 6

You can let Pandas do the work, like this:

# Prime series with the "fruit not found" value
df['fruit'] = "fruit not found"
for fruit in fruit_list:
    # Generate boolean series of rows matching the fruit
    mask = df['fruit_source'].str.contains(fruit, case=False)
    # Overwrite the matching rows with the name of the fruit
    df.loc[mask, 'fruit'] = fruit

print(df) will then show

    fruit_source  value            fruit
0     Apple farm     10            apple
1   Banana field     15           banana
2  Coconut beach     14          coconut
3     corn field     10  fruit not found
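
As a side note, the same one-mask-per-fruit idea can also be sketched with `numpy.select`, which takes all the masks at once together with a default value. This is a variation, not part of the answer above. `regex=False` is an added assumption that the fruit names are plain substrings, and with `np.select` the first matching entry of `fruit_list` wins for a row, whereas in the loop above a later match overwrites an earlier one.

```python
import numpy as np
import pandas as pd

fruit_list = ["apple", "banana", "coconut"]
data = [['Apple farm', 10], ['Banana field', 15], ['Coconut beach', 14], ['corn field', 10]]
df = pd.DataFrame(data, columns=['fruit_source', 'value'])

# One boolean mask per fruit: case-insensitive, literal substring match
masks = [
    df['fruit_source'].str.contains(fruit, case=False, regex=False)
    for fruit in fruit_list
]

# For each row, take the first fruit whose mask is True, else the default
df['fruit'] = np.select(masks, fruit_list, default="fruit not found")
```

For the sample data this should fill the column the same way as the loop above, since no row matches more than one fruit.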


Cor*_*ien 5

Use str.extract with a regular expression pattern to avoid the loop:

import re

pattern = fr"({'|'.join(fruit_list)})"
df['fruit'] = df['fruit_source'].str.extract(pattern, flags=re.IGNORECASE) \
                                .fillna('fruit not found')

Output:

>>> df
    fruit_source  value            fruit
0     Apple farm     10            Apple
1   Banana field     15           Banana
2  Coconut beach     14          Coconut
3     corn field     10  fruit not found

>>> pattern
'(apple|banana|coconut)'
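
Note that `str.extract` returns the matched text, which is why the column holds `Apple` rather than `apple`. If the column should carry the spelling used in `fruit_list`, or if the names might contain regex metacharacters, a variant could look roughly like the sketch below. `re.escape`, `expand=False` and the `.str.lower()` step are additions rather than part of the answer above, and the lower-casing assumes every entry of `fruit_list` is itself lower-case.

```python
import re
import pandas as pd

fruit_list = ["apple", "banana", "coconut"]
data = [['Apple farm', 10], ['Banana field', 15], ['Coconut beach', 14], ['corn field', 10]]
df = pd.DataFrame(data, columns=['fruit_source', 'value'])

# Escape each name so it is matched literally, then join into one alternation
pattern = "(" + "|".join(re.escape(fruit) for fruit in fruit_list) + ")"

# expand=False returns a Series when the pattern has a single capture group
extracted = df['fruit_source'].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Lower-case the match to recover the fruit_list spelling; unmatched rows are NaN
df['fruit'] = extracted.str.lower().fillna('fruit not found')
```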