use*_*979 4 python dataframe pandas
I have a list of strings, for example:
fruit_list = ["apple", "banana", "coconut"]
I also have a Pandas DataFrame, for example:
import pandas as pd
data = [['Apple farm', 10], ['Banana field', 15], ['Coconut beach', 14], ['corn field', 10]]
df = pd.DataFrame(data, columns = ['fruit_source', 'value'])
I want to populate a new column based on a text search of the existing column "fruit_source". The value I want to fill in is whichever element of the list matches that column in df. One way to write it is:
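df["fruit"] = "fruit not found"  # default for rows where no fruit matches
for index, row in df.iterrows():
    for fruit in fruit_list:
        # lower-case the source so "apple" matches "Apple farm"
        if fruit in row['fruit_source'].lower():
            df.loc[index, 'fruit'] = fruit
            break  # stop at the first match so it is not overwritten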
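This fills the DataFrame with a new column holding the fruit collected at each fruit source.
However, when this is scaled to a larger DataFrame, the iteration may cause performance problems. The inner loop over the list runs once for every row, so the total work grows with the number of rows multiplied by the length of the list.
Is there a more efficient way to do this?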
You can let Pandas do the work like this:
# Prime series with the "fruit not found" value
df['fruit'] = "fruit not found"
for fruit in fruit_list:
    # Generate boolean series of rows matching the fruit
    mask = df['fruit_source'].str.contains(fruit, case=False)
    # Replace those rows with the name of the fruit (assigning back avoids
    # chained in-place modification, which newer pandas versions warn about)
    df['fruit'] = df['fruit'].mask(mask, fruit)
print(df) then outputs:
    fruit_source  value            fruit
0     Apple farm     10            apple
1   Banana field     15           banana
2  Coconut beach     14          coconut
3     corn field     10  fruit not found
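As a rough sketch of the same idea, the loop can be wrapped in a small function. The name `tag_fruit`, its `default` parameter, and the extra sample row are hypothetical and not part of the answer above. The example is meant to show two things: the Python loop runs once per fruit rather than once per row, and when a row contains several fruits, the entry that appears later in the list wins.

```python
import pandas as pd


def tag_fruit(df, fruit_list, default="fruit not found"):
    """Hypothetical helper: return a Series naming the fruit found in each row."""
    # Start with the default label for every row
    result = pd.Series(default, index=df.index, dtype=object)
    for fruit in fruit_list:
        # Case-insensitive literal substring test, vectorized over all rows;
        # na=False treats missing sources as "no match"
        mask = df["fruit_source"].str.contains(fruit, case=False, regex=False, na=False)
        # Overwrite matching rows; later fruits override earlier ones
        result = result.mask(mask, fruit)
    return result


fruit_list = ["apple", "banana", "coconut"]
sample = pd.DataFrame(
    {"fruit_source": ["Apple farm", "corn field", "Banana and apple stall"]}
)
sample["fruit"] = tag_fruit(sample, fruit_list)
print(sample)
```

If the first match should win instead, one option under the same assumptions is to iterate over `reversed(fruit_list)`.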
Use str.extract with a regex pattern to avoid the loop:
import re
pattern = fr"({'|'.join(fruit_list)})"
df['fruit'] = df['fruit_source'].str.extract(pattern, flags=re.IGNORECASE) \
                                .fillna('fruit not found')
Output:
>>> df
    fruit_source  value            fruit
0     Apple farm     10            Apple
1   Banana field     15           Banana
2  Coconut beach     14          Coconut
3     corn field     10  fruit not found
>>> pattern
'(apple|banana|coconut)'
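Note that the extracted text keeps the casing of the source column ("Apple" rather than "apple"). A minimal sketch of one way to get the list's spelling back, assuming every entry in `fruit_list` is lower case, is to lower-case the extracted match. The `re.escape` call is an added precaution, not part of the answer above, in case a name ever contains regex metacharacters.

```python
import re

import pandas as pd

fruit_list = ["apple", "banana", "coconut"]
data = [["Apple farm", 10], ["Banana field", 15], ["Coconut beach", 14], ["corn field", 10]]
df = pd.DataFrame(data, columns=["fruit_source", "value"])

# Escape each name so it is matched literally, then join into one alternation
pattern = "(" + "|".join(re.escape(fruit) for fruit in fruit_list) + ")"

# expand=False returns a Series (one capture group) instead of a DataFrame
extracted = df["fruit_source"].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Lower-case the matched text (assumes the list entries are lower case),
# then label rows without a match
df["fruit"] = extracted.str.lower().fillna("fruit not found")
print(df)
```

Unlike the loop-based version, this picks the match that occurs first in the source text when a row mentions more than one fruit.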