How to select rows in a Pandas dataframe based on string matching in multiple columns

arr*_*vis 5 python dataframe pandas

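I don't think this exact question has been answered yet, so here goes.

I have a Pandas dataframe and I want to select all rows that contain a string in column A or column B.

Suppose the dataframe looks like this: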

import pandas

d = {'id': ["1", "2", "3", "4"],
     'title': ["Horses are good", "Cats are bad", "Frogs are nice", "Turkeys are the best"],
     'description': ["Horse epitome", "Cats bad but horses good", "Frog fancier", "Turkey tome, not about horses"],
     'tags': ["horse, cat, frog, turkey", "horse, cat, frog, turkey", "horse, cat, frog, turkey", "horse, cat, frog, turkey"],
     'date': ["2019-01-01", "2019-10-01", "2018-08-14", "2016-11-29"]}

dataframe = pandas.DataFrame(d)

This gives:

id              title                      description               tag           date
1   "Horses are good"                  "Horse epitome"       "horse, cat"    2019-01-01
2      "Cats are bad"                       "Cats bad"       "horse, cat"    2019-10-01
3    "Frogs are nice"      "Frog fancier, horses good"      "horse, frog"    2018-08-14
4   "Turkey are best"                    "Turkey tome"    "turkey, horse"    2016-11-29

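Suppose I want to create a new dataframe containing the rows that have the string horse (ignoring capitalization) in the title or description column, but not in the tag column (or any other column).

The result should be (rows 2 and 4 are dropped):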

id                title                     description                 tag          date  
1     "Horses are good"                  "Horse epitome"       "horse, cat"    2019-01-01
3      "Frogs are nice"      "Frog fancier, horses good"      "horse, frog"    2018-08-14

I have seen some answers for a single column, such as:

dataframe[dataframe['title'].str.contains('horse')]

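But I am not sure (1) how to add multiple columns to this statement and (2) how to modify it with something like string.lower() so that capitalization in the column values is ignored when matching.

Thanks in advance!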

jez*_*ael 7

If you want to specify the columns to test, one possible solution is to join those columns together and then test with Series.str.contains and case=False:

s = dataframe['title'] + dataframe['description']
df = dataframe[s.str.contains('horse', case=False)]
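
One caveat with plain concatenation: the two strings are joined with nothing between them, so a match can straddle the column boundary, and a missing value in either column turns the joined value into NaN. A minimal sketch of a more defensive variant, assuming made-up toy data (the frame `toy` and its values are hypothetical, chosen only to trigger both effects):

```python
import pandas as pd

# Hypothetical toy data: row 0 splits "horse" across the two columns,
# row 2 has a missing description.
toy = pd.DataFrame({
    "title": ["A white hor", "Cats are bad", "Horses are good"],
    "description": ["se-shoe crab", "No equines here", None],
})

# Plain concatenation: row 0 contains "horse" only because the pieces
# meet at the join, and row 2 becomes NaN.
naive = toy["title"] + toy["description"]

# Filling missing values and joining with a separator avoids both effects.
joined = toy["title"].fillna("") + " " + toy["description"].fillna("")
mask = joined.str.contains("horse", case=False)
result = toy[mask]
```

Passing `na=False` to `str.contains` is another way to treat missing values as non-matches.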

Or create a condition for each column and chain them with bitwise OR (|):

df = dataframe[dataframe['title'].str.contains('horse', case=False) | 
               dataframe['description'].str.contains('horse', case=False)]
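
When there are more than two columns to search, writing one condition per column gets repetitive. A possible generalization, sketched here with the values from the table displayed in the question, is to apply `str.contains` to each listed column and combine the results row-wise with `any`. The name `columns_to_search` is an assumption for illustration:

```python
import pandas as pd

# Values taken from the table displayed in the question.
dataframe = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "title": ["Horses are good", "Cats are bad", "Frogs are nice", "Turkey are best"],
    "description": ["Horse epitome", "Cats bad", "Frog fancier, horses good", "Turkey tome"],
    "tag": ["horse, cat", "horse, cat", "horse, frog", "turkey, horse"],
})

columns_to_search = ["title", "description"]  # hypothetical name; adjust as needed

# One boolean column per searched column; na=False treats missing values as non-matches.
per_column = dataframe[columns_to_search].apply(
    lambda col: col.str.contains("horse", case=False, na=False)
)

# A row is kept if any of the searched columns matched.
mask = per_column.any(axis=1)
df = dataframe[mask]
```

Because only the listed columns are searched, the tag column never influences the result, and the rows with id 1 and 3 remain.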

Also, if you want to specify columns that must not match, chain the solution with bitwise AND (&) and an inverted condition (~) for NOT MATCH:

df = dataframe[s.str.contains('horse', case=False) &
               ~dataframe['tags'].str.contains('horse', case=False)]

For the second solution, add () around all the conditions chained by OR:

df = dataframe[(dataframe['title'].str.contains('horse', case=False) |
                dataframe['description'].str.contains('horse', case=False)) &
               ~dataframe['tags'].str.contains('horse', case=False)]
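
The parentheses matter because Python's `&` binds more tightly than `|`, so `a | b & ~c` is evaluated as `a | (b & ~c)`. A small sketch of the difference, assuming hypothetical data in which only the first row has horse in its tags:

```python
import pandas as pd

# Hypothetical data: only row 0 mentions "horse" in the tags column.
dataframe = pd.DataFrame({
    "title": ["Horses are good", "Cats are bad", "Frogs are nice", "Turkeys are best"],
    "description": ["Horse epitome", "Cats bad", "Frog fancier, horses good", "Turkey tome"],
    "tags": ["horse, cat", "cat", "frog", "turkey"],
})

in_title = dataframe["title"].str.contains("horse", case=False)
in_description = dataframe["description"].str.contains("horse", case=False)
in_tags = dataframe["tags"].str.contains("horse", case=False)

# Intended logic: (title OR description) AND NOT tags.
with_parens = (in_title | in_description) & ~in_tags

# Without parentheses this is title OR (description AND NOT tags),
# so row 0 slips through on its title alone.
without_parens = in_title | in_description & ~in_tags

df = dataframe[with_parens]
```

Here the two masks differ on the first row, which the parenthesized version excludes and the unparenthesized version keeps.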

EDIT:

As @WeNYoBen commented, you can add DataFrame.copy at the end to prevent SettingWithCopyWarning, for example:

s = dataframe['title'] + dataframe['description']
df = dataframe[s.str.contains('horse', case=False)].copy()
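
A short illustration of why the copy helps, assuming a reduced two-column version of the sample data and a hypothetical new column named `matched`: the filtered frame is an independent object, so assigning to it neither warns nor alters the original.

```python
import pandas as pd

# Reduced sample data, for illustration only.
dataframe = pd.DataFrame({
    "title": ["Horses are good", "Cats are bad", "Frogs are nice", "Turkeys are best"],
    "description": ["Horse epitome", "Cats bad", "Frog fancier, horses good", "Turkey tome"],
})

s = dataframe["title"] + " " + dataframe["description"]
df = dataframe[s.str.contains("horse", case=False)].copy()

# "matched" is a hypothetical column; it is added to the copy only.
df["matched"] = True
```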