Suppose we have a DataFrame in Python pandas that looks like this:
import pandas as pd
df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': [u'aball', u'bball', u'cnut', u'fball']})
Or, in table form:
ids vals
aball 1
bball 2
cnut 3
fball 4
How do I filter the rows whose ids contain the keyword "ball"? For example, the output should be:
ids vals
aball 1
bball 2
fball 4
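One way to get this result, offered as a minimal sketch rather than the only approach, is to build a boolean mask with `Series.str.contains` and index the DataFrame with it. The names `mask` and `filtered` are made up for illustration; `regex=False` is used on the assumption that "ball" is meant as a literal substring and not a pattern.

```python
import pandas as pd

df = pd.DataFrame({'vals': [1, 2, 3, 4],
                   'ids': ['aball', 'bball', 'cnut', 'fball']})

# Boolean mask: True where the 'ids' string contains the substring 'ball'.
# regex=False treats 'ball' as a literal substring, not a regular expression.
mask = df['ids'].str.contains('ball', regex=False)

# Keep only the rows where the mask is True.
filtered = df[mask]
print(filtered)
```

If the column may contain missing values, passing `na=False` to `str.contains` makes those rows count as non-matches.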
I am searching for a substring, or several substrings, in a DataFrame with 4 million rows.
df[df.col.str.contains('Donald',case=True,na=False)]
or
df[df.col.str.contains('Donald|Trump|Dump',case=True,na=False)]
The DataFrame (df) looks like this (with 4 million string rows):
df = pd.DataFrame({'col': ["very definition of the American success story, continually setting the standards of excellence in business, real estate and entertainment.",
"The myriad vulgarities of Donald Trump—examples of which are retailed daily on Web sites and front pages these days—are not news to those of us who have",
"While a fearful nation watched the terrorists attack again, striking the cafés of Paris and the conference rooms of San Bernardino"]})
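Are there any tricks to make this string search faster? For example, sorting the DataFrame first, some form of indexing, changing the column names to numbers, dropping "na=False" from the query, and so on? Even a speedup of a few milliseconds would be very helpful!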
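Two variations that may be worth timing are sketched below. Neither is guaranteed to be faster on a given dataset or pandas version. The first passes `regex=False` so that a single keyword is matched as a literal substring and not compiled as a pattern. The second replaces the `.str` accessor with Python's `in` operator inside a list comprehension. The helper `contains_any`, the `keywords` tuple, and the shortened sample rows are hypothetical and only serve the illustration.

```python
import pandas as pd

df = pd.DataFrame({'col': [
    "continually setting the standards of excellence in business",
    "The myriad vulgarities of Donald Trump are not news",
    None,  # a missing value, to mirror the na=False case
]})

# Variation 1: literal substring match for a single keyword.
# regex=False asks pandas not to interpret 'Donald' as a regular expression.
mask_literal = df['col'].str.contains('Donald', case=True, na=False, regex=False)

# Variation 2: plain Python substring test over the column values.
def contains_any(text, keywords):
    # Non-string values (None/NaN) count as non-matches, like na=False.
    return isinstance(text, str) and any(k in text for k in keywords)

keywords = ('Donald', 'Trump', 'Dump')
mask_any = [contains_any(s, keywords) for s in df['col']]

result_literal = df[mask_literal]
result_any = df[mask_any]
```

Measuring each variant against the original `str.contains` call on a representative sample (for instance with the standard `timeit` module) is the only reliable way to tell which one helps for this data.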