Pandas: can't strip HTML tags from a DataFrame column

Question

I have a Pandas DataFrame with a column named text that contains HTML. I want to get just the text, i.e. strip the tags. I tried the following:

from bs4 import BeautifulSoup
result_df['text'] = BeautifulSoup(result_df['text']).get_text()

However, I end up with this error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

What am I doing wrong?

Thanks!

Answer 1

小智 6

Try this:

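from bs4 import BeautifulSoup
# BeautifulSoup parses one document per call, so handle each cell separately;
# naming the parser explicitly avoids bs4's "no parser specified" warning
result_df['text'] = [BeautifulSoup(text, 'html.parser').get_text() for text in result_df['text']]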
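
As for why the original attempt fails: BeautifulSoup expects a single string (or file-like object), not a whole column. My guess, and it is only a guess about bs4's internals, is that while inspecting its input it evaluates the Series in a boolean context, which pandas refuses to do for a multi-element Series. The sketch below reproduces that refusal directly and then applies the per-cell fix. The sample data is made up for illustration and stands in for your result_df.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample data standing in for the asker's result_df.
result_df = pd.DataFrame({
    "text": ["<p>Hello <b>world</b></p>", '<a href="nowhere.org">erowhon</a>'],
})

# pandas will not reduce a multi-element Series to a single True/False;
# this raises the same ValueError quoted in the question.
try:
    bool(result_df["text"])
except ValueError as exc:
    message = str(exc)

# Fix: hand BeautifulSoup one string at a time.
result_df["text"] = [
    BeautifulSoup(html, "html.parser").get_text() for html in result_df["text"]
]
```

After this runs, message holds the "truth value of a Series is ambiguous" text and the column contains plain strings.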

Answer 2

Bil*_*ell 5

You could also do this with apply, though I doubt it makes much difference.

>>> import pandas as pd
>>> data = {'a': ['<div><span>something</span></div>', '<a href="nowhere.org">erowhon</a>']}
>>> df = pd.DataFrame(data)
>>> df
                                   a
0  <div><span>something</span></div>
1  <a href="nowhere.org">erowhon</a>
>>> import bs4
>>> df['a'] = df['a'].apply(lambda x: bs4.BeautifulSoup(x, 'lxml').get_text())
>>> df
           a
0  something
1    erowhon
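
If the column might contain missing values, a small named helper can be easier to read than the lambda. This is only a sketch: strip_html is a name I made up, and I've swapped in the standard-library 'html.parser' backend so that lxml doesn't need to be installed. Passing non-string values through untouched is an assumption about what you'd want for NaN/None cells.

```python
import pandas as pd
from bs4 import BeautifulSoup

def strip_html(value):
    """Return the text of an HTML fragment; pass non-strings (e.g. NaN) through unchanged."""
    if not isinstance(value, str):
        return value
    return BeautifulSoup(value, "html.parser").get_text()

# Same sample rows as above, plus one missing value (added for illustration).
df = pd.DataFrame({
    "a": ["<div><span>something</span></div>", None, '<a href="nowhere.org">erowhon</a>'],
})
df["a"] = df["a"].apply(strip_html)
```

The missing cell stays missing instead of raising inside BeautifulSoup.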
