Removing the Doctype from HTML with BeautifulSoup4?

Question

art*_*son 5 python beautifulsoup python-3.x

I'm new to Python and BeautifulSoup, so please bear with me...

I'm trying to figure out how to remove the Doctype from an HTML file using BeautifulSoup4, but I can't work out exactly how to do it.

def saveToText(self):
    filename = os.path.join(self.parent.ReportPath, str(self.parent.CharName.text()) + "_report.txt")
    filename, filters = QFileDialog.getSaveFileName(self, "Save Report", filename, "Text (*.txt);;All Files (*.*)")

    if filename is not None and str(filename) != '':

        try:
            if re.compile(r'\.txt$').search(str(filename)) is None:  # raw string avoids an invalid '\.' escape
                filename = str(filename)
                filename += '.txt'

            soup = BeautifulSoup(self.reportHtml, "lxml")

            try:  # THROWS AttributeError IF NOT FOUND ..
                soup.find('font').extract()
            except AttributeError:
                pass

            try:  # THROWS AttributeError IF NOT FOUND ..
                soup.find('head').extract()

            except AttributeError:
                pass

            soup.html.unwrap()
            soup.body.unwrap()

            for b in soup.find_all('b'):
                b.unwrap()

            for table in soup.find_all('table'):
                table.unwrap()

            for td in soup.find_all('td'):
                td.unwrap()

            for br in soup.find_all('br'):
                br.replace_with('\n')

            for center in soup.find_all('center'):
                center.insert_after('\n')

            for dl in soup.find_all('dl'):
                dl.insert_after('\n')

            for dt in soup.find_all('dt'):
                dt.insert_after('\n')

            for hr in soup.find_all('hr'):
                hr.replace_with(('-' * 80) + '\n')

            for tr in soup.find_all('tr'):
                tr.insert_before('  ')
                tr.insert_after('\n')

            print(soup)

        except IOError:
            QMessageBox.critical(None, 'Error!', 'Error writing to file: ' + filename, 'OK')

I tried using:

from bs4 import Doctype

if isinstance(e, Doctype):
    e.extract()

But this complains that "e" is an unresolved reference. I've searched the documentation and Google, but haven't found anything useful.

By the way, is there a way to shorten this code?
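
As a rough sketch of one way the tag-handling part of the method above might be condensed: the repeated loops can be driven by lists of tag names, since find_all() accepts a list. The names html_to_text, UNWRAP and NEWLINE_AFTER are hypothetical, the GUI and file handling are left out, and the built-in "html.parser" is used here instead of "lxml" purely as an assumption to keep the example dependency-light.

```python
from bs4 import BeautifulSoup

# Hypothetical groupings of the tags handled in the original method.
UNWRAP = ["html", "body", "b", "table", "td"]   # drop the tag, keep its contents
NEWLINE_AFTER = ["center", "dl", "dt"]          # append a newline after the tag


def html_to_text(report_html, parser="html.parser"):
    """Apply the same clean-up steps as saveToText, without the GUI code."""
    soup = BeautifulSoup(report_html, parser)

    # Remove the first <font> and <head>, if present (no try/except needed).
    for name in ("font", "head"):
        tag = soup.find(name)
        if tag is not None:
            tag.extract()

    # find_all() accepts a list of names, so one loop covers several tags.
    for tag in soup.find_all(UNWRAP):
        tag.unwrap()

    for br in soup.find_all("br"):
        br.replace_with("\n")

    for tag in soup.find_all(NEWLINE_AFTER):
        tag.insert_after("\n")

    for hr in soup.find_all("hr"):
        hr.replace_with("-" * 80 + "\n")

    for tr in soup.find_all("tr"):
        tr.insert_before("  ")
        tr.insert_after("\n")

    return str(soup)


sample = ("<html><head><title>t</title></head>"
          "<body><b>Hi</b><br><hr><center>c</center></body></html>")
out = html_to_text(sample)
print(out)
```

Checking the result of find() against None replaces the try/except AttributeError blocks, and a missing <html> or <body> is simply skipped instead of raising.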

Answer 1

art*_*son 3

This seems to solve the problem perfectly.

from bs4 import BeautifulSoup, Doctype

for item in list(soup.contents):  # iterate over a copy, since extract() mutates soup.contents
    if isinstance(item, Doctype):
        item.extract()
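
For context, the earlier attempt failed because "e" was never defined: the isinstance check has to run on each top-level node of the parsed document. A minimal, self-contained sketch of the loop above follows. The sample HTML string and the helper name strip_doctype are made up for illustration, and "html.parser" is assumed here rather than "lxml".

```python
from bs4 import BeautifulSoup, Doctype


def strip_doctype(soup):
    """Remove any Doctype node from the top level of the document, in place."""
    # Iterate over a copy because extract() modifies soup.contents.
    for item in list(soup.contents):
        if isinstance(item, Doctype):
            item.extract()
    return soup


# Made-up sample input for illustration only.
html = ("<!DOCTYPE html><html><head><title>t</title></head>"
        "<body><p>Hello</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

strip_doctype(soup)
result = str(soup)
print(result)
```

The same loop can be dropped into saveToText right after the BeautifulSoup(...) call, before the other tags are unwrapped.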

Archived:	8 years, 3 months ago
Views:	1677
Last recorded:	8 years, 3 months ago