Aar*_*lli 41 python beautifulsoup web-scraping
I am trying to convert a block of HTML to text using Python.
Input:
<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>
Desired output:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
I tried the html2text module without much success (I am very new to Python :))
Here is what I tried:
#!/usr/bin/env python
import urllib2
import html2text
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://example.com/page.html').read())
txt = soup.find('div', {'class' : 'body'})
print html2text.html2text(txt)
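The "txt" object holds the HTML block shown above. I would like to convert it to text and print it on the screen.
Any help with this code would be much appreciated.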
roo*_*oot 59
Am I missing something? soup.get_text() gives exactly the output you want:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")  # html holds the markup from the question; naming the parser avoids a bs4 warning
print(soup.get_text())
Output:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
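P.S. To be exact, you can pass a newline as the separator so that the line breaks are doubled; the output then looks like your example :)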
print(soup.get_text('\n'))
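One caveat with get_text('\n') is that the separator is inserted between every text node, so inline elements such as the link end up on their own line. A minimal sketch of a per-paragraph variant follows, assuming bs4 is installed; the function name paragraphs_to_text and the shortened sample markup are made up for illustration, not taken from the answer.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the markup from the question.
html = """<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet.</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/">Some Link</a> Aenean massa</p></div>"""

def paragraphs_to_text(html):
    """Return one line of text per non-empty <p> element."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for p in soup.find_all("p"):
        # Join the text nodes inside one paragraph with a space and
        # strip surrounding whitespace from each node.
        line = p.get_text(" ", strip=True)
        if line:  # skip paragraphs that contain no text at all
            lines.append(line)
    return "\n".join(lines)

print(paragraphs_to_text(html))
```

Joining with a space keeps "Some Link" inline, but it would also put a space before punctuation that directly follows a tag, so treat this as a starting point rather than a general solution.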
And*_*eas 10
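The main problem is how to keep some basic formatting. Here is my own minimal approach, which keeps new lines and bullets. I am sure it does not cover everything you might want to keep, but it is a starting point: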
from bs4 import BeautifulSoup

def parse_html(html):
    elem = BeautifulSoup(html, features="html.parser")
    text = ''
    for e in elem.descendants:
        if isinstance(e, str):
            text += e.strip()
        elif e.name in ['br', 'p', 'h1', 'h2', 'h3', 'h4', 'tr', 'th']:
            text += '\n'
        elif e.name == 'li':
            text += '\n- '
    return text
The above adds a new line for the elements 'br', 'p', 'h1', 'h2', 'h3', 'h4', 'tr', 'th', and a new line followed by "- " in front of the text of each li element.
Passing '\n' as the separator puts a newline between the paragraphs.
from bs4 import BeautifulSoup
soup = BeautifulSoup(text, "html.parser")  # text holds the HTML to convert
print(soup.get_text('\n'))
You can use a regular expression, but it is not recommended. The following code removes all HTML tags from the data and leaves you with the text:
import re
data = """<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>"""
data = re.sub(r'<.*?>', '', data)
print(data)
Output:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
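To illustrate why the regex route is not recommended, here is a small sketch using only the standard library. The helper name strip_tags_naive and the tricky sample string are made up for this example; the pattern is the same one used above.

```python
import re
from html import unescape

def strip_tags_naive(data):
    """Remove anything that looks like a tag, exactly as in the snippet above."""
    return re.sub(r'<.*?>', '', data)

# Hypothetical input with a '>' inside an attribute, an entity and a script.
tricky = '<p title="a > b">Fish &amp; chips</p><script>var x = 1;</script>'

result = strip_tags_naive(tricky)
print(repr(result))

# Entities are a separate problem; html.unescape decodes them.
print(unescape("Fish &amp; chips"))
```

The pattern stops at the first '>' it sees, so part of the attribute leaks into the text, the script body is kept, and entities such as &amp; stay encoded. A real HTML parser handles all three cases.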
You can use html.parser from the Python standard library:
from html.parser import HTMLParser

class HTMLFilter(HTMLParser):
    text = ""
    def handle_data(self, data):
        self.text += data

f = HTMLFilter()
f.feed(data)
print(f.text)
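I really like @FrBrGeorge's no-dependency answer, so I extended it to extract only the contents of the body tag and added a convenience method so that HTML to text is a one-liner: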
from abc import ABC
from html.parser import HTMLParser

class HTMLFilter(HTMLParser, ABC):
    """
    A simple no dependency HTML -> TEXT converter.
    Usage:
        str_output = HTMLFilter.convert_html_to_text(html_input)
    """
    def __init__(self, *args, **kwargs):
        self.text = ''
        self.in_body = False
        super().__init__(*args, **kwargs)

    def handle_starttag(self, tag: str, attrs):
        if tag.lower() == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag.lower() == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.text += data

    @classmethod
    def convert_html_to_text(cls, html: str) -> str:
        f = cls()
        f.feed(html)
        return f.text.strip()
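See the docstring for usage.
This converts all the text inside body, which in theory can include style and script tags. Further filtering can be achieved by extending the pattern shown for body, i.e. by setting instance variables such as in_style or in_script.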
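A rough sketch of that extension, using only the standard library. The class name BodyTextFilter, the skip_depth counter (used here in place of separate in_style and in_script flags) and the sample markup are choices made for this illustration, not part of the answer above.

```python
from html.parser import HTMLParser

class BodyTextFilter(HTMLParser):
    """Collect text inside <body>, ignoring <script> and <style> contents."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.in_body = False
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lower-cased.
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.parts.append(data)

    @classmethod
    def convert_html_to_text(cls, html):
        f = cls()
        f.feed(html)
        f.close()  # flush any buffered data
        return "".join(f.parts).strip()

# Hypothetical sample page.
sample = ("<html><head><title>T</title></head><body>"
          "<style>p {color: red}</style><p>Hello</p>\n"
          "<script>var x = 1;</script><p>World</p></body></html>")
text = BodyTextFilter.convert_html_to_text(sample)
print(text)
```

Collecting the pieces in a list and joining them once at the end avoids repeated string concatenation, but the behaviour is otherwise the same as the in_body pattern.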