Getting clean strings from HTML, CSS and JavaScript

jxp*_*hon 5 python regex web-scraping python-3.x

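Currently, I am trying to scrape the 10-K filing text files on sec.gov.

Here is a sample text file:
https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt

The text document contains HTML tags, CSS styles and JavaScript. Ideally, I would like to keep only the content after removing all tags and styles.

First, I tried the obvious approach, BeautifulSoup's get_text(). That did not work.
Then I tried using a regular expression to remove everything between < and >. Unfortunately, that did not fully work either: it leaves some tags, styles and scripts behind.

Does anyone have a clean solution for what I am trying to achieve?

Here is my code so far: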

import requests
import re

url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt'
response = requests.get(url)
text = re.sub('<.*?>', '', response.text)
print(text)
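
A small stand-alone check of why the regex above leaves residue, as a sketch on a made-up snippet (the `sample` string is hypothetical, not taken from the filing). `<.*?>` removes only the tags themselves, so the text between `<script>` or `<style>` tags survives, and because `.` does not match a newline by default, a tag that spans several lines is not matched at all. The second half shows one possible regex-only workaround; it is still fragile on real-world HTML, so a parser is usually the safer choice.

```python
import re

# Hypothetical sample, not taken from the SEC filing.
sample = (
    '<style>td { color: red; }</style>\n'
    '<script>console.log("test");</script>\n'
    '<font\n style="font-family:Arial">Address</font>'
)

# The pattern from the question: script/style bodies and the multi-line tag remain.
stripped = re.sub(r'<.*?>', '', sample)
print(stripped)

# Possible workaround: drop whole <script>/<style> blocks first,
# then strip the remaining tags with DOTALL so "." also matches newlines.
no_blocks = re.sub(r'<(script|style)\b.*?</\1\s*>', '', sample,
                   flags=re.DOTALL | re.IGNORECASE)
text = re.sub(r'<.*?>', '', no_blocks, flags=re.DOTALL)
print(text.strip())
```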

Iva*_*aer 5

Let's set up a dummy string based on the example:

original_content = """
<script>console.log("test");</script>
<TD VALIGN="bottom" ALIGN="center"><FONT STYLE="font-family:Arial; ">(Address of principal executive offices)</FONT></TD>
"""

Now let's remove all the JavaScript.

# Note: in recent lxml releases (5.2 and later, as far as I know) this module
# is provided by the separate lxml_html_clean package.
from lxml.html.clean import Cleaner  # used to remove JavaScript

# Remove JavaScript (the other options are shown for the sake of example).

cleaner = Cleaner(
    comments=True,  # True = remove comments
    meta=True,      # True = remove meta tags
    scripts=True,   # True = remove script tags
    embedded=True,  # True = remove embedded content
)
clean_dom = cleaner.clean_html(original_content)

(From /sf/answers/3245984801/)

Then we can strip the HTML tags (extract the text) using HTMLParser:

from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser

# Strip HTML.

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()  # initialise the parser (this also calls reset())
        self.fed = []  # text fragments collected so far
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

text_content = strip_tags(clean_dom)

print(text_content)

(From: https://stackoverflow.com/a/925630/1204332)
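
If installing lxml is not an option, the two steps above (dropping scripts, then stripping tags) can be approximated with the standard library alone. The following is only a sketch under that assumption: `TextExtractor` and `extract_text` are names made up for this illustration, and only `<script>` and `<style>` are skipped.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <SCRIPT> is handled too.
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

    def get_text(self):
        return "".join(self.parts)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.get_text()


sample = """
<script>console.log("test");</script>
<TD VALIGN="bottom" ALIGN="center"><FONT STYLE="font-family:Arial; ">(Address of principal executive offices)</FONT></TD>
"""
print(extract_text(sample).strip())
```

The parser keeps whatever whitespace sits between the tags, so the result may still need `strip()` or further whitespace normalisation.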

Or we can get the text through the lxml library:

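from lxml.html import fromstring

# text_content() returns every text node, including the contents of <script> tags,
# so pass the cleaned markup (clean_dom above) if scripts should be excluded.
print(fromstring(original_content).text_content())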