Remove ads from web page code

Question

Nik*_*pov 0 beautifulsoup adblock web-scraping python-3.x mechanicalsoup

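I have a list of ad-blocking rules (example).
How can I apply them to a web page? I download the page source with MechanicalSoup (which is built on BeautifulSoup). I would like to keep the result in bs format, but an etree is fine too.
I tried the following code, but it fails on some pages: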
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

Answer 1

Jat*_*mir 5

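Almost the same code as in Nikita's answer, but I wanted to share it with all the imports and without the mechanicalsoup dependency, for anyone who wants to try it out.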

from lxml.etree import tostring
import lxml.html
import requests

# take AdRemover code from here:
# https://github.com/buriy/python-readability/issues/43#issuecomment-321174825
from adremover import AdRemover

url = 'https://google.com'  # replace with the URL you want to apply the rules to
rule_urls = ['https://easylist-downloads.adblockplus.org/ruadlist+easylist.txt',
             'https://filters.adtidy.org/extension/chromium/filters/1.txt']

# local file names: the last path component of each rule URL
rule_files = [rule_url.rpartition('/')[-1] for rule_url in rule_urls]


# download files containing rules
for rule_url, rule_file in zip(rule_urls, rule_files):
    r = requests.get(rule_url)
    with open(rule_file, 'w') as f:
        print(r.text, file=f)


remover = AdRemover(*rule_files)

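# Pass bytes (.content), not str (.text): lxml raises
# "ValueError: Unicode strings with encoding declaration are not supported"
# when a str begins with an XML declaration that names an encoding.
html = requests.get(url).content
document = lxml.html.document_fromstring(html)
remover.remove_ads(document)
clean_html = tostring(document).decode("utf-8")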
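
If the page is only available as a str (for example `str(soup)` coming out of MechanicalSoup), the error message itself points at another way out: drop the XML declaration before handing the text to lxml. A minimal sketch, assuming the declaration sits at the very start of the document; `strip_xml_declaration` is a hypothetical helper name, not part of lxml or BeautifulSoup.

```python
import re

# Matches a leading XML declaration such as
# <?xml version="1.0" encoding="utf-8"?>
_XML_DECLARATION = re.compile(r'^\s*<\?xml\b[^>]*\?>', re.IGNORECASE)


def strip_xml_declaration(markup: str) -> str:
    """Return markup without a leading XML declaration (hypothetical helper)."""
    return _XML_DECLARATION.sub('', markup, count=1)


page = ('<?xml version="1.0" encoding="utf-8"?>\n'
        '<html><body><p>Hello</p></body></html>')
cleaned = strip_xml_declaration(page)
print(cleaned.lstrip()[:6])
```

The cleaned string can then go to `lxml.html.document_fromstring` in place of the raw text. Passing bytes is usually the simpler option when the raw response is still at hand.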

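
For anyone wondering what the rule files contain: Adblock-style lists mix network rules (such as `||ads.example.com^`) with element-hiding rules, which are CSS selectors prefixed by `##`. Only the latter can be applied to HTML that has already been downloaded. Below is a rough sketch of pulling the generic ones (no domain prefix) out of a list. The sample rules are made up, `generic_hiding_selectors` is a hypothetical name, and this is not necessarily how AdRemover reads its files.

```python
def generic_hiding_selectors(lines):
    """Yield CSS selectors from generic element-hiding rules ('##selector')."""
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments ('!') and the '[Adblock Plus x.y]' header.
        if not line or line.startswith(('!', '[')):
            continue
        # Generic hiding rules have nothing before '##'; rules such as
        # 'example.com##.ad' are domain-specific and are skipped here.
        if line.startswith('##'):
            yield line[2:]


sample = """\
[Adblock Plus 2.0]
! Title: sample list
||ads.example.com^
##.ad-banner
example.com##.sponsored
###sidebar-ads
@@||example.com/ads.js
"""

selectors = list(generic_hiding_selectors(sample.splitlines()))
print(selectors)  # ['.ad-banner', '#sidebar-ads']
```

To stay in bs format, as the question asks, each selector could be passed to BeautifulSoup's `select()` and the matched tags removed with `decompose()`. Some list entries use selector syntax that a given CSS engine may reject, so wrapping each call in a try/except is probably wise.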
