How can I use the Newspaper3k library without downloading the article?

Flu*_*lux 7 python python-newspaper

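Suppose I have local copies of news articles. How can I run newspaper on them? According to the documentation, the normal usage of the newspaper library looks like this: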

from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.parse()
# ...

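In my case, I don't need to download the article from the web, because I already have a local copy of the page. How can I use newspaper on a local copy of a web page?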

C.N*_*ivs 6

You can, it's just a bit hacky. For example:

import requests
from newspaper import Article

url = 'https://www.cnn.com/2019/06/19/india/chennai-water-crisis-intl-hnk/index.html'

# get sample html
r = requests.get(url)

# save to file
with open('file.html', 'wb') as fh:
    fh.write(r.content)

a = Article(url)

# set html manually
with open("file.html", 'rb') as fh:
    a.html = fh.read()

# need to set download_state to 2 for this to work
a.download_state = 2

a.parse()

# Now the article should be populated
a.text

# 'New Delhi (CNN) The floor...'

download_state comes from this snippet in newspaper/article.py:

# /path/to/site-packages/newspaper/article.py
class ArticleDownloadState(object):
    NOT_STARTED = 0
    FAILED_RESPONSE = 1
    SUCCESS = 2

# ~snip~

# This is why you need to set that variable
class Article:
    def __init__(...):
        # ~snip~
        # Keep state for downloads and parsing
        self.is_parsed = False
        self.download_state = ArticleDownloadState.NOT_STARTED
        self.download_exception_msg = None

    def parse(self):
        # will throw exception if download_state isn't 2
        self.throw_if_not_downloaded_verbose()

        self.doc = self.config.get_parser().fromstring(self.html)

Alternatively, you can subclass Article and override its parse function so that it does the same thing.


Ris*_*Vij 5

There is in fact an official way to solve this problem.

Once the HTML is loaded in your program, you can use the set_html() method to set it as article.html:

import newspaper

# read the saved page as bytes
with open("file.html", 'rb') as fh:
    ht = fh.read()

# no real URL is needed, since nothing is downloaded
article = newspaper.Article(url=' ')
article.set_html(ht)
article.parse()