Tag: python-newspaper

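How can I use the newspaper library to parse only a specific category of a website?

I am using Python 3 with the newspaper library. The library can create a Source object, which is an abstraction of a news website. But what if I only need an abstraction of a single category?

For example, when I use a category URL, I want to get all articles in the 'technology' category. Instead, I get articles from 'politics'.

I think that when creating a Source object, newspaper only uses the domain name, which in my case is www.kyivpost.com.

Is there a way to make it work with URLs like http://www.kyivpost.com/technology/?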
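One possible workaround (a sketch, not a confirmed answer) is to build the Source for the whole site and then keep only the article URLs whose path starts with the wanted category. This assumes the site puts the category first in the URL path, as in http://www.kyivpost.com/technology/. The helper names `filter_by_category` and `category_urls` are hypothetical, and the two sample article URLs are made up for illustration.

```python
from urllib.parse import urlparse


def filter_by_category(urls, category):
    """Keep only URLs whose first path segment equals `category`."""
    wanted = category.strip("/").lower()
    kept = []
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        if segments and segments[0].lower() == wanted:
            kept.append(url)
    return kept


def category_urls(site_url, category):
    """Build a newspaper Source and return its article URLs for one category."""
    import newspaper  # third-party, imported lazily: pip install newspaper3k

    source = newspaper.build(site_url, memoize_articles=False)
    return filter_by_category([a.url for a in source.articles], category)


# Made-up sample URLs, only to show the filtering step without network access.
sample = [
    "http://www.kyivpost.com/technology/some-article.html",
    "http://www.kyivpost.com/politics/another-article.html",
]
print(filter_by_category(sample, "technology"))
```

With a live site the call would look like `category_urls("http://www.kyivpost.com", "technology")`; whether the Source actually discovers the technology articles still depends on what newspaper crawls.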

python parsing web-scraping python-3.x python-newspaper

8 votes, 1 answer, 550 views

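How do I fix a Newspaper3k 403 Client Error for certain URLs?

I am trying to use a combination of the googlesearch and newspaper3k Python packages to get a list of articles. When calling article.parse(), I end up with an error: newspaper.article.ArticleException: Article `download()` failed with 403 Client Error: Forbidden for url: https://www.newsweek.com/donald-trump-hillary-clinton-2020-rally-orlando-1444697 on URL https://www.newsweek.com/donald-trump-hillary-clinton-2020-rally-orlando-1444697

I have tried running the script as administrator, and the link works when opened directly in a browser.

Here is my code: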

import googlesearch
from newspaper import Article

query = "trump"
urlList = []

for j in googlesearch.search_news(query, tld="com", num=500, stop=200, pause=.01):
    urlList.append(j)

print(urlList)

articleList = []

for url in urlList:
    article = Article(url)
    article.download()
    article.parse()  # raises ArticleException if the download failed
    articleList.append(article.text)
    print(article.text)

Here is my full error output:

Traceback (most recent call last):
  File "C:/Users/andre/PycharmProjects/StockBot/WebCrawlerTest.py", line 31, in <module>
    article.parse()
  File "C:\Users\andre\AppData\Local\Programs\Python\Python37\lib\site-packages\newspaper\article.py", line 191, in parse
    self.throw_if_not_downloaded_verbose()
  File "C:\Users\andre\AppData\Local\Programs\Python\Python37\lib\site-packages\newspaper\article.py", line …
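Some sites reject requests that do not look like they come from a browser, which would fit a link that opens in a browser but returns 403 to the script. A minimal sketch, assuming that is the cause here: pass a newspaper `Config` with a browser-like `browser_user_agent`, and skip URLs that still fail instead of letting one exception stop the loop. The user-agent string is only an example value, and `make_article` and `collect_texts` are hypothetical helper names.

```python
# Example value only; any current browser user-agent string could be used.
DEFAULT_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def make_article(url, user_agent=DEFAULT_UA, timeout=10):
    """Create an Article that downloads with a browser-like user agent."""
    from newspaper import Article, Config  # third-party: pip install newspaper3k

    config = Config()
    config.browser_user_agent = user_agent
    config.request_timeout = timeout
    return Article(url, config=config)


def collect_texts(urls, make=make_article):
    """Return (texts, failures); failed URLs are recorded, not raised."""
    texts, failures = [], []
    for url in urls:
        try:
            article = make(url)
            article.download()
            article.parse()  # newspaper raises ArticleException here if the download failed
        except Exception as exc:
            failures.append((url, str(exc)))
            continue
        texts.append(article.text)
    return texts, failures
```

In the script from the question this would replace the second loop: `articleList, failed = collect_texts(urlList)`. A different user agent is not guaranteed to help if the site blocks automated clients in other ways.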

python url screen-scraping web python-newspaper

8 votes, 1 answer, 2274 views

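How can I use the Newspaper3k library without downloading the article?

Suppose I have local copies of news articles. How can I run newspaper on them? According to the documentation, normal use of the newspaper library looks like this: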

from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.parse()
# ...

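In my case, I do not need to download the article from the web because I already have a local copy of the page. How can I use newspaper on a local copy of a web page?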
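A minimal sketch of one way to do this, assuming a newspaper3k version whose `Article.download()` accepts an `input_html` argument: read the saved page from disk and hand the HTML to `download()` so no network request is made. The helper names `read_local_html` and `parse_local_article` are hypothetical, and the file path and original URL are supplied by the caller.

```python
from pathlib import Path


def read_local_html(path, encoding="utf-8"):
    """Read a saved web page from disk and return its HTML as text."""
    return Path(path).read_text(encoding=encoding)


def parse_local_article(path, original_url):
    """Parse a saved page with newspaper without fetching it again."""
    from newspaper import Article  # third-party: pip install newspaper3k

    article = Article(original_url)
    # Supplying input_html makes download() use the given HTML instead of the network.
    article.download(input_html=read_local_html(path))
    article.parse()
    return article
```

After `article = parse_local_article("saved_page.html", url)`, attributes such as `article.title` and `article.text` can be read as usual; the file name here is only a placeholder.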

python python-newspaper

7 votes, 2 answers, 2500 views

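publish_date in the newspaper library always returns None

I have been using the newspaper library recently. The only problem I have found is that article.publish_date always returns None.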

from newspaper import Article


class NewsArticle:
    def __init__(self, url):
        # Download, parse and run NLP once so every getter can read the results.
        self.article = Article(url)
        self.article.download()
        self.article.parse()
        self.article.nlp()

    def getKeywords(self):
        x = self.article.keywords
        for i in range(0, len(x)):
            x[i] = x[i].encode('ascii', 'ignore')
        return x

    def getSummary(self):
        return self.article.summary.encode('ascii', 'ignore')

    def getAuthors(self):
        x = self.article.authors
        for i in range(0, len(x)):
            x[i] = x[i].encode('ascii', 'ignore')
        return x

    def thumbnail_url(self):
        return self.article.top_image.encode('ascii', 'ignore')

    def date_made(self):
        print(self.article.publish_date)  # this is where None shows up
        return self.article.publish_date

    def get_videos(self):
        x = self.article.movies
        for i in range(0, len(x)):
            x[i] = x[i].encode('ascii', 'ignore')
        return x

    def get_title(self): …
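When publish_date comes back as None, one fallback is to look for a date in the article URL itself. The sketch below is not part of newspaper; `date_from_url` and `publish_date_or_fallback` are hypothetical helpers, and they only cover URLs that embed a /YYYY/MM/DD/ segment, such as http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/.

```python
import re
from datetime import datetime

# Matches a /YYYY/MM/DD segment followed by a slash or the end of the URL.
_DATE_IN_URL = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def date_from_url(url):
    """Return a datetime parsed from a /YYYY/MM/DD/ URL segment, or None."""
    match = _DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        return datetime(year, month, day)
    except ValueError:  # e.g. month 13 or day 40
        return None


def publish_date_or_fallback(article):
    """Prefer the library's publish_date; fall back to the URL if it is None."""
    return article.publish_date or date_from_url(article.url)
```

Inside the class from the question, `date_made` could return `publish_date_or_fallback(self.article)`. For sites that do not put the date in the URL this still returns None.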

python datetime python-newspaper

5 votes, 1 answer, 685 views

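ImportError: No module named newspaper

I am trying to build a Python program that displays headlines from several news websites. I installed the newspaper module with pip, but when I run the program I get this error:

ImportError: No module named newspaper

Any ideas on how to fix this?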
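A common cause of this error is that pip installed the package for a different Python interpreter than the one running the program. The small check below is a diagnostic sketch, not a fix: it prints which interpreter is running and whether that interpreter can find the module. `module_available` is a hypothetical helper name.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the running interpreter can locate a top-level module."""
    return importlib.util.find_spec(name) is not None


print("Interpreter:", sys.executable)
print("newspaper importable:", module_available("newspaper"))
```

If the second line prints False, installing with the same interpreter (for example `python -m pip install ...` using the path printed above) keeps pip and the program in sync.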

python python-newspaper

4 votes, 1 answer, about 10,000 views

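ImportError when installing newspaper

I am new to Python and am trying to import newspaper for article extraction. Whenever I try to import the module, I get ImportError: cannot import name images. Has anyone run into this problem and found a solution?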

python importerror python-newspaper

3 votes, 1 answer, 1190 views

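Python package (newspaper) installation error

I am trying to install a package, and it fails with the error below. I googled the problem and installed setuptools, but I still get the same error.

Command: pip install newspaper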

Collecting nltk==2.0.5 (from newspaper)
  Using cached nltk-2.0.5.tar.gz
    Complete output from command python setup.py egg_info:
    Downloading http://pypi.python.org/packages/source/d/distribute/distribute-0.6.21.tar.gz
    Extracting in C:\Users\pratik\AppData\Local\Temp\tmp0mun48pu
    Traceback (most recent call last):
      File "c:\users\pratik\appdata\local\temp\pip-build-6gyje7fp\nltk\distribute_setup.py", line 143, in use_setuptools
        raise ImportError
    ImportError

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\pratik\AppData\Local\Temp\pip-build-6gyje7fp\nltk\setup.py", line 23, in <module>
        distribute_setup.use_setuptools()
      File "c:\users\pratik\appdata\local\temp\pip-build-6gyje7fp\nltk\distribute_setup.py", line 145, in use_setuptools
        return _do_download(version, download_base, to_dir, download_delay)
      File …
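The "During handling of the above exception" line in the traceback only appears on Python 3, while the pinned nltk==2.0.5 belongs to the Python 2 `newspaper` package. The project publishes its Python 3 release under the name newspaper3k, so a likely fix is to install that package instead. The sketch below just builds the matching pip command for the running interpreter; `newspaper_package_name` and `pip_install_command` are hypothetical helpers.

```python
import sys


def newspaper_package_name(major_version):
    """Pick the PyPI package name that matches the Python major version."""
    return "newspaper3k" if major_version >= 3 else "newspaper"


def pip_install_command(python_executable, major_version):
    """Build a pip command that installs into the given interpreter."""
    package = newspaper_package_name(major_version)
    return [python_executable, "-m", "pip", "install", package]


print(" ".join(pip_install_command(sys.executable, sys.version_info.major)))
```

The printed command can be pasted into a terminal, or the list can be passed to `subprocess.run` if the install should happen from a script.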

python python-newspaper

2 votes, 1 answer, 5877 views

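ImportError: No module named '_sqlite3': is the underscore relevant?

I am using Python 3.4, which I recently upgraded to from Python 3.3.2.

I am following these instructions for installing newspaper, a Python library/tool: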

https://github.com/codelucas/newspaper

After running this command I get an error:

curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

Note: I also specified python3.4 in the command above and got the same output/error, shown below:

import sqlite3
  File "/usr/local/lib/python3.4/sqlite3/__init__.py", line 23, in <module>
    from sqlite3.dbapi2 import *
  File "/usr/local/lib/python3.4/sqlite3/dbapi2.py", line 27, in <module>
    from _sqlite3 import *
ImportError: No module named '_sqlite3'
[root@neil bin]# curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3.4
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   657  100   657    0     0    206      0  0:00:03  0:00:03 --:--:--   206
Traceback (most recent call last):
  File …
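On the underscore: `_sqlite3` is the compiled C extension that the standard-library `sqlite3` package wraps, so the error means the interpreter itself lacks that extension. This typically happens when Python was built from source without the SQLite development headers, which would fit an install under /usr/local. A quick check, as a sketch with the hypothetical helper name `sqlite_extension_available`:

```python
import importlib.util


def sqlite_extension_available():
    """Return True if this interpreter ships the _sqlite3 C extension."""
    return importlib.util.find_spec("_sqlite3") is not None


print("_sqlite3 available:", sqlite_extension_available())
```

If this prints False for python3.4, the usual remedy is to install the SQLite development package for the system and rebuild Python, rather than to change anything in newspaper.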

python sqlite python-3.x python-newspaper

2 votes, 1 answer, 2639 views

"No module named tldextract"

I tried the following code in Python:

from newspaper import Article

#A new article from BBC
url = "http://www.bbc.com/news/magazine-26935867"

#For different language newspaper refer above table
BBC_article = Article(url, language="en") # en for English

And I get the following error:

Traceback (most recent call last):
  File "news_paper_article.py", line 3, in <module>
    from newspaper import Article
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/newspaper/__init__.py", line 10, in <module>
    from .article import Article, ArticleException
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/newspaper/article.py", line 12, in <module>
    from . import images
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/newspaper/images.py", line 21, in <module>
    from . import urls
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/newspaper/urls.py", line 16, in <module>
    from tldextract import tldextract
ImportError: No module named tldextract

This may be a simple problem, but I am just getting started, so any help would be appreciated.

python python-2.7 python-newspaper

2 votes, 1 answer, 3767 views

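Web scraping with Python and the newspaper3k lib does not return data

I have installed the newspaper3k lib on my Mac with sudo pip3 install newspaper3k. I am using Python 3. I want to return the data supported by the Article object, i.e. URL, date, title, text, summary, and keywords, but I do not get any data: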

import newspaper
from newspaper import Article

#creating website for scraping
cnn_paper = newspaper.build('https://www.euronews.com/', memoize_articles=False)

#I have tried for https://www.euronews.com/, https://edition.cnn.com/, https://www.bbc.com/


for article in cnn_paper.articles:

    article_url = article.url #works

    news_article = Article(article_url)#works

    print("OBJECT:", news_article, '\n')#works
    print("URL:", article_url, '\n')#works
    print("DATE:", news_article.publish_date, '\n')#does not work
    print("TITLE:", news_article.title, '\n')#does not work
    print("TEXT:", news_article.text, '\n')#does not work
    print("SUMMARY:", news_article.summary, '\n')#does not work
    print("KEYWORDS:", news_article.keywords, '\n')#does not work
    print() …
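In the loop above, each new Article is created but never downloaded or parsed, which would explain the empty fields: newspaper fills `title`, `text`, and `publish_date` in `parse()`, and `summary` and `keywords` in `nlp()`. A minimal sketch of the missing steps, with `extract_fields` as a hypothetical helper name:

```python
def extract_fields(article, run_nlp=True):
    """Download and parse a newspaper Article, then return its main fields."""
    article.download()
    article.parse()
    if run_nlp:
        # nlp() fills summary and keywords; it needs the NLTK 'punkt' data installed.
        article.nlp()
    return {
        "url": article.url,
        "date": article.publish_date,
        "title": article.title,
        "text": article.text,
        "summary": article.summary,
        "keywords": article.keywords,
    }
```

In the loop from the question this would be called as `fields = extract_fields(Article(article.url))`, followed by printing the values in `fields`. The date can still be None for pages where newspaper finds no publication date.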

python web-scraping python-newspaper newspaper3k

2 votes, 1 answer, 4236 views