I'm using Python 3 and the newspaper library. Supposedly this library can create a Source object, which is an abstraction of a news site. But what if I only need an abstraction of a particular category?
For example, when I use this URL I want to get all the articles in the 'technology' category. Instead, I get articles from 'politics'.
I think that when creating the Source object, newspaper only uses the domain name, which in my case is www.kyivpost.com.
Is there a way to make it work with a URL like http://www.kyivpost.com/technology/?
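A sketch of one possible workaround, assuming the technology articles can be recognised by their URL path: build the Source from the category URL and then keep only the articles whose URL contains '/technology/' (that filter string is taken from the URL in the question, not something newspaper provides):

import newspaper

# Build from the category URL; newspaper may still crawl the whole domain,
# so the path filter below does the actual narrowing.
paper = newspaper.build('http://www.kyivpost.com/technology/', memoize_articles=False)

tech_articles = [a for a in paper.articles if '/technology/' in a.url]
for a in tech_articles:
    print(a.url)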
I'm trying to use a combination of the googlesearch and newspaper3k Python packages to get a list of articles. When I call article.parse(), I end up with an error: newspaper.article.ArticleException: Article `download()` failed with 403 Client Error: Forbidden for url: https://www.newsweek.com/donald-trump-hillary-clinton-2020-rally-orlando-1444697 on URL https://www.newsweek.com/donald-trump-hillary-clinton-2020-rally-orlando-1444697
I tried running the script as administrator, and the link works when opened directly in a browser.
Here is my code:
import googlesearch
from newspaper import Article

query = "trump"

urlList = []
for j in googlesearch.search_news(query, tld="com", num=500, stop=200, pause=.01):
    urlList.append(j)
print(urlList)

articleList = []
for i in urlList:
    article = Article(i)
    article.download()
    article.html
    article.parse()
    articleList.append(article.text)
    print(article.text)
Here is my full error output:
Traceback (most recent call last):
File "C:/Users/andre/PycharmProjects/StockBot/WebCrawlerTest.py", line 31, in <module>
article.parse()
File "C:\Users\andre\AppData\Local\Programs\Python\Python37\lib\site-packages\newspaper\article.py", line 191, in parse
self.throw_if_not_downloaded_verbose()
File "C:\Users\andre\AppData\Local\Programs\Python\Python37\lib\site-packages\newspaper\article.py", line …Run Code Online (Sandbox Code Playgroud) 假设我有新闻文章的本地副本。我怎样才能在这些文章上运行报纸?根据文档,报纸库的正常使用是这样的:
Suppose I have local copies of news articles. How can I run newspaper on those articles? According to the documentation, normal usage of the newspaper library looks like this:

from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.parse()
# ...
In my case, I don't need to download the article from the web, because I already have a local copy of the page. How can I use newspaper on a local copy of a web page?
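A sketch of one way to do this with a saved page, relying on download() in newspaper3k accepting already-fetched HTML via its input_html argument (the file path below is just an example); nothing is requested from the network:

from newspaper import Article

# Read the locally saved copy of the page.
with open('saved_article.html', 'r', encoding='utf-8') as f:
    html = f.read()

# The URL is only kept as metadata; the HTML supplied below is used instead of downloading.
article = Article('http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/')
article.download(input_html=html)
article.parse()
print(article.title)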
I've been using the newspaper library recently. The only problem I've found is that article.publish_date always gives me None.
from newspaper import Article

class NewsArticle:
    def __init__(self, url):
        self.article = Article(url)
        self.article.download()
        self.article.parse()
        self.article.nlp()

    def getKeywords(self):
        x = self.article.keywords
        for i in range(0, len(x)):
            x[i] = x[i].encode('ascii', 'ignore')
        return x

    def getSummary(self):
        return self.article.summary.encode('ascii', 'ignore')

    def getAuthors(self):
        x = self.article.authors
        for i in range(0, len(x)):
            x[i] = x[i].encode('ascii', 'ignore')
        return x

    def thumbnail_url(self):
        return self.article.top_image.encode('ascii', 'ignore')

    def date_made(self):
        print self.article.publish_date
        return self.article.publish_date

    def get_videos(self):
        x = self.article.movies
        for i in range(0, len(x)):
            x[i] = x[i].encode('ascii', 'ignore')
        return x

    def get_title(self): …
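When article.publish_date comes back as None, the date is often still present in the page's meta tags, which newspaper exposes as article.meta_data after parse(). A rough fallback sketch (the meta key names 'article:published_time' and 'date' are assumptions about the pages being scraped, not guaranteed newspaper behaviour):

from newspaper import Article

def get_publish_date(url):
    article = Article(url)
    article.download()
    article.parse()
    if article.publish_date:          # newspaper found a date itself
        return article.publish_date
    # Fallback: inspect the raw meta tags newspaper collected during parse().
    meta = article.meta_data
    published = meta.get('article', {}).get('published_time')
    return published or meta.get('date')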
I'm trying to build a Python program that will display various headlines from certain news sites. I installed the newspaper module using pip, but when I run the program I get the error:

ImportError: No module named newspaper

Any ideas on how to fix this?
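"No module named newspaper" usually means pip installed the package into a different Python than the one running the script. A quick diagnostic sketch; the idea is to install into exactly the interpreter printed here (note that on Python 3 the package to install is newspaper3k):

import sys

# The interpreter actually executing this script; run
#   <this path> -m pip install newspaper3k
# so the install and the script use the same Python.
print(sys.executable)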
I'm new to Python and I'm trying to import newspaper for article extraction. Whenever I try to import the module I get ImportError: cannot import name images. Has anyone run into this problem and found a solution?
I tried to install the package and it failed with the error below. I googled around and installed setuptools, but I still get the same error.

Command: pip install newspaper
Collecting nltk==2.0.5 (from newspaper)
Using cached nltk-2.0.5.tar.gz
Complete output from command python setup.py egg_info:
Downloading http://pypi.python.org/packages/source/d/distribute/distribute-0.6.21.tar.gz
Extracting in C:\Users\pratik\AppData\Local\Temp\tmp0mun48pu
Traceback (most recent call last):
File "c:\users\pratik\appdata\local\temp\pip-build-6gyje7fp\nltk\distribute_setup.py", line 143, in use_setuptools
raise ImportError
ImportError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\pratik\AppData\Local\Temp\pip-build-6gyje7fp\nltk\setup.py", line 23, in <module>
distribute_setup.use_setuptools()
File "c:\users\pratik\appdata\local\temp\pip-build-6gyje7fp\nltk\distribute_setup.py", line 145, in use_setuptools
return _do_download(version, download_base, to_dir, download_delay)
File …
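A likely cause, judging by the "During handling of the above exception" line in the traceback (which only Python 3 prints): the PyPI package named newspaper targets Python 2, and its pinned nltk==2.0.5 dependency does not build under Python 3. The Python 3 port is published under a different name, so a suggestion worth trying instead of the failing command:

pip install newspaper3k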
I'm using Python 3.4, which I recently upgraded to from Python 3.3.2.

I'm following these instructions on how to install newspaper, which is a Python library/tool:
https://github.com/codelucas/newspaper
I ran into an error after executing this command:
curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3
Note: I also specified python3.4 in the command above and I get the same output/error as below:
import sqlite3
File "/usr/local/lib/python3.4/sqlite3/__init__.py", line 23, in <module>
from sqlite3.dbapi2 import *
File "/usr/local/lib/python3.4/sqlite3/dbapi2.py", line 27, in <module>
from _sqlite3 import *
ImportError: No module named '_sqlite3'
[root@neil bin]# curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3.4
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 657 100 657 0 0 206 0 0:00:03 0:00:03 --:--:-- 206
Traceback (most recent call last):
File …
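_sqlite3 is a C extension that is compiled when Python itself is built, so a /usr/local Python 3.4 that was configured without the SQLite headers cannot import it afterwards. A possible fix, assuming Python 3.4 was built from source (which the /usr/local paths suggest; the package names below depend on the distribution): install the SQLite development package, rebuild and reinstall Python 3.4 from its source directory, then rerun the corpora download.

yum install sqlite-devel                  # or: apt-get install libsqlite3-dev
./configure && make && make altinstall    # run from the Python 3.4 source directory
curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3.4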
I tried the following code in Python:

from newspaper import Article

# A new article from BBC
url = "http://www.bbc.com/news/magazine-26935867"

# For a different-language newspaper, refer to the table above
BBC_article = Article(url, language="en")  # en for English
And I get the following error:
Traceback (most recent call last):
  File "news_paper_article.py", line 3, in <module>
    from newspaper import Article
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/newspaper/__init__.py", line 10, in <module>
    from .article import Article, ArticleException
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/newspaper/article.py", line 12, in <module>
    from . import images
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/newspaper/images.py", line 21, in <module>
    from . import urls
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/newspaper/urls.py", line 16, in <module>
    from tldextract import tldextract
ImportError: No module named tldextract
This is probably a simple question, but I'm just getting started, and any help would be appreciated.
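The traceback bottoms out at from tldextract import tldextract inside newspaper/urls.py, so the tldextract dependency simply isn't installed for this Python 2.7. A first thing to try (it may not be the only dependency missing):

pip install tldextract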
I have installed newspaper3k on my Mac with sudo pip3 install newspaper3k. I'm using Python 3. I want to get the data that the Article object supports, i.e. url, date, title, text, summary, and keywords, but I'm not getting any data:
import newspaper
from newspaper import Article

# creating website for scraping
cnn_paper = newspaper.build('https://www.euronews.com/', memoize_articles=False)
# I have tried https://www.euronews.com/, https://edition.cnn.com/, https://www.bbc.com/

for article in cnn_paper.articles:
    article_url = article.url  # works
    news_article = Article(article_url)  # works
    print("OBJECT:", news_article, '\n')  # works
    print("URL:", article_url, '\n')  # works
    print("DATE:", news_article.publish_date, '\n')  # does not work
    print("TITLE:", news_article.title, '\n')  # does not work
    print("TEXT:", news_article.text, '\n')  # does not work
    print("SUMMARY:", news_article.summary, '\n')  # does not work
    print("KEYWORDS:", news_article.keywords, '\n')  # does not work
    print() …
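Those fields stay empty because newspaper.build() only collects article URLs; each Article still has to be downloaded and parsed (and have nlp() run, for summary and keywords) before its attributes are filled in. A sketch of the same loop with those calls added, keeping the question's variable names:

for article in cnn_paper.articles:
    news_article = Article(article.url)
    news_article.download()
    news_article.parse()
    news_article.nlp()   # required for summary and keywords
    print("URL:", news_article.url)
    print("DATE:", news_article.publish_date)
    print("TITLE:", news_article.title)
    print("TEXT:", news_article.text)
    print("SUMMARY:", news_article.summary)
    print("KEYWORDS:", news_article.keywords)
    print()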
python-newspaper ×10
python-3.x ×2
web-scraping ×2
datetime ×1
importerror ×1
newspaper3k ×1
parsing ×1
python-2.7 ×1
sqlite ×1
url ×1
web ×1