python reddit praw
I want a script that runs in the background and fetches subreddit data roughly every hour. Since I don't want duplicate entries in my database, I'd like to filter the results by created_utc.
Here is what I have so far:
r = praw.Reddit(user_agent='soc')
submissions = r.get_subreddit('soccer').get_hot()
This is what I would like instead:
r = praw.Reddit(user_agent='soc')
submissions = r.get_subreddit('soccer').get_hot(created_utc > '2016-02-18 14:33:14.000')
What are some ways to achieve this?
Neither the SubReddit class nor the Reddit API offers the date-based filtering you're after, so here is one option:
Filter the results in Python before inserting them into your database. get_hot and get_new return generator objects, so you can use a list comprehension like this:
from datetime import datetime, timedelta

import praw

# assuming you run this script every hour
an_hour_ago = datetime.utcnow() - timedelta(hours=1)

r = praw.Reddit(user_agent='soc')
submissions = r.get_subreddit('soccer').get_new()
submissions_list = [
    # iterate through the submissions generator object
    x for x in submissions
    # add item if item.created_utc is newer than an hour ago
    if datetime.utcfromtimestamp(x.created_utc) >= an_hour_ago
]
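If, instead of "an hour ago", you want the exact cutoff from the question ('2016-02-18 14:33:14.000'), you can parse that string with strptime and use it in the same comparison. A minimal sketch (the timestamp value is just a made-up example of a submission's created_utc):

```python
from datetime import datetime

# parse the cutoff string from the question; %f handles the '.000' part
cutoff = datetime.strptime('2016-02-18 14:33:14.000', '%Y-%m-%d %H:%M:%S.%f')

ts = 1455806000  # hypothetical created_utc value of one submission
print(datetime.utcfromtimestamp(ts) >= cutoff)  # → True
```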
By default Reddit returns only 25 listings, so if you need more you will have to paginate:
limit = 100  # Reddit maximum limit
total_list = []

submissions = r.get_subreddit('soccer').get_new(limit=limit)
submissions_list = [
    x for x in submissions
    if datetime.utcfromtimestamp(x.created_utc) >= an_hour_ago
]
total_list += submissions_list

if len(submissions_list) == limit:
    submissions = r.get_subreddit('soccer').get_new(
        # get limit of items past the last item in the total list
        limit=limit, params={"after": total_list[-1].fullname}
    )
    submissions_list_2 = [
        # iterate through the submissions generator object
        x for x in submissions
        # add item if item.created_utc is newer than an hour ago
        if datetime.utcfromtimestamp(x.created_utc) >= an_hour_ago
    ]
    total_list += submissions_list_2

print(total_list)
If there are more than 200 submissions, you'll have to put this into a recursive function, like this one: subreddit_latest.py
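The idea behind that function can be sketched without touching the network. Below, fetch_page is a hypothetical stand-in for r.get_subreddit(...).get_new(limit=..., params={"after": ...}), serving slices of a fake newest-first feed; the loop keeps requesting the next page until a page contains an item older than the cutoff:

```python
# Fake newest-first feed: 250 items, one per minute, mimicking a subreddit listing.
FAKE_FEED = [{"fullname": "t3_%d" % i, "created_utc": 1_000_000 - i * 60}
             for i in range(250)]

def fetch_page(after=None, limit=100):
    """Stand-in for get_new(): return up to `limit` items past `after`."""
    start = 0
    if after is not None:
        start = next(i for i, s in enumerate(FAKE_FEED)
                     if s["fullname"] == after) + 1
    return FAKE_FEED[start:start + limit]

def get_since(cutoff_utc, limit=100):
    """Collect all submissions with created_utc >= cutoff_utc, paginating."""
    total, after = [], None
    while True:
        page = fetch_page(after=after, limit=limit)
        fresh = [s for s in page if s["created_utc"] >= cutoff_utc]
        total += fresh
        # a short page of fresh items means we crossed the cutoff (or ran out)
        if len(fresh) < limit:
            return total
        after = page[-1]["fullname"]

# cutoff 150 minutes before the newest item → items 0..150, i.e. 151 results
print(len(get_since(1_000_000 - 150 * 60)))  # → 151
```

A while loop like this does the same job as the recursive version without risking Python's recursion limit on very large result sets.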
Viewed 6079 times