Posts by the*_*t_1

Get meta tag content property with BeautifulSoup and Python

I am trying to use Python and Beautiful Soup to extract the content part of the tags below:

<meta property="og:title" content="Super Fun Event 1" />
<meta property="og:url" content="http://superfunevents.com/events/super-fun-event-1/" />

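I'm using BeautifulSoup to load the page and find some other things (it also grabs the article id from the id tag hidden in the source), but I don't know the correct way to search the HTML and find these bits. I've tried variations of find and findAll to no avail. The code currently iterates over a list of URLs...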

#!/usr/bin/env python
# -*- coding: utf-8 -*-

#importing the libraries
from urllib import urlopen
from bs4 import BeautifulSoup

def get_data(page_no):
    webpage = urlopen('http://superfunevents.com/?p=' + str(page_no)).read()
    soup = BeautifulSoup(webpage, "lxml")
    for tag in soup.find_all("article") :
        id = tag.get('id')
        print id
    # the hard part that doesn't work - I know this example is well off the mark!
    title = soup.find("og:title", "content")
    print (title.get_text())
    url = soup.find("og:url", "content")
    print …
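
For reference, a minimal sketch of one way to read these values: og:title is the value of the tag's property attribute rather than a tag name, so the search is for a meta tag with that attribute, and the content attribute is then read from the match. The helper name get_meta_content is hypothetical, the markup is the sample from the question, and the built-in "html.parser" is used here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; a real page would come from urlopen(...).read().
html = """
<html><head>
<meta property="og:title" content="Super Fun Event 1" />
<meta property="og:url" content="http://superfunevents.com/events/super-fun-event-1/" />
</head><body></body></html>
"""


def get_meta_content(soup, prop):
    """Return the content attribute of <meta property=prop>, or None if absent.

    Hypothetical helper, not part of BeautifulSoup.
    """
    tag = soup.find("meta", attrs={"property": prop})
    return tag.get("content") if tag is not None else None


soup = BeautifulSoup(html, "html.parser")
title = get_meta_content(soup, "og:title")
url = get_meta_content(soup, "og:url")
print(title)
print(url)
```

Returning None for a missing tag avoids an AttributeError on pages that lack the meta tag, which matters when looping over many URLs.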


html python beautifulsoup web-scraping

the*_*t_1

2016 04-21

29
votes

3
answers

30k
views

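Getting the latest Twitter mentions from the API with Tweepy while avoiding rate limits

I previously had some nicely working Python that did automatic replies from a Tweepy stream listener, but since the Twitter API changes in August it no longer works.

I'm rebuilding it by fetching my most recent mention every 10 seconds (ideally the interval would be shorter, as I want near-instant replies) and checking whether it was posted within the last ten seconds... if it was, the script assumes it is a new tweet and replies.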

from tweepy import OAuthHandler
from tweepy import API
from datetime import datetime, time, timedelta

consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''
account_screen_name = ''
account_user_id = '897579556009332736'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
twitterApi = API(auth)

mentions = twitterApi.mentions_timeline(count=1)
now = datetime.now()

for mention in mentions:
    if now < (mention.created_at + timedelta(hours=1) + timedelta(seconds=10)):
        print "there's a mention in the last 10 seconds"
        # do magic reply stuff here!
    else:
        print "do …
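
As a rough sketch of the arithmetic behind the polling interval: if the mentions endpoint allows 75 requests per 15-minute window (an assumption here that should be checked against the current Twitter API documentation), the shortest safe interval works out as below. The helper names min_poll_interval and is_recent are hypothetical, and the timestamps are made-up stand-ins for mention.created_at and the current time, both assumed to be naive UTC datetimes so that no fixed one-hour offset is needed.

```python
from datetime import datetime, timedelta

# ASSUMPTION: 75 requests per 15-minute window; check the current API docs.
REQUESTS_PER_WINDOW = 75
WINDOW_SECONDS = 15 * 60


def min_poll_interval(requests_per_window=REQUESTS_PER_WINDOW,
                      window_seconds=WINDOW_SECONDS):
    """Shortest polling interval in seconds that stays inside the limit."""
    return float(window_seconds) / requests_per_window


def is_recent(created_at, now, max_age_seconds):
    """True if created_at is at most max_age_seconds before now.

    Both arguments are assumed to be naive datetimes in UTC.
    """
    age = now - created_at
    return timedelta(0) <= age <= timedelta(seconds=max_age_seconds)


interval = min_poll_interval()
print(interval)  # 12.0, so polling every 10 seconds would exceed the assumed limit

# Hypothetical timestamps standing in for mention.created_at and the current UTC time.
now = datetime(2017, 8, 19, 12, 0, 10)
fresh = is_recent(datetime(2017, 8, 19, 12, 0, 5), now, interval)
stale = is_recent(datetime(2017, 8, 19, 11, 59, 0), now, interval)
print(fresh, stale)
```

Tweepy's mentions_timeline also accepts a since_id argument, so remembering the highest mention id already handled may be a sturdier test for "new" than a time window, since it cannot miss or double-count a tweet when a poll runs late.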


python twitter rate-limiting tweepy python-2.7

the*_*t_1

6
votes

1
answer

533
views

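Run a Python process from a cronjob and check every minute that it is still running

I have a Python script that I want to run from a cronjob, then check every minute whether it is still running and restart it if it is not.

The cron job is: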

/usr/bin/python2.7 /home/mydir/public_html/myotherdir/script.py

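There is some information about this, but most answers don't lay out the whole process clearly, for example:

Using a cron job to check if a python script is running

In that case, for example, it doesn't say how to start the initial process and record the PID. Unfortunately, that leaves me with a lot of questions.

So, can anyone give me a simple guide on how to do this?

For example, the complete shell script needed, the command to start the script, and so on.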
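
One common pattern, sketched here in Python rather than shell, is a PID file: cron launches the script every minute, and the script exits immediately if the PID recorded by an earlier launch still belongs to a live process. The function names and the script.pid file name are hypothetical, os.kill(pid, 0) is POSIX-only, and the check-then-write sequence is not atomic, so this assumes cron is the only thing starting the script.

```python
import errno
import os
import tempfile

# Assumed crontab entry (runs every minute), using the path from the question:
# * * * * * /usr/bin/python2.7 /home/mydir/public_html/myotherdir/script.py


def pid_is_running(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence and permission
    except OSError as exc:
        return exc.errno == errno.EPERM  # exists, but owned by another user
    return True


def acquire_pidfile(path):
    """Record our PID and return True unless a live instance already holds the file."""
    try:
        with open(path) as f:
            old_pid = int(f.read().strip())
        if pid_is_running(old_pid):
            return False
    except (IOError, ValueError):
        pass  # missing or corrupt PID file: treat as not running
    with open(path, "w") as f:
        f.write(str(os.getpid()))
    return True


# Demonstration with a throwaway location; a real script would use a fixed path.
pidfile = os.path.join(tempfile.mkdtemp(), "script.pid")
first = acquire_pidfile(pidfile)   # no file yet, so this launch takes ownership
second = acquire_pidfile(pidfile)  # recorded PID (ours) is alive, so this is refused
print(first, second)
```

In the real script, the top of the file would call acquire_pidfile with a fixed path and exit straight away when it returns False, so each per-minute cron launch either becomes the running instance or does nothing.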

python cron python-2.7

the*_*t_1

2017 08-19

4
votes

2
answers

4102
views

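Format URLs in a list so they all have a trailing slash in Python

There are some similar questions on Stack Overflow, but none that are exactly what I'm after, and my various attempts all seem to fail.

I have a list of URLs; some have a trailing slash and some don't. I want to check them and add a slash to the ones that lack it.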

url_list = ['http://google.com/somedirectory/', 'http://google.com/someotherdirectory/', 'http://google.com/anotherdirectory', 'http://google.com/yetanotherdirectory']

for url in url_list:
    if url[len(url)-1] != "/":
        url = url + "/"
    else:
        url = url

print url_list

This returns the same list (the last two URLs still have no trailing slash):

['http://google.com/somedirectory/', 'http://google.com/someotherdirectory/', 'http://google.com/anotherdirectory', 'http://google.com/yetanotherdirectory']

Why isn't this working? Any ideas?

Thanks :)
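
The likely cause is that the loop only rebinds the local name url to a new string on each pass; the list still refers to the original strings, so url_list never changes. A minimal sketch of one fix builds a new list instead (the name fixed_urls is hypothetical):

```python
url_list = [
    'http://google.com/somedirectory/',
    'http://google.com/someotherdirectory/',
    'http://google.com/anotherdirectory',
    'http://google.com/yetanotherdirectory',
]

# Build a new list; each URL is kept as-is if it already ends with a slash.
fixed_urls = [url if url.endswith('/') else url + '/' for url in url_list]

print(fixed_urls)
```

Using endswith also handles an empty string safely, whereas indexing the last character of an empty string would raise an IndexError.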

python

the*_*t_1

1
vote

1
answer

1719
views