How to find a website's backlinks in Python

Question

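I'm a bit stuck: I want to find a website's backlinks, but I can't work out how to do it. Here is what I have so far, using a regular expression: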

readh = BeautifulSoup(urllib.urlopen("http://www.google.com/").read()).findAll("a",href=re.compile("^http"))

By backlinks I mean the links that start with http but do not point to Google. I can't figure out how to filter for that.

Answer 1

7st*_*tud 4

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import style
import re

html = """
<div>hello</div>
<a href="/index.html">Not this one</a>
<a href="http://google.com">Link 1</a>
<a href="http:/amazon.com">Link 2</a>
"""

def processor(tag):
    # Keep only tags that have an href which does not contain "google".
    href = tag.get('href')
    if not href:
        return False
    return "google" not in href

soup = BeautifulSoup(html)
# A tag must pass processor() and have an href that starts with "http".
back_links = soup.findAll(processor, href=re.compile(r"^http"))
print back_links

--output:--
[<a href="http:/amazon.com">Link 2</a>]
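
Note that a substring test such as `href.find("google") == -1` also rejects unrelated URLs that merely contain the word, for example `http://example.com/google-tips`. Below is a minimal sketch of a stricter check that compares host names instead. It assumes Python 3 and the standard `urllib.parse` module; the helper name `is_external` and the sample URLs are hypothetical and not part of the answer above.

```python
from urllib.parse import urlparse


def is_external(href, own_domain):
    """Return True if href is an http(s) URL whose host lies outside own_domain."""
    parts = urlparse(href)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    if not host:
        # Relative links and malformed URLs such as "http:/amazon.com" have no host.
        return False
    own_domain = own_domain.lower()
    # A host belongs to the site if it equals the domain or is a subdomain of it.
    return not (host == own_domain or host.endswith("." + own_domain))


print(is_external("http://google.com", "google.com"))
print(is_external("http://www.google.com/search", "google.com"))
print(is_external("http://example.com/google-tips", "google.com"))
print(is_external("/index.html", "google.com"))
```

Only the third call should report an external link. Whether subdomains count as part of the site is a design choice; this sketch treats them as internal.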

However, it may be more efficient to first get all the links that start with http, and then search those for the ones whose href does not contain "google":

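# Step 1: every <a> tag whose href starts with "http".
http_links = soup.findAll('a', href=re.compile(r"^http"))
# Step 2: drop the ones whose href contains "google".
results = [a for a in http_links if 'google' not in a['href']]
print results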

--output:--
[<a href="http:/amazon.com">Link 2</a>]
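
The same two-step filter can also be sketched without BeautifulSoup, assuming Python 3 and only the standard library's `html.parser`. The class name `LinkCollector` is hypothetical, and the sample HTML is a cleaned-up variant of the one above (with `http://amazon.com` written out in full), so treat this as an illustration and not as the answer's own method.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


html = """
<div>hello</div>
<a href="/index.html">Not this one</a>
<a href="http://google.com">Link 1</a>
<a href="http://amazon.com">Link 2</a>
"""

collector = LinkCollector()
collector.feed(html)

# Step 1: keep the links that start with "http".
http_links = [h for h in collector.hrefs if h.startswith("http")]
# Step 2: drop the ones that contain "google".
results = [h for h in http_links if "google" not in h]
print(results)
```

This version returns plain href strings instead of tag objects, which is usually enough when only the target URLs are needed.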
