Repeating the process of following links on a website (BeautifulSoup)


I am writing Python code to get all the 'a' tags from a URL using Beautiful Soup, take the link at position 3, follow that link, and then repeat the whole process about 18 times. The code below repeats the process twice; I can't figure out how to repeat the same process 18 times in a loop. Any help would be appreciated.

import urllib
from BeautifulSoup import BeautifulSoup

# fetch the starting page and collect every href into a list
htm1 = urllib.urlopen('https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html').read()
soup = BeautifulSoup(htm1)
tags = soup('a')
list1 = list()
for tag in tags:
    x = tag.get('href', None)
    list1.append(x)

M = list1[2]  # the link at position 3

# follow that link and collect the hrefs on the second page
htm2 = urllib.urlopen(M).read()
soup = BeautifulSoup(htm2)
tags1 = soup('a')
list2 = list()
for tag1 in tags1:
    x2 = tag1.get('href', None)
    list2.append(x2)

y = list2[2]  # the link at position 3 again
print y

OK, I just wrote this code and it runs, but I get the same 4 links in the results. It looks like there is something wrong in the loop (note: I am trying to loop 4 times).

import urllib
from BeautifulSoup import BeautifulSoup

list1 = list()
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'

for i in range(4):  # repeat 4 times
    htm2 = urllib.urlopen(url).read()
    soup1 = BeautifulSoup(htm2)
    tags1 = soup1('a')
    for tag1 in tags1:
        x2 = tag1.get('href', None)
        list1.append(x2)
    y = list1[2]
    if len(x2) < 3:  # no 3rd link
        break  # exit the loop
    else:
        url = y
    print y

jfs*_*jfs 8

"I can't come up with a way to loop so that the same process repeats 18 times."

To repeat something 18 times in Python, you can use a for _ in range(18) loop:

#!/usr/bin/env python2
from urllib2 import urlopen
from urlparse import urljoin
from bs4 import BeautifulSoup # $ pip install beautifulsoup4

url = 'http://example.com'
for _ in range(18):  # repeat 18 times
    soup = BeautifulSoup(urlopen(url))
    a = soup.find_all('a', href=True)  # all <a href> links
    if len(a) < 3:  # no 3rd link
        break  # exit the loop
    url = urljoin(url, a[2]['href'])  # 3rd link, note: ignore <base href>
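This works because the list of links a is rebuilt from scratch on every pass, so a[2] is always the 3rd link of the page just fetched; in the code from the question, list1 keeps growing across iterations, so list1[2] never changes. For reference, here is a rough Python 3 port (a sketch, assuming the beautifulsoup4 package; in Python 3, urllib2 and urlparse became urllib.request and urllib.parse, and it is best to name a parser explicitly, such as the stdlib 'html.parser'):

#!/usr/bin/env python3
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

url = 'http://example.com'
for _ in range(18):  # repeat 18 times
    soup = BeautifulSoup(urlopen(url), 'html.parser')  # stdlib parser
    a = soup.find_all('a', href=True)  # all <a href> links
    if len(a) < 3:  # no 3rd link
        break  # exit the loop
    url = urljoin(url, a[2]['href'])  # 3rd link, note: ignore <base href>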