I'm writing Python code that uses BeautifulSoup to get all the 'a' tags from a URL, take the link at position 3, follow that link, and then repeat the process about 18 times. The code below repeats the process twice; I can't work out how to repeat it 18 times in a loop. Any help would be appreciated.
import re
import urllib
from BeautifulSoup import *

# first page: collect every href, then take the link at position 3
htm1 = urllib.urlopen('https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html').read()
soup = BeautifulSoup(htm1)
tags = soup('a')
list1 = list()
for tag in tags:
    x = tag.get('href', None)
    list1.append(x)
M = list1[2]

# second page: repeat the same steps on the link just found
htm2 = urllib.urlopen(M).read()
soup = BeautifulSoup(htm2)
tags1 = soup('a')
list2 = list()
for tag1 in tags1:
    x2 = tag1.get('href', None)
    list2.append(x2)
y = list2[2]
print y
OK, I just wrote this code and it runs, but I get the same 4 links in the output. It looks like something is wrong in the loop (note: I'm trying to loop 4 times).
import re
import urllib
from BeautifulSoup import *

list1 = list()
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
for i in range(4):              # repeat 4 times
    htm2 = urllib.urlopen(url).read()
    soup1 = BeautifulSoup(htm2)
    tags1 = soup1('a')
    for tag1 in tags1:
        x2 = tag1.get('href', None)
        list1.append(x2)
    y = list1[2]
    if len(x2) < 3:             # no 3rd link
        break                   # exit the loop
    else:
        url = y
    print y
I can't come up with a way to loop so that the same process repeats 18 times.
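A likely reason the second attempt prints the same link four times: list1 is never cleared, so it keeps growing across iterations and list1[2] always refers to the third link of the first page. A minimal sketch of one way around this, keeping the same Python 2 / BeautifulSoup 3 setup as the code above (the variable names here are illustrative, not from the original post):

# Sketch only: rebuild the list on every pass so position 3 always
# refers to the page that was just fetched.
import urllib
from BeautifulSoup import BeautifulSoup

url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
for i in range(4):                      # repeat 4 times
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    links = [tag.get('href', None) for tag in soup('a')]  # links of this page only
    if len(links) < 3:                  # no 3rd link on this page
        break
    url = links[2]                      # follow the 3rd link
    print url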
To repeat something 18 times in Python, you can use a for _ in range(18) loop:
#!/usr/bin/env python2
from urllib2 import urlopen
from urlparse import urljoin
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

url = 'http://example.com'
for _ in range(18):  # repeat 18 times
    soup = BeautifulSoup(urlopen(url))
    a = soup.find_all('a', href=True)  # all <a href> links
    if len(a) < 3:  # no 3rd link
        break  # exit the loop
    url = urljoin(url, a[2]['href'])  # 3rd link, note: ignore <base href>
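Two details in this answer are worth noting: urljoin resolves relative href values against the current page's URL, and href=True filters out anchors that have no href attribute, so a[2] really is the third usable link. A small variation (a sketch, not part of the original answer) that prints each page as it is visited:

#!/usr/bin/env python2
from urllib2 import urlopen
from urlparse import urljoin
from bs4 import BeautifulSoup

url = 'http://example.com'  # placeholder, as in the answer above
for step in range(18):
    soup = BeautifulSoup(urlopen(url))
    links = soup.find_all('a', href=True)
    if len(links) < 3:                      # fewer than 3 links: stop early
        break
    url = urljoin(url, links[2]['href'])    # follow the 3rd link
    print step + 1, url                     # show where we are after each hop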