Simple web crawler: I need to eliminate duplicate URLs from an array

Question

man*_*ans 3 python web-crawler web-scraping

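I am using an array to store URLs, and I need to eliminate any URL that appears in it more than once, because I don't want to crawl the same URL again: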

self.level = []  # list that stores the URLs found on the page
for link in self.soup.find_all('a'):
    self.level.append(link.get('href'))
    print(self.level)

I need to eliminate the duplicate URLs before crawling them.

Answer 1

ale*_*cxe 7

Maintain a set of URLs:

self.level = set()
for link in self.soup.find_all('a'):
    self.level.add(link.get('href'))
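
A set on its own does not keep the order in which the links were found, and link.get('href') returns None for anchors that have no href, so None can end up in the set. Below is a minimal sketch of one way to handle both, assuming relative links should be resolved against the page URL and that fragments (#...) can be ignored for crawling purposes. The names unique_links and base_url are hypothetical and not part of the original code:

```python
from urllib.parse import urljoin, urldefrag


def unique_links(hrefs, base_url):
    """Return absolute URLs from hrefs without duplicates, in first-seen order."""
    seen = set()      # fast membership test for URLs already collected
    ordered = []      # preserves the order in which URLs were first seen
    for href in hrefs:
        if not href:  # skip None or empty href values
            continue
        # Resolve relative links and drop the #fragment part (an assumption:
        # two URLs differing only by fragment point at the same page).
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered


# Example input standing in for [link.get('href') for link in soup.find_all('a')]
hrefs = ["/a", "/b", "/a", None, "http://example.com/b#top"]
print(unique_links(hrefs, "http://example.com/"))
```

If order does not matter and every href is already absolute, the plain set shown above is enough.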

Archived:	10 years, 10 months ago
Views:	654
Last recorded:	10 years, 10 months ago