2 python url beautifulsoup web-crawler web-scraping
I'm working on a project that needs to extract all links from a website. With this code I get all the links from a single URL:
import requests
from bs4 import BeautifulSoup, SoupStrainer
source_code = requests.get('https://stackoverflow.com/')
soup = BeautifulSoup(source_code.content, 'lxml')
links = []
for link in soup.find_all('a'):
    links.append(str(link))
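The problem is that if I want to extract all URLs, I have to write another for loop, and then another one, and so on. I want to extract every URL that exists on this website and its subdomains. Is there a way to do this without writing nested loops? Even with nested for loops, I don't know how many I would need to get all the URLs.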
Well, it took about 30 minutes to find a solution, and I found a simple and effective way to do this.
As @αԋɱҽԃ-αмєяιcαη mentioned, if your website links to big sites such as Google, the crawl will not stop until your memory fills up with data.
So there are some steps you should consider.

Here is some sample code that should work fine. I actually tested it, and it was fun to try:

import re

import requests
from bs4 import BeautifulSoup

source_code = requests.get('https://stackoverflow.com/')
soup = BeautifulSoup(source_code.content, 'lxml')
data = []
links = []


def remove_duplicates(l):  # keeps only absolute http(s) URLs; despite the name, duplicates are not removed
    for item in l:
        match = re.search(r"(?P<url>https?://[^\s]+)", item)
        if match is not None:
            links.append(match.group("url"))


for link in soup.find_all('a', href=True):
    data.append(str(link.get('href')))
remove_duplicates(data)
while True:
    try:
        for link in links:  # links grows while it is being iterated
            for j in soup.find_all('a', href=True):
                temp = []
                source_code = requests.get(link)  # note: re-fetches the same page for every anchor
                soup = BeautifulSoup(source_code.content, 'lxml')
                temp.append(str(j.get('href')))
                remove_duplicates(temp)

                if len(links) > 162:  # set limitation to number of URLs
                    break
            if len(links) > 162:
                break
        if len(links) > 162:
            break
    except Exception as e:
        print(e)
        if len(links) > 162:
            break

for url in links:
    print(url)

The output will be:

https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f
https://stackoverflow.com/users/signup?ssrc=head&returnurl=%2fusers%2fstory%2fcurrent
https://stackoverflow.com
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
https://stackoverflow.com/users/signup?ssrc=site_switcher&returnurl=%2fusers%2fstory%2fcurrent
https://stackoverflow.com/users/login?ssrc=site_switcher&returnurl=https%3a%2f%2fstackoverflow.com%2f
https://stackexchange.com/sites
https://stackoverflow.blog
https://stackoverflow.com/legal/cookie-policy
https://stackoverflow.com/legal/privacy-policy
https://stackoverflow.com/legal/terms-of-service/public
https://stackoverflow.com/teams
https://stackoverflow.com/teams
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://www.g2.com/products/stack-overflow-for-teams/
https://www.g2.com/products/stack-overflow-for-teams/
https://www.fastcompany.com/most-innovative-companies/2019/sectors/enterprise
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
/sf/ask/3911916011/#55885729
https://insights.stackoverflow.com/
https://stackoverflow.com
https://stackoverflow.com
https://stackoverflow.com/jobs
https://stackoverflow.com/jobs/directory/developer-jobs
https://stackoverflow.com/jobs/salary
https://www.stackoverflowbusiness.com
https://stackoverflow.com/teams
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/enterprise
https://stackoverflow.com/company/about
https://stackoverflow.com/company/about
https://stackoverflow.com/company/press
https://stackoverflow.com/company/work-here
https://stackoverflow.com/legal
https://stackoverflow.com/legal/privacy-policy
https://stackoverflow.com/company/contact
https://stackexchange.com
https://stackoverflow.com
https://serverfault.com
https://superuser.com
https://webapps.stackexchange.com
https://askubuntu.com
https://webmasters.stackexchange.com
https://gamedev.stackexchange.com
https://tex.stackexchange.com
https://softwareengineering.stackexchange.com
https://unix.stackexchange.com
https://apple.stackexchange.com
https://wordpress.stackexchange.com
https://gis.stackexchange.com
https://electronics.stackexchange.com
https://android.stackexchange.com
https://security.stackexchange.com
https://dba.stackexchange.com
https://drupal.stackexchange.com
https://sharepoint.stackexchange.com
https://ux.stackexchange.com
https://mathematica.stackexchange.com
https://salesforce.stackexchange.com
https://expressionengine.stackexchange.com
https://pt.stackoverflow.com
https://blender.stackexchange.com
https://networkengineering.stackexchange.com
https://crypto.stackexchange.com
https://codereview.stackexchange.com
https://magento.stackexchange.com
https://softwarerecs.stackexchange.com
https://dsp.stackexchange.com
https://emacs.stackexchange.com
https://raspberrypi.stackexchange.com
https://ru.stackoverflow.com
https://codegolf.stackexchange.com
https://es.stackoverflow.com
https://ethereum.stackexchange.com
https://datascience.stackexchange.com
https://arduino.stackexchange.com
https://bitcoin.stackexchange.com
https://sqa.stackexchange.com
https://sound.stackexchange.com
https://windowsphone.stackexchange.com
https://stackexchange.com/sites#technology
https://photo.stackexchange.com
https://scifi.stackexchange.com
https://graphicdesign.stackexchange.com
https://movies.stackexchange.com
https://music.stackexchange.com
https://worldbuilding.stackexchange.com
https://video.stackexchange.com
https://cooking.stackexchange.com
https://diy.stackexchange.com
https://money.stackexchange.com
https://academia.stackexchange.com
https://law.stackexchange.com
https://fitness.stackexchange.com
https://gardening.stackexchange.com
https://parenting.stackexchange.com
https://stackexchange.com/sites#lifearts
https://english.stackexchange.com
https://skeptics.stackexchange.com
https://judaism.stackexchange.com
https://travel.stackexchange.com
https://christianity.stackexchange.com
https://ell.stackexchange.com
https://japanese.stackexchange.com
https://chinese.stackexchange.com
https://french.stackexchange.com
https://german.stackexchange.com
https://hermeneutics.stackexchange.com
https://history.stackexchange.com
https://spanish.stackexchange.com
https://islam.stackexchange.com
https://rus.stackexchange.com
https://russian.stackexchange.com
https://gaming.stackexchange.com
https://bicycles.stackexchange.com
https://rpg.stackexchange.com
https://anime.stackexchange.com
https://puzzling.stackexchange.com
https://mechanics.stackexchange.com
https://boardgames.stackexchange.com
https://bricks.stackexchange.com
https://homebrew.stackexchange.com
https://martialarts.stackexchange.com
https://outdoors.stackexchange.com
https://poker.stackexchange.com
https://chess.stackexchange.com
https://sports.stackexchange.com
https://stackexchange.com/sites#culturerecreation
https://mathoverflow.net
https://math.stackexchange.com
https://stats.stackexchange.com
https://cstheory.stackexchange.com
https://physics.stackexchange.com
https://chemistry.stackexchange.com
https://biology.stackexchange.com
https://cs.stackexchange.com
https://philosophy.stackexchange.com
https://linguistics.stackexchange.com
https://psychology.stackexchange.com
https://scicomp.stackexchange.com
https://stackexchange.com/sites#science
https://meta.stackexchange.com
https://stackapps.com
https://api.stackexchange.com
https://data.stackexchange.com
https://stackoverflow.blog?blb=1
https://www.facebook.com/officialstackoverflow/
https://twitter.com/stackoverflow
https://linkedin.com/company/stack-overflow
https://creativecommons.org/licenses/by-sa/4.0/
https://stackoverflow.blog/2009/06/25/attribution-required/
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising

Process finished with exit code 0

I set the limit to 162; you can increase it as much as you like, as long as your memory allows.
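
For anyone who wants to avoid nested loops entirely, here is a minimal sketch of a queue-based (breadth-first) crawl. It is not the code from the answer above, just one way the idea could look. A `set` removes duplicates, `urljoin` resolves relative links, and only hosts equal to the start host or ending in `.` plus that host are followed, so links to Google and other large sites are ignored. The names `crawl`, `same_site`, `fetch_html`, and the `fetch` and `max_urls` parameters are my own assumptions, and `html.parser` is used instead of `lxml` only to avoid an extra dependency.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def same_site(url, root_host):
    """True if url is http(s) and its host is root_host or a subdomain of it."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    return parts.scheme in ("http", "https") and (
        host == root_host or host.endswith("." + root_host)
    )


def fetch_html(url):
    """Default fetcher (an assumption): a plain GET with a timeout."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def crawl(start_url, fetch=fetch_html, max_urls=162):
    """Breadth-first crawl limited to start_url's host and its subdomains."""
    root_host = urlparse(start_url).hostname.lower()
    seen = {start_url}          # every URL ever queued, used for de-duplication
    queue = deque([start_url])  # pages still waiting to be fetched
    found = []                  # successfully fetched URLs, in visit order

    while queue and len(found) < max_urls:
        page_url = queue.popleft()
        try:
            html = fetch(page_url)
        except Exception as exc:  # skip pages that fail to load
            print(f"skipping {page_url}: {exc}")
            continue
        found.append(page_url)

        soup = BeautifulSoup(html, "html.parser")
        for anchor in soup.find_all("a", href=True):
            # Resolve relative hrefs against the page and drop the #fragment.
            url, _ = urldefrag(urljoin(page_url, anchor["href"]))
            if url not in seen and same_site(url, root_host):
                seen.add(url)
                queue.append(url)
    return found
```

Calling `crawl('https://stackoverflow.com/')` would fetch at most 162 pages and return their URLs without repeats. The queue replaces the unknown number of nested loops: each fetched page only adds new work to the end of the queue. Note that starting from a `www.` host would match only subdomains of that `www.` host, and that this sketch does not handle `robots.txt` or request delays.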