SIM*_*SIM 9 python function beautifulsoup web-scraping python-3.x
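I've written a script in Python that scrapes all the names and their associated links from the landing page of a website, using a function called get_links(). I then created another function, get_info(), which visits each detail page (using the links collected by the first function) and scrapes the phone number from there.

If my only goal were to parse these two items I would not need the second function at all, because both are already available on the landing page. However, I want the parser to print each name (passed in from the first function) together with its phone number inside the second function. Most importantly, I do not want to remove the for loop defined in the second function. If the for loop were not there, the problem would not arise: without it I can already get the desired output.

This is my script so far: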
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://potguide.com/alaska/marijuana-dispensaries/"

def get_links(link):
    session = requests.Session()
    session.headers['User-Agent'] = 'Mozilla/5.0'
    r = session.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for items in soup.select("#StateStores .basic-listing"):
        name = items.select_one("h4 a").text
        namelink = urljoin(link,items.select_one("h4 a").get("href"))  ##making it a fully qualified url
        get_info(session,name,namelink)  ##passing session in order to reuse it

def get_info(session,title,url):
    r = session.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    for items in soup.select("ul.list-unstyled"):  ##if I did not use for loop I could get the output as desired.
        try:
            phone = items.select_one("a[href^='tel:']").text
        except:
            phone = ""
        print(title,phone)

if __name__ == '__main__':
    get_links(url)
The output I'm getting:
AK Frost
AK Frost
AK Frost
AK Frost
AK Frost
AK Frost (907) 563-9333
AK Frost
AK Frost
AK Frost (907) 563-9333
AK Frost
AK Fuzzy Budz
AK Fuzzy Budz (907) 644-2838
AK Fuzzy Budz
AK Fuzzy Budz
AK Fuzzy Budz (907) 644-2838
My expected output:
AK Frost (907) 563-9333
AK Fuzzy Budz (907) 644-2838
If the goal is just to get the expected output, this should work:
def get_info(session,title,url):
    r = session.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    for items in soup.select("ul.list-unstyled"):
        try:
            phone = items.select_one("a[href^='tel:']").text
        except AttributeError:
            # select_one() returned None: no tel: link in this list,
            # so skip the item and continue
            continue
        else:
            # no exception was raised, so you have the phone
            print(title,phone)
            break
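The same skip-until-found idea can be tried offline. The following is a minimal sketch, assuming a hypothetical fragment of markup (SAMPLE_HTML is invented for illustration and is not the real page source) and a hypothetical helper named first_phone that returns the number instead of printing it. It uses the built-in "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a detail page (an assumption, not the
# real page source): several lists, only some of which contain a tel: link.
SAMPLE_HTML = """
<ul class="list-unstyled"><li>Hours</li></ul>
<ul class="list-unstyled"><li><a href="tel:9075639333">(907) 563-9333</a></li></ul>
<ul class="list-unstyled"><li>Address</li></ul>
<ul class="list-unstyled"><li><a href="tel:9075639333">(907) 563-9333</a></li></ul>
"""

def first_phone(html):
    """Return the text of the first tel: link found in any list, or ''."""
    soup = BeautifulSoup(html, "html.parser")
    for items in soup.select("ul.list-unstyled"):
        link = items.select_one("a[href^='tel:']")
        if link is None:
            # This list has no phone link: skip it and keep looking.
            continue
        # Returning here plays the role of print + break in the loop above.
        return link.text
    return ""

print("AK Frost", first_phone(SAMPLE_HTML))
```

Checking for None explicitly avoids the try/except altogether, and returning the value leaves the decision about printing to the caller.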
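I think the ul.list-unstyled selector is too broad for the detail pages: it matches a lot of elements you don't actually want.

If you really only need the phone number, you can search directly for a tags whose href starts with "tel:". The remaining problem is that these pages list more than one number this way, usually two, one of which is not visible. The visible one always seems to sit under div.col-md-3. I tried this: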
def get_info(session,title,url):
    r = session.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    for a_phone in soup.select("div.col-md-3 a[href^='tel:']"):
        print(title, a_phone.text)
and got the following results:
AK Frost (907) 563-9333
AK Fuzzy Budz (907) 644-2838
AK Joint (907) 522-5222
AK Slow Burn (907) 868-1450
Alaska Fireweed (907) 258-9333
Alaskabuds (907) 334-6420
Alaskan Leaf (907) 770-0262
Alaska's Green Light District (907) 644-2839
AM Delight (907) 229-1730
Arctic Herbery (907) 222-1466
Cannabaska (907) 375-9333
Catalyst Cannabis Company (907) 344-0668
Dankorage (907) 279-3265
Enlighten Alaska (907) 290-8559
Great Northern Cannabis (907) 929-9333
Hillside Natural Wellness (907) 868-8639
Hollyweed 907 (907) 929-3331
Raspberry Roots (907) 522-2450
Satori (907) 222-5420
The House of Green (907) 929-3105
Uncle Herb's (907) 561-4372
The Green Spot (907) 354-7044
Denali's Cannabis Cache (907) 683-2633
GOOD (907) 452-5463
Goodsinse (907) 347-7689
Grass Station 49 (907) 374-4420
Green Life Supply (907) 374-4769
One Hit Wonder (844) 420-1448
Pakalolo Supply Company (907) 479-9000
Rebel Roots (907) 455-4055
True Dank (907) 451-4516
The Herbal Cache (907) 783-0420
Denali 420 Recreationals (907) 892-9333
Glacier Valley Shoppe (907) 419-7943
Green Elephant (907) 290-8400
Rainforest Farms (907) 209-2670
The Fireweed Factory (907) 957-2670
Red Run Cannabis Company (907) 283-0800
Cannabis Corner (907) 225-4420
Rainforest Cannabis (907) 247-9333
The Stoney Moose (907) 617-8973
Chena Cannabis (907) 488-0489
The 420 (907) 772-3673
Green Leaf (907) 623-0332
Weed Dudes (907) 623-0605
Remedy Shoppe (907) 983-3345
Fat Tops (907) 953-2470
High Bush Buds (907) 953-9393
Pine Street Cannabis Company (907) 260-3330
Permafrost Distributors (907) 260-7584
Hilltop Premium Green (907) 745-4425
The High Expedition Company (907) 733-0911
Herbal Outfitters (907) 835-4201
Bad Gramm3r (907) 357-0420
Green Degree (907) 376-3155
Green Jar (907) 631-3800
Rosebuds Shatter House (907) 376-9334
Happy Cannabis (907) 305-0292
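To see why scoping the selector to div.col-md-3 removes the duplicate, here is a small sketch run against invented markup. SAMPLE_HTML, the hidden-phone class, and the helper names all_phones and visible_phones are assumptions made for illustration; only the two selectors come from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption): the same number appears twice, once in
# a block that is not displayed and once inside div.col-md-3.
SAMPLE_HTML = """
<div class="hidden-phone"><a href="tel:9075639333">(907) 563-9333</a></div>
<div class="col-md-3">
  <ul class="list-unstyled">
    <li><a href="tel:9075639333">(907) 563-9333</a></li>
  </ul>
</div>
"""

def all_phones(html):
    """Collect the text of every tel: link anywhere in the document."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.select("a[href^='tel:']")]

def visible_phones(html):
    """Collect only the tel: links nested under div.col-md-3."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True)
            for a in soup.select("div.col-md-3 a[href^='tel:']")]

print(len(all_phones(SAMPLE_HTML)))   # both copies are matched
print(visible_phones(SAMPLE_HTML))    # only the copy under div.col-md-3
```

If the real pages ever place more than one tel: link under div.col-md-3, the scoped selector would still return several entries, so taking only the first element of the list may be the safer choice.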