SIM*_*SIM 0 python beautifulsoup web-scraping python-3.x
I've written a script in Python to fetch some property titles and their corresponding email addresses from each container of a webpage. When I run the script it scrapes the titles fine, but for the email address all it gets is the text attached to the SEND EMAIL button. How can I get the actual email addresses that must be there, given that pressing the SEND EMAIL button really does send an email? Any help with this will be highly appreciated.
Link to the website
This is what I have tried so far:
import requests
from bs4 import BeautifulSoup

URL = "use_above_link"

def Get_Leads(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for items in soup.select(".media"):
        title = items.select_one(".item-name").text.strip()
        try:
            email = items.select_one("a[alt^='Contact']").text.strip()
        except:
            email = ""
        print(title, email)

if __name__ == '__main__':
    Get_Leads(URL)
The results I'm getting are like:
Singapore Immigration Specialist SEND EMAIL
Faithful+Gould Pte Ltd SEND EMAIL
PsyAsia International SEND EMAIL
Activpayroll SEND EMAIL
Precursor SEND EMAIL
Instead of the text SEND EMAIL, I want to scrape the email address itself.
The website itself does not contain the emails in its code, so you cannot scrape them directly. What you can do is the following:
I played around with this idea and it worked quite well for me, as I was able to scrape the email addresses of many of the companies. Here is what I did:
I modified the Get_Leads method. Now Get_Leads also scrapes each company's website URL and calls a scrape_contact_emails(link) method that returns the email address.
def Get_Leads(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    companies = []  # (title, website) pairs collected from the listing page
    for items in soup.select(".media"):
        title = items.select_one(".item-name").text.strip()
        try:
            website = items.select_one("a[alt^='Visit Website']")['href']
        except:
            website = ""
        companies.append([title, website])
    for company, site in companies:
        try:
            print("Company: " + company + "\nWebsite: " + site + "\n" + scrape_contact_emails(site) + "\n\n--------------------\n\n")
        except:
            pass
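For completeness, this modified version still relies on the imports and the `if __name__ == '__main__'` guard from the original question. A minimal wiring sketch (the placeholder URL is carried over from the question, and it assumes Get_Leads and scrape_contact_emails are defined in the same module) might look like this:

import requests
from bs4 import BeautifulSoup

URL = "use_above_link"   # placeholder link from the question

if __name__ == '__main__':
    # Assumes Get_Leads() above and scrape_contact_emails() below are defined.
    Get_Leads(URL)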
This is the method that scrapes email addresses from a company's website. It first searches the homepage for an email address, since an address used for contact purposes is most likely to appear there. If no email address is found, it looks for the URL of the "Contact Us" page and searches for an email address there instead.
import re

def scrape_contact_emails(link):
    res = requests.get(link)
    domain = link.split(".")   # e.g. "http://www.example.com/" -> ["http://www", "example", "com/"]
    mailaddr = link
    soup = BeautifulSoup(res.text, "lxml")
    links = soup.find_all("a")
    contact_link = ''
    final_result = ""
    try:
        # Check if there is any email address on the homepage.
        emails = soup.find_all(text=re.compile('.*@' + domain[1] + '.' + domain[2].replace("/", "")))
        emails.sort(key=len)
        print(emails[0].replace("\n", ""))
        final_result = emails[0]
    except:
        # Search for the Contact Us page's URL.
        try:
            flag = 0
            for link in links:
                if "contact" in link.get("href") or "Contact" in link.get("href") or "CONTACT" in link.get("href") or 'contact' in link.text or 'Contact' in link.text or 'CONTACT' in link.text:
                    if len(link.get("href")) > 2 and flag < 2:
                        flag = flag + 1
                        contact_link = link.get("href")
        except:
            pass
        # Rebuild the base URL and resolve the (possibly relative) contact link.
        domain = domain[0] + "." + domain[1] + "." + domain[2]
        if len(contact_link) < len(domain):
            domain = domain + contact_link.replace("/", "")
        else:
            domain = contact_link
        try:
            # Check if there is any email address on the Contact Us page.
            res = requests.get(domain)
            soup = BeautifulSoup(res.text, "lxml")
            emails = soup.find_all(text=re.compile('.*@' + mailaddr[7:].replace("/", "")))
            emails.sort(key=len)
            try:
                print(emails[0].replace("\n", ""))
                final_result = emails[0]
                return final_result
            except:
                pass
        except Exception:
            pass
    return final_result
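A possible refinement, not part of the original answer: rather than building the regex from the company's domain, you could match anything that looks like an email address, which also picks up contact addresses hosted on a different domain (for example a Gmail address). A minimal sketch of such a helper, assuming only requests and re:

import re
import requests

# Deliberately simple email pattern; it may still produce false positives
# (e.g. strings containing "@" inside asset filenames).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def find_any_emails(url):
    """Return all email-looking strings found in the raw HTML of `url`."""
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        return []
    # De-duplicate while preserving order of first appearance.
    seen, result = set(), []
    for match in EMAIL_RE.findall(html):
        if match.lower() not in seen:
            seen.add(match.lower())
            result.append(match)
    return result

Swapping this into scrape_contact_emails in place of the domain-specific pattern would trade some precision for recall.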
Here is a small portion of the results I got. I couldn't extract an email address for every company, since some websites have bot protection such as captchas. I'm pretty sure this code isn't perfect, it's just a prototype and a lot of improvements can be made, but hopefully it will help you.
info@zacknzul,com
Company: Zack & Zul Business Broker
Website: http://www.zacknzul.com/
--------------------
sales@ats.com.sg
Company: ATS IT Solutions Pte Ltd - Guarantees 100% Satisfaction & W...
Website: http://www.ats.com.sg
--------------------
Info@britcham.org.sg
Company: British Chamber of Commerce - Singapore
Website: http://www.britcham.org.sg/
--------------------
Company: International Enterprise Singapore
Website: http://www.iesingapore.gov.sg/
--------------------
Company: IBS Business Consulting Pte. Ltd
Website: http://www.consultibs.sg/
--------------------
Company: Positive Performance Consulting
Website: https://www.positiveconsulting.sg
--------------------
enquiries@jaba.com.sg
Company: Jacob Business Armour Pte Ltd
Website: http://www.jaba.com.sg/