SIM*_*SIM 0 python beautifulsoup web-scraping python-3.x
I've written a script in Python to fetch some property titles and their corresponding email addresses from each container of a webpage. When I run the script it scrapes the titles fine, but for the email address all it gets is the text attached to the SEND EMAIL button. How can I get the actual email addresses that must be there, given that pressing the SEND EMAIL button really does send an email? Any help with this will be highly appreciated.
Link to the website
This is what I have tried so far:
import requests
from bs4 import BeautifulSoup

URL = "use_above_link"

def Get_Leads(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for items in soup.select(".media"):
        title = items.select_one(".item-name").text.strip()
        try:
            email = items.select_one("a[alt^='Contact']").text.strip()
        except:
            email = ""
        print(title, email)

if __name__ == '__main__':
    Get_Leads(URL)
The results I'm getting are like:
Singapore Immigration Specialist SEND EMAIL
Faithful+Gould Pte Ltd SEND EMAIL
PsyAsia International SEND EMAIL
Activpayroll SEND EMAIL
Precursor SEND EMAIL
Instead of the text SEND EMAIL, I want to scrape the email address itself.
The website itself does not contain the emails in its code, so you cannot scrape them directly. What you can do is the following:
I played around with this idea and it worked quite well for me, as I was able to scrape the email addresses of many of the companies. Here is what I did:
I modified the Get_Leads method. Now Get_Leads also scrapes each company's website URL and calls a scrape_contact_emails(link) method that returns the email address.
def Get_Leads(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    companies = []  # (title, website) pairs collected from the listing page
    for items in soup.select(".media"):
        title = items.select_one(".item-name").text.strip()
        try:
            website = items.select_one("a[alt^='Visit Website']")['href']
        except:
            website = ""
        companies.append([title, website])
    for company, site in companies:
        try:
            print("Company: " + company + "\nWebsite: " + site + "\n" + scrape_contact_emails(site) + "\n\n--------------------\n\n")
        except:
            pass
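For completeness, this modified version still relies on the imports and the `if __name__ == '__main__'` guard from the original question. A minimal wiring sketch (the placeholder URL is carried over from the question, and it assumes Get_Leads and scrape_contact_emails are defined in the same module) might look like this:

import requests
from bs4 import BeautifulSoup

URL = "use_above_link"   # placeholder link from the question

if __name__ == '__main__':
    # Assumes Get_Leads() above and scrape_contact_emails() below are defined.
    Get_Leads(URL)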
This is the method that scrapes email addresses from a company's website. It first searches the homepage for an email address, since an address used for contact purposes is most likely to appear there. If no email address is found, it looks for the URL of the "Contact Us" page and searches for an email address there instead.
import re

def scrape_contact_emails(link):
    res = requests.get(link)
    domain = link.split(".")   # e.g. "http://www.example.com/" -> ["http://www", "example", "com/"]
    mailaddr = link
    soup = BeautifulSoup(res.text, "lxml")
    links = soup.find_all("a")
    contact_link = ''
    final_result = ""
    try:
        # Check if there is any email address on the homepage.
        emails = soup.find_all(text=re.compile('.*@' + domain[1] + '.' + domain[2].replace("/", "")))
        emails.sort(key=len)
        print(emails[0].replace("\n", ""))
        final_result = emails[0]
    except:
        # Search for the Contact Us page's URL.
        try:
            flag = 0
            for link in links:
                if "contact" in link.get("href") or "Contact" in link.get("href") or "CONTACT" in link.get("href") or 'contact' in link.text or 'Contact' in link.text or 'CONTACT' in link.text:
                    if len(link.get("href")) > 2 and flag < 2:
                        flag = flag + 1
                        contact_link = link.get("href")
        except:
            pass
        # Rebuild the base URL and resolve the (possibly relative) contact link.
        domain = domain[0] + "." + domain[1] + "." + domain[2]
        if len(contact_link) < len(domain):
            domain = domain + contact_link.replace("/", "")
        else:
            domain = contact_link
        try:
            # Check if there is any email address on the Contact Us page.
            res = requests.get(domain)
            soup = BeautifulSoup(res.text, "lxml")
            emails = soup.find_all(text=re.compile('.*@' + mailaddr[7:].replace("/", "")))
            emails.sort(key=len)
            try:
                print(emails[0].replace("\n", ""))
                final_result = emails[0]
                return final_result
            except:
                pass
        except Exception:
            pass
    return final_result
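A possible refinement, not part of the original answer: rather than building the regex from the company's domain, you could match anything that looks like an email address, which also picks up contact addresses hosted on a different domain (for example a Gmail address). A minimal sketch of such a helper, assuming only requests and re:

import re
import requests

# Deliberately simple email pattern; it may still produce false positives
# (e.g. strings containing "@" inside asset filenames).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def find_any_emails(url):
    """Return all email-looking strings found in the raw HTML of `url`."""
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        return []
    # De-duplicate while preserving order of first appearance.
    seen, result = set(), []
    for match in EMAIL_RE.findall(html):
        if match.lower() not in seen:
            seen.add(match.lower())
            result.append(match)
    return result

Swapping this into scrape_contact_emails in place of the domain-specific pattern would trade some precision for recall.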
Here is a small portion of the results I got. I couldn't extract an email address for every company, since some websites have bot protection such as captchas. I'm pretty sure this code isn't perfect, it's just a prototype and a lot of improvements can be made, but hopefully it will help you.
info@zacknzul,com
Company: Zack & Zul Business Broker
Website: http://www.zacknzul.com/
--------------------
sales@ats.com.sg
Company: ATS IT Solutions Pte Ltd - Guarantees 100% Satisfaction & W...
Website: http://www.ats.com.sg
--------------------
Info@britcham.org.sg
Company: British Chamber of Commerce - Singapore
Website: http://www.britcham.org.sg/
--------------------
Company: International Enterprise Singapore
Website: http://www.iesingapore.gov.sg/
--------------------
Company: IBS Business Consulting Pte. Ltd
Website: http://www.consultibs.sg/
--------------------
Company: Positive Performance Consulting
Website: https://www.positiveconsulting.sg
--------------------
enquiries@jaba.com.sg
Company: Jacob Business Armour Pte Ltd
Website: http://www.jaba.com.sg/