BeautifulSoup: getting the hrefs of a list - need to simplify the script - replacing multiprocessing

tha*_*nen 3 python beautifulsoup html-parsing web-scraping

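I have the following soup:

From it I want to extract the href, "some_url",

and the complete list of the pages listed on this page: https://www.catholic-hierarchy.org/diocese/laa.html

Note: there are many links to subpages, and I need to parse them as well. The goal for now is to fetch all the data: dioceses, URLs, descriptions, contact data, and so on.

The example below gets the URLs of all dioceses, fetches some information about each diocese, and builds the final dataframe. To speed up the process it uses multiprocessing.Pool.

But wait: how can I get this scraper to run without multiprocessing support? I want to run it in Colab, so I need to get rid of the multiprocessing part.

How can this be achieved?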

import pandas as pd  # needed for pd.DataFrame at the end of the script
import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool


def get_dioceses_urls(section_url):
    dioceses_urls = set()

    while True:
        print(section_url)

        soup = BeautifulSoup(
            requests.get(section_url, headers=headers).content, "lxml"
        )
        for a in soup.select('ul a[href^="d"]'):
            dioceses_urls.add(
                "https://www.catholic-hierarchy.org/diocese/" + a["href"]
            )

        # is there Next Page button?
        next_page = soup.select_one('a:has(img[alt="[Next Page]"])')
        if next_page:
            section_url = (
                "https://www.catholic-hierarchy.org/diocese/"
                + next_page["href"]
            )
        else:
            break

    return dioceses_urls


def get_diocese_info(url):
    print(url)

    soup = BeautifulSoup(requests.get(url, headers=headers).content, "html5lib")

    data = {
        "Title 1": soup.h1.get_text(strip=True),
        "Title 2": soup.h2.get_text(strip=True),
        "Title 3": soup.h3.get_text(strip=True) if soup.h3 else "-",
        "URL": url,
    }

    li = soup.find(
        lambda tag: tag.name == "li"
        and "type of jurisdiction:" in tag.text.lower()
        and tag.find() is None
    )
    if li:
        for l in li.find_previous("ul").find_all("li"):
            t = l.get_text(strip=True, separator=" ")
            if ":" in t:
                k, v = t.split(":", maxsplit=1)
                data[k.strip()] = v.strip()

    # get other info about the diocese
    # ...

    return data


if __name__ == "__main__":
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0"
    }

    # get main sections:
    url = "https://www.catholic-hierarchy.org/diocese/laa.html"
    soup = BeautifulSoup(
        requests.get(url, headers=headers).content, "html.parser"
    )

    main_sections = [url]
    for a in soup.select("a[target='_parent']"):
        main_sections.append(
            "https://www.catholic-hierarchy.org/diocese/" + a["href"]
        )

    all_data, dioceses_urls = [], set()
    with Pool() as pool:
        # get all dioceses urls:
        for urls in pool.imap_unordered(get_dioceses_urls, main_sections):
            dioceses_urls.update(urls)

        # get info about all dioceses:
        for info in pool.imap_unordered(get_diocese_info, dioceses_urls):
            all_data.append(info)

    # create dataframe from the info about dioceses
    df = pd.DataFrame(all_data).sort_values("Title 1")

    # save it to csv file
    df.to_csv("data.csv", index=False)
    print(df.head().to_markdown())

Update: here is what I get when I run the script on Colab:

https://www.catholic-hierarchy.org/diocese/laa.htmlhttps://www.catholic-hierarchy.org/diocese/lab.html

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "<ipython-input-1-f5ea34a0190f>", line 21, in get_dioceses_urls
    next_page = soup.select_one('a:has(img[alt="[Next Page]"])')
  File "/usr/local/lib/python3.7/dist-packages/bs4/element.py", line 1403, in select_one
    value = self.select(selector, limit=1)
  File "/usr/local/lib/python3.7/dist-packages/bs4/element.py", line 1528, in select
    'Only the following pseudo-classes are implemented: nth-of-type.')
NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.
"""

The above exception was the direct cause of the following exception:

NotImplementedError                       Traceback (most recent call last)
<ipython-input-1-f5ea34a0190f> in <module>
     81     with Pool() as pool:
     82         # get all dioceses urls:
---> 83         for urls in pool.imap_unordered(get_dioceses_urls, main_sections):
     84             dioceses_urls.update(urls)
     85 

/usr/lib/python3.7/multiprocessing/pool.py in next(self, timeout)
    746         if success:
    747             return value
--> 748         raise value
    749 
    750     __next__ = next                    # XXX

NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.
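A note on the traceback above: the NotImplementedError is raised by `soup.select_one(...)` inside the worker, not by the Pool itself. The message suggests that the installed Beautiful Soup is older than 4.7.0, the release that moved CSS selectors to soupsieve, so `:has()` is not available there; upgrading bs4 would probably be needed either way. Independently of that, the Pool can be dropped by replacing `pool.imap_unordered(func, items)` with the builtin `map(func, items)`, which calls the function one item at a time in the current process. The sketch below shows only that loop structure; `get_urls_stub` and `get_info_stub` are hypothetical stand-ins for `get_dioceses_urls` and `get_diocese_info`, so that it runs without network access.

```python
# Hypothetical stand-ins for get_dioceses_urls / get_diocese_info,
# used here only so the loop structure can run without network access.
def get_urls_stub(section):
    return {f"{section}/d1.html", f"{section}/d2.html"}


def get_info_stub(url):
    return {"URL": url, "Title 1": url.rsplit("/", 1)[-1]}


main_sections = ["laa", "lab"]
all_data, dioceses_urls = [], set()

# map() is lazy and sequential: no worker processes are started.
for urls in map(get_urls_stub, main_sections):
    dioceses_urls.update(urls)

# sorted() only makes the visiting order deterministic for this demo.
for info in map(get_info_stub, sorted(dioceses_urls)):
    all_data.append(info)

print(len(dioceses_urls), "urls,", len(all_data), "records")
```

In the original script this amounts to removing the `with Pool() as pool:` line and iterating over `map(get_dioceses_urls, main_sections)` and `map(get_diocese_info, dioceses_urls)`. It will be slower than the pooled version, since every request waits for the previous one.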

Bar*_*pus 8

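Here is one way to get that information asynchronously (it should work in a Colab notebook). I took the diocese URLs from a different part of the site (Structured View - world regions). I would expect the number of dioceses there to match the number in the alphabetical listing.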

from httpx import Client, AsyncClient, Limits
from bs4 import BeautifulSoup as bs
import pandas as pd
import re
from datetime import datetime
import asyncio
import nest_asyncio

nest_asyncio.apply()

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

big_df_list = []

def all_dioceses():
    dioceses = []
    root_links = [f'https://www.catholic-hierarchy.org/diocese/qview{x}.html' for x in range(1, 8)]
    with Client(headers=headers, timeout=60.0, follow_redirects=True) as client:
        for x in root_links:
            r = client.get(x)
            soup = bs(r.text)
            soup.select_one('ul#menu2').decompose()
            for link in soup.select('ul > li > a'):
                dioceses.append('https://www.catholic-hierarchy.org/diocese/' + link.get('href'))
    return dioceses
# print(all_dioceses())

async def get_diocese_info(url):
    async with AsyncClient(headers=headers, timeout=60.0, follow_redirects=True) as client:
        try:
            r = await client.get(url)
            soup = bs(r.text)
            d_name = soup.select_one('h1[align="center"]').get_text(strip=True)
            info_table = soup.select_one('div[id="d1"] > table')
            d_bishops = ' | '.join([x.get_text(strip=True) for x in info_table.select('td')[0].select('li')])
            d_extra_info = ' | '.join([x.get_text(strip=True) for x in info_table.select('td')[1].select('li')])
            big_df_list.append((d_name, d_bishops, d_extra_info, url))
            print('done', d_name)
        except Exception as e:
            print(url, e)

async def scrape_dioceses():
    start_time = datetime.now()
    tasks = asyncio.Queue()
    for x in all_dioceses():
        tasks.put_nowait(get_diocese_info(x))

    async def worker():
        while not tasks.empty():
            await tasks.get_nowait()

    await asyncio.gather(*[worker() for _ in range(100)])
    end_time = datetime.now()
    duration = end_time - start_time
    print('diocese scraping took', duration)

asyncio.run(scrape_dioceses())
df = pd.DataFrame(big_df_list, columns = ['Name', 'Bishops', 'Info', 'Url'])
print(df)

Terminal output:

done Eparchy of Mississauga (Syro-Malabar)
done Eparchy of Mar Addai of Toronto (Chaldean)
done Eparchy of Saint-Sauveur de Montr�al (Melkite Greek)
done Diocese of Calgary
done Archdiocese of Winnipeg
[...]
diocese scraping took 0:03:02.366096

Name    Bishops Info    Url
0   Eparchy of Mississauga (Syro-Malabar)   JoseKalluvelil, Bishop  Type of Jurisdiction: Eparchy | Elevated:22 December2018 | Immediately Subject to the Holy See | Syro-Malabar Catholic Church of the Chaldean Tradition | Country:Canada | Mailing Address: Syro-Malabar Apostolic Exarchate, 6630 Turner Valley Rd., Mississauga, ON L5V 2P1, Canada | Telephone: (905)858-8200 | Fax: 858-8208    https://www.catholic-hierarchy.org/diocese/dmism.html
1   Eparchy of Mar Addai of Toronto (Chaldean)  Robert SaeedJarjis, Bishop | Bawai (Ashur)Soro, Bishop Emeritus Type of Jurisdiction: Eparchy | Erected:10 June2011 | Immediately Subject to the Holy See | Chaldean Catholic Church of the Chaldean Tradition | Country:Canada | Conference Region:Ontario | Mailing Address: 2 High Meadow Place, Toronto, ON M9L 2Z5, Canada | Telephone: (416)746-5816 | Fax: 746-5850  https://www.catholic-hierarchy.org/diocese/dtoch.html
2   Eparchy of Saint-Sauveur de Montr�al (Melkite Greek)    MiladJawish, B.S., Bishop   Type of Jurisdiction: Eparchy | Elevated:1 September1984 | Immediately Subject to the Holy See | Melkite Greek Catholic Church of the Byzantine Tradition | Country:Canada | Conference Region:Quebec | Web Site:http://www.melkite.com/ | Mailing Address: 10025 boul. de l'Arcadie, Montreal, QC H4N 2S1, Canada | Telephone: (514)272.6430 | Fax: 202.1274   https://www.catholic-hierarchy.org/diocese/dmome.html
3   Diocese of Calgary  William TerrenceMcGrattan, Bishop | Frederick BernardHenry, Bishop Emeritus Type of Jurisdiction: Diocese | Erected:30 November1912 | Metropolitan: Archdiocese ofEdmonton | Rite: Latin (or Roman) | Province: Alberta | Country:Canada | Square Kilometers: 110,500 (42,680 Square Miles) | Conference Region:West (Ouest) | Catholic Directory Abbreviation: Cal | Official Web Site:http://www.calgarydiocese.ca/ | Mailing Address: Catholic Pastoral Centre, Room 290, The Iona Building, 120-17th Avenue S.W., Calgary, AB T2S 2T2, Canada | Telephone: (403)218-5528 | Fax: 264-0272    https://www.catholic-hierarchy.org/diocese/dcalg.html
4   Archdiocese of Winnipeg Richard JosephGagnon, Archbishop | James VernonWeisgerber, Archbishop Emeritus  Type of Jurisdiction: Archdiocese | Erected:4 December1915 | Immediately Subject to the Holy See | Rite: Latin (or Roman) | Province: Manitoba | Country:Canada | Square Kilometers: 116,405 (44,961 Square Miles) | Conference Region:West (Ouest) | Catholic Directory Abbreviation: W | Official Web Site:http://www.archwinnipeg.ca/ | Mailing Address: Chancery Office, 1495 Pembina Highway, Winnipeg, MB R3T 2C6, Canada | Telephone: (204)452-2227 | Fax: 475-4409  https://www.catholic-hierarchy.org/diocese/dwinn.html
... ... ... ... ...
2619    Archiepiscopal Exarchate of Krym (Ukrainian)    Vacant | Makariy BohdanLeniv, O.S.B.M., Apostolic Administrator | MykhayloBubniy, C.SS.R., Archiepiscopal Administrator Type of Jurisdiction: Archiepiscopal Exarchate | Split:13 February2014 | Metropolitan: Archeparchy ofKyiv-Halyč {Kiev} (Ukrainian) | Ukrainian Catholic Church of the Byzantine Tradition | Country:Ukraine | Mailing Address: vul. Schmidta 22/12, 65000 Odessa, Ukraina | Telephone: (0482)32.58.90 | Fax: 32.58.89   https://www.catholic-hierarchy.org/diocese/dkrym.html
2620    Diocese of Lutsk    VitaliySkomarovskyi, Bishop | MarkijanTrofym’yak, Bishop Emeritus   Type of Jurisdiction: Diocese | Split:28 October1925 | Metropolitan: Archdiocese ofLviv | Rite: Latin (or Roman) | Country:Ukraine | Square Kilometers: 40,190 (15,523 Square Miles) | Official Web Site:http://catholic.volyn.ua/ | Mailing Address: Kuria Diecezjalna, vul. Katedralna 17, 43016 Lutsk, Ukraina | Telephone: (0332)72.15.32 | Fax: (same) https://www.catholic-hierarchy.org/diocese/dluts.html
2621    Diocese of Stockholm    AndersArborelius, O.C.D., Cardinal, Bishop  Type of Jurisdiction: Diocese | Elevated:29 June1953 | Immediately Subject to the Holy See | Rite: Latin (or Roman) | Country:Sweden | Square Kilometers: 450,295 (173,926 Square Miles) | Official Web Site:https://www.katolskakyrkan.se | Mailing Address: Katolska Biskopsambetet, Gotgatan 68, P.O. Box 4114, S-102 62 Stockholm, Sverige | Telephone: (08)462.66.02 | Fax: 702.05.55  https://www.catholic-hierarchy.org/diocese/dstos.html
2622    Archeparchy of Diarbekir (Amida) (Chaldean) RamziGarmou, Ist. del Prado, Archbishop Type of Jurisdiction: Archeparchy | Elevated:3 January1966 | Chaldean Catholic Church of the Chaldean Tradition | Country:Turkey | Mailing Address: Archeveche Chaldeen, Hamalbasi Caddesi 20, Galatasaray, 34435 Beyoglu, Istanbul, Turkiye | Telephone: (0212)252.34.49 | Fax: (same) https://www.catholic-hierarchy.org/diocese/ddiar.html
2623    Eparchy of Kolomyia (Ukrainian) VasylIvasyuk, Bishop    Type of Jurisdiction: Eparchy | Split:12 September2017 | Metropolitan: Archeparchy ofIvano-Frankivsk [Stanislaviv] (Ukrainian) | Ukrainian Catholic Church of the Byzantine Tradition | Country:Ukraine | Square Kilometers: 14,000 (5,407 Square Miles) | Official Web Site:https://kolugcc.org.ua | Mailing Address: vul. Ivana Franka 29, 78200 Kolomyia, Ukraina | Telephone: (06891)19.707 https://www.catholic-hierarchy.org/diocese/dkolo.html
2624 rows × 4 columns

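As you can see, this code extracts the full information for about 2.6k dioceses in roughly 3 minutes, while using far fewer resources than multiprocessing or multithreading would.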
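The code above limits concurrency by starting 100 workers that pull coroutines from an `asyncio.Queue`. A different way to cap the number of requests in flight is an `asyncio.Semaphore`; this is a swapped-in technique, not what the answer uses. The following is a minimal sketch under that assumption: `fetch_stub` is a hypothetical stand-in for the HTTP request (it only sleeps), and `run_bounded` and its `limit` parameter are illustrative names.

```python
import asyncio


async def fetch_stub(url):
    # Hypothetical stand-in for `await client.get(url)`; it only sleeps.
    await asyncio.sleep(0.01)
    return url.upper()


async def run_bounded(urls, limit=5):
    semaphore = asyncio.Semaphore(limit)  # at most `limit` fetches at once
    results = []
    active = 0  # fetches currently in flight
    peak = 0    # highest value `active` ever reached

    async def worker(url):
        nonlocal active, peak
        async with semaphore:
            active += 1
            peak = max(peak, active)
            try:
                results.append(await fetch_stub(url))
            finally:
                active -= 1

    await asyncio.gather(*(worker(u) for u in urls))
    return results, peak


results, peak = asyncio.run(run_bounded([f"u{i}" for i in range(20)], limit=5))
print(len(results), "results, peak concurrency", peak)
```

In a notebook, where an event loop is already running, `asyncio.run` would need `nest_asyncio.apply()` as in the answer, or the coroutine can be awaited directly in a cell with `await run_bounded(...)`. A lower limit is also gentler on the target site than 100 simultaneous workers.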


You will need to install the following (install or upgrade; just run these commands one by one in a Colab notebook):

pip install -U asyncio
pip install -U nest-asyncio
pip install -U httpx
pip install -U bs4
pip install -U pandas

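I also imported re, in case you want to pick out the individual bits of information one by one (jurisdiction, tradition, address, website, etc.), each in a try/except block to account for missing information, and extend the list/dataframe accordingly. All of the packages above can be found at https://pypi.org/ and are documented.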
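As a rough illustration of that last idea, the sketch below splits one "Info" string into labelled fields with `re`. The helper name `parse_info` is hypothetical, the sample string is an abbreviated copy of the Winnipeg row from the output above, and the `Label: value` pattern is an assumption based on that output rather than a guarantee about every page. Instead of a try/except per field, it collects entries without a label into a separate list.

```python
import re

# Assumed field shape: "Label: value" or "Label:value" (taken from the sample output).
FIELD_RE = re.compile(r"([^:]+?)\s*:\s*(.+)")


def parse_info(info):
    """Split a ' | '-joined Info string into (fields dict, unlabelled notes)."""
    fields, notes = {}, []
    for part in info.split(" | "):
        part = part.strip()
        if not part:
            continue
        match = FIELD_RE.match(part)
        if match:
            # The key stops at the first colon, so URLs in the value stay intact.
            fields[match.group(1)] = match.group(2)
        else:
            notes.append(part)  # e.g. "Immediately Subject to the Holy See"
    return fields, notes


sample = (
    "Type of Jurisdiction: Archdiocese | Erected:4 December1915 | "
    "Immediately Subject to the Holy See | Rite: Latin (or Roman) | "
    "Country:Canada | Official Web Site:http://www.archwinnipeg.ca/"
)
fields, notes = parse_info(sample)
print(fields.get("Country"), fields.get("Official Web Site"), notes)
```

Using `fields.get(...)` returns None for labels a diocese does not have, so the resulting dictionaries can be passed to `pd.DataFrame` and missing columns simply come out empty.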


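  • Hello - many thanks for the great solution. I am overwhelmed. I don't have a Pro account on Colab, so I am trying to run everything on my local machine. That means I need to set up the local machine from scratch, with everything that is required. By the way, I have two options here: a. a Linux notebook and b. a Windows machine with Anaconda. Which one should I choose? Many thanks again - I love your solution, it's awesome (2 upvotes)
  • You're welcome, @thannen. If my answer solved your problem, please don't forget to mark it as accepted (the green check mark under the voting buttons). I would recommend the operating system you are most familiar with, and then creating a virtual machine with Linux where you can experiment with Python. (2 upvotes)
  • Hello dear @barry the Platipus - many thanks for everything you did. It's awesome and you helped me a lot. I am very happy. By the way, what is the difference between the scripts, given that we only get 2.6K results? The resulting script is a bit simpler and is able to run on Colab. I am just a regular Colab user with a limited set of tools and plugins there, so I would be glad if I could run it in Colab too. Looking forward to hearing from you - regards (2 upvotes)
  • Hello dear Barry the Platipus: I get an error when running your code. This is what I got: ModuleNotFoundError Traceback (most recent call last) <ipython-input-1-64bb145c85bf> in <module> ----> 1 from httpx import Client, AsyncClient, Limits 2 from bs4 import BeautifulSoup as bs 3 import pandas as pd 4 import re 5 from datetime import datetime ModuleNotFoundError: No module named 'httpx' (2 upvotes)
  • Hello dear Barry, I am getting an error. This is what I got when running it in Colab: --------------------------------------------------------------------------- ModuleNotFoundError Traceback (most recent call last) <ipython-input-1-64bb145c85bf> in <module> ----> 1 from httpx import Client, AsyncClient, Limits 2 from bs4 import BeautifulSoup as bs 3 import pandas as pd 4 import re 5 from datetime import datetime ModuleNotFoundError: No module named 'httpx' (2 upvotes)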