Scraping web articles from WSJ using BeautifulSoup in Python 3.7?

Piy*_*iya 2 python beautifulsoup web-scraping

I am trying to scrape articles from The Wall Street Journal using BeautifulSoup in Python. The code runs without any errors (exit code 0) but produces no results. I don't understand what is happening, or why this code does not give the expected output.

I have even paid for a subscription.

I know something is wrong, but I can't find the problem.

import time
import requests
from bs4 import BeautifulSoup

url = 'https://www.wsj.com/search/term.html?KEYWORDS=cybersecurity&min-date=2018/04/01&max-date=2019/03/31' \
  '&isAdvanced=true&daysback=90d&andor=AND&sort=date-desc&source=wsjarticle,wsjpro&page={}'

pages = 32
for page in range(1, pages+1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".items.hedSumm li > a"):
        resp = requests.get(item.get("href"))
        sauce = BeautifulSoup(resp.text, "lxml")
        date = sauce.select("time.timestamp.article__timestamp.flexbox__flex--1")
        date = date[0].text
        tag = sauce.select("li.article-breadCrumb span").text
        title = sauce.select_one("h1.wsj-article-headline").text
        content = [elem.text for elem in sauce.select("p.article-content")]
        print(f'{date}\n {tag}\n {title}\n {content}\n')

        time.sleep(3)

As written in the code, I am trying to scrape the date, title, tag, and content of every article. It would be very helpful to get advice on my mistakes and on what I should do to get the desired result.

bha*_*atk 5

Replace this line in your code:

resp = requests.get(item.get("href"))

with:

_href = item.get("href")

try:
    resp = requests.get(_href)
except Exception as e:
    try:
        resp = requests.get("https://www.wsj.com"+_href)
    except Exception as e:
        continue

This is because most of the item.get("href") values do not give a full website URL; for example, you get URLs like these:

/news/types/national-security
/public/page/news-financial-markets-stock.html
https://www.wsj.com/news/world

Only https://www.wsj.com/news/world is a valid website URL, so you need to concatenate the base URL with _href.
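As an aside, instead of the try/except fallback you could use the standard library's `urllib.parse.urljoin`, which handles both relative paths and already-absolute URLs in one call (this is an alternative sketch, not part of the original answer):

```python
from urllib.parse import urljoin

base = "https://www.wsj.com"

# A relative path is joined onto the base URL.
print(urljoin(base, "/news/types/national-security"))
# https://www.wsj.com/news/types/national-security

# An absolute URL passes through unchanged.
print(urljoin(base, "https://www.wsj.com/news/world"))
# https://www.wsj.com/news/world
```

Inside the loop this would become `resp = requests.get(urljoin("https://www.wsj.com", _href))`, with no nested try/except needed.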

Update

import time
import requests
from bs4 import BeautifulSoup
from bs4.element import Tag

url = 'https://www.wsj.com/search/term.html?KEYWORDS=cybersecurity&min-date=2018/04/01&max-date=2019/03/31' \
  '&isAdvanced=true&daysback=90d&andor=AND&sort=date-desc&source=wsjarticle,wsjpro&page={}'

pages = 32

for page in range(1, pages+1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text,"lxml")

    for item in soup.find_all("a",{"class":"headline-image"},href=True):
        _href = item.get("href")
        try:
            resp = requests.get(_href)
        except Exception as e:
            try:
                resp = requests.get("https://www.wsj.com"+_href)
            except Exception as e:
                continue

        sauce = BeautifulSoup(resp.text,"lxml")
        dateTag = sauce.find("time",{"class":"timestamp article__timestamp flexbox__flex--1"})
        tag = sauce.find("li",{"class":"article-breadCrumb"})
        titleTag = sauce.find("h1",{"class":"wsj-article-headline"})
        contentTag = sauce.find("div",{"class":"wsj-snippet-body"})

        date = None
        tagName = None
        title = None
        content = None

        if isinstance(dateTag,Tag):
            date = dateTag.get_text().strip()

        if isinstance(tag,Tag):
            tagName = tag.get_text().strip()

        if isinstance(titleTag,Tag):
            title = titleTag.get_text().strip()

        if isinstance(contentTag,Tag):
            content = contentTag.get_text().strip()

        print(f'{date}\n {tagName}\n {title}\n {content}\n')
        time.sleep(3)
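The repeated `isinstance(..., Tag)` checks in the updated code guard against `find` returning `None` when WSJ changes a class name. They could be factored into a small helper; `text_or_none` below is a hypothetical name, not part of the original answer:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

def text_or_none(node):
    """Return stripped text when BeautifulSoup found a tag, else None."""
    return node.get_text().strip() if isinstance(node, Tag) else None

# Minimal demonstration with an inline HTML snippet.
sauce = BeautifulSoup("<h1 class='wsj-article-headline'> Title </h1>", "html.parser")
title = text_or_none(sauce.find("h1", {"class": "wsj-article-headline"}))
missing = text_or_none(sauce.find("time", {"class": "timestamp"}))
print(title, missing)  # Title None
```

Each of the four `dateTag`/`tag`/`titleTag`/`contentTag` blocks in the answer then collapses to a single `text_or_none(...)` call.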

Output:

March 31, 2019 10:00 a.m. ET
 Tech
 Care.com Removes Tens of Thousands of Unverified Listings
 The online child-care marketplace Care.com scrubbed its site of tens of thousands of unverified day-care center listings just before a Wall Street Journal investigation published March 8, an analysis shows. Care.com, the largest site in the U.S. for finding caregivers, removed about 72% of day-care centers, or about 46,594 businesses, listed on its site, a Journal review of the website shows. Those businesses were listed on the site as recently as March 1....

Updated March 29, 2019 6:08 p.m. ET
 Politics
 FBI, Retooling Once Again, Sets Sights on Expanding Cyber Threats
 The FBI has launched its biggest transformation since the 2001 terror attacks to retrain and refocus special agents to combat cyber criminals, whose threats to lives, property and critical infrastructure have outstripped U.S. efforts to thwart them. The push comes as federal investigators grapple with an expanding range of cyber attacks sponsored by foreign adversaries against businesses or national interests, including Russian election interference and Chinese cyber thefts from American companies, senior bureau executives...