Scraping multiple pages with an unchanging URL using BeautifulSoup

Question

Sha*_*lam 3 python beautifulsoup web-scraping python-3.x infinite-scroll

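I'm using Beautiful Soup to extract data from a non-English website. Right now my code only extracts the first ten results of a keyword search. The site is designed so that further results are reached through a "More" button (a bit like infinite scroll, except that you have to keep clicking "More" to get the next set of results). When I click "More", the URL doesn't change, so I can't just iterate over a different URL each time.

I'd really like some help with two things.

Modifying the code below so that I can get data from all pages, not just the first 10 results
Inserting a timer function so that the server doesn't block me

I'm adding a photo of what the "More" button looks like, since it isn't in English. It appears in blue text at the end of the page.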

import requests, csv, os
from bs4 import BeautifulSoup
from time import strftime, sleep

# make a GET request (requests.get("URL")) and store the response in a response object (req)
responsePA = requests.get('https://www.prothomalo.com/search?q=%E0%A6%A7%E0%A6%B0%E0%A7%8D%E0%A6%B7%E0%A6%A3')

# read the content of the server’s response
rawPagePA = responsePA.text

soupPA = BeautifulSoup(rawPagePA, "html.parser")  # name the parser explicitly to avoid the bs4 warning
# take a look
print (soupPA.prettify())

urlsPA = [] #creating empty list to store URLs
for item in soupPA.find_all("div", class_="customStoryCard9-m__story-data__2qgWb"):  # loop over every search-result card
    aTag = item.find("a")  # first <a> tag inside the card
    if aTag is not None and aTag.has_attr("href"):  # skip cards without a link
        urlsPA.append(aTag["href"])

print(urlsPA) 

#Below I'm getting the data from each of the urls and storing them in a list
PAlist=[]
for link in urlsPA:
    specificpagePA=requests.get(link) #making a get request and stores the response in an object
    rawAddPagePA=specificpagePA.text # read the content of the server’s response
    PASoup2=BeautifulSoup(rawAddPagePA, "html.parser") # parse the response into an HTML tree
    PAcontent=PASoup2.find_all(class_=["story-element story-element-text", "time-social-share-wrapper storyPageMetaData-m__time-social-share-wrapper__2-RAX", "headline headline-type-9 story-headline bn-story-headline headline-m__headline__3vaq9 headline-m__headline-type-9__3gT8S", "contributor-name contributor-m__contributor-name__1-593"]) 
    #print(PAcontent)
    PAlist.append(PAcontent)

Answer 1

Sir*_*rst 6

You don't actually need Selenium.

The button sends the following GET request:

https://www.prothomalo.com/api/v1/advanced-search?fields=headline,subheadline,slug,url,hero-image-s3-key,hero-image-caption,hero-image-metadata,first-published-at,last-published-at,alternative,published-at,authors,author-name,author-id,sections,story-template,metadata,tags,cards&offset=10&limit=6&q=ধর্ষণ

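The important part is the "offset=10&limit=6" at the end; each subsequent click of the button just increases that offset by 6.

Getting "data from all pages" won't work, because there seem to be a great many results and I don't see an option to find out how many. So you're better off picking a number and requesting until you have that many links.

Since this request returns JSON, you're better off simply parsing that instead of feeding HTML to BeautifulSoup.

Take a look at this: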

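import requests

s = requests.Session()
term = 'ধর্ষণ'  # search term in plain Unicode; requests URL-encodes it
count = 20  # number of article URLs to retrieve

# Make the GET request; params become the query string after the "?"
r = s.get(
    'https://www.prothomalo.com/api/v1/advanced-search',
    params={
        'offset': 0,
        'limit': count,
        'q': term
    },
    timeout=30
)
r.raise_for_status()  # stop early on an HTTP error

# Parse the response body (JSON)
info = r.json()

# Collect the URL of each returned item
urls = [item['url'] for item in info['items']]

print(urls)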

This returns the following list:

['https://www.prothomalo.com/world/asia/??????????-?????????-?????-????-???????-??????-??????-??????????', 'https://www.prothomalo.com/bangladesh/district/?????-??????-???-????-?????-???????????-????????-????????-?????', 'https://www.prothomalo.com/bangladesh/district/?????????-?????-?-?????-??????-??????-?????????-????????', 'https://www.prothomalo.com/bangladesh/district/????????-?????-??????-????-?????????', 'https://www.prothomalo.com/bangladesh/?????????-??-?????-???', 'https://www.prothomalo.com/bangladesh/district/??-?????-??????-?????-??????-????-?????????', 'https://www.prothomalo.com/bangladesh/district/????-???????-?????-????-??????-???-???-????????-?????-?????????-?', 'https://www.prothomalo.com/bangladesh/district/???????-???-??????-?????-??????-???????-???????????-??????-?????????', 'https://www.prothomalo.com/bangladesh/district/??????-??????-?????-??????-?????-???????-?????-?????', 'https://www.prothomalo.com/world/india/?????-????-?????????-???-??????-??????-?????????-???????-????', 'https://www.prothomalo.com/bangladesh/district/?????????-?????-??????-????????-????-??????????-???????', 'https://www.prothomalo.com/bangladesh/district/?????-?????-??????-????????-????-?????????', 'https://www.prothomalo.com/bangladesh/district/??????????-????????-?????-?????-????-?-???????-???????-?????????-?', 'https://www.prothomalo.com/bangladesh/district/?????-?????-????-???????-???-???-?????????????-?????', 'https://www.prothomalo.com/opinion/column/?????-??????-??????????-???????', 'https://www.prothomalo.com/world/asia/?????????????-?????-?????-??????????????-?????-?????????', 'https://www.prothomalo.com/bangladesh/district/?????-???????-???-???-????????-?????-??????-???', 'https://www.prothomalo.com/bangladesh/district/?????-?????????-????-???-?????-?????-?????-??????-?????', 'https://www.prothomalo.com/bangladesh/district/????????-??-?????-?????-?-?????-????-?????????-?', 
'https://www.prothomalo.com/bangladesh/district/?????-??????-??????-?????????????-?????????']

By adjusting count you can set the number of URLs (articles) to retrieve; term is the search term.
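
If one large request is not accepted, or you prefer smaller batches, the offset and limit parameters described above can be stepped in a loop instead. The following is only a minimal sketch: fetch_page and collect_urls are hypothetical helper names, the batch size of 6 mirrors what the button sends, and it assumes the response keeps the {'items': [{'url': ...}]} shape used above.

```python
from typing import Callable, Dict, List

SEARCH_API = 'https://www.prothomalo.com/api/v1/advanced-search'


def fetch_page(session, term: str, offset: int, limit: int) -> List[Dict]:
    """Request one batch of results; `session` is expected to be a requests.Session."""
    response = session.get(
        SEARCH_API,
        params={'offset': offset, 'limit': limit, 'q': term},
        timeout=30,
    )
    response.raise_for_status()
    # Assumption: the payload looks like {'items': [{'url': ...}, ...]}.
    return response.json().get('items', [])


def collect_urls(get_batch: Callable[[int, int], List[Dict]],
                 wanted: int, batch_size: int = 6) -> List[str]:
    """Step the offset until `wanted` URLs are collected or the results run out."""
    urls: List[str] = []
    offset = 0
    while len(urls) < wanted:
        items = get_batch(offset, batch_size)
        if not items:          # an empty batch means there is nothing more to fetch
            break
        urls.extend(item['url'] for item in items)
        offset += len(items)   # advance by what was actually returned
    return urls[:wanted]
```

With a session s and a search term as in the script above, this could be called as collect_urls(lambda off, lim: fetch_page(s, term, off, lim), wanted=20). Keeping the batch function separate from the loop makes it easy to add a pause between batches.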

The requests.Session object is used to keep cookies consistent between requests.

If you have any questions, feel free to ask.

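Edit:

In case you're wondering how I found out which GET request is sent when the button is clicked: I opened the Network analysis tab in the developer tools of my browser (Firefox), clicked the button, watched which requests were being sent, and copied that URL.

A further note on the params argument of the .get function: it contains (as a Python dictionary) all the parameters that would normally be appended to the URL after the question mark. So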

requests.get('https://www.prothomalo.com/search?q=%E0%A6%A7%E0%A6%B0%E0%A7%8D%E0%A6%B7%E0%A6%A3')
can be written as

requests.get('https://www.prothomalo.com/search', params={'q': 'ধর্ষণ'})
That looks nicer, and you can actually see what you're searching for, because the term is written in Unicode and has not yet been encoded for the URL.
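
To see that the two forms carry the same search term, the encoding can be reproduced by hand. This small sketch uses the standard library's urllib.parse rather than requests itself, on the assumption that requests builds an equivalent query string from params; the encoded term is copied from the URL in the question.

```python
from urllib.parse import unquote, urlencode

# Percent-encoded search term copied from the URL in the question.
encoded_term = '%E0%A6%A7%E0%A6%B0%E0%A7%8D%E0%A6%B7%E0%A6%A3'

# Decode it back to the readable Unicode search term.
term = unquote(encoded_term)

# Encode a parameter dictionary into a query string, as done for params.
query = urlencode({'q': term})
url = 'https://www.prothomalo.com/search?' + query

print(url)
```

The printed URL should be the same one the question passes to requests.get as a single string.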

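Edit:
If the script starts returning an empty JSON file, and therefore no URLs, you may have to set a user agent like this (I used the one from Firefox, but any browser's should be fine):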

s.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) '
                  'Gecko/20100101 Firefox/87.0'
})
Just put that code below the s = ... line, where the session object is initialized.
A user agent tells a website what kind of program is accessing its data.

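Always remember that the server has other things to do, and the website has priorities other than sending thousands of search results to a single person, so try to keep your traffic low. Scraping 5000 URLs is a lot; if you really have to do that more than once, put a sleep(...) of at least a few seconds somewhere before you make the next request (not just to avoid being blocked, but to be considerate toward the people providing the information you are asking for).
It doesn't really matter where you put the sleep, since the only time you actually contact the server is the s.get(...) line.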
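
The same pause applies when fetching the individual article pages afterwards. Here is a rough sketch of such a loop; fetch_politely is a hypothetical helper name, the 2-second default is only an example value, and session is assumed to be a requests.Session like s above.

```python
import time


def fetch_politely(session, links, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each link in turn, pausing between consecutive requests."""
    pages = []
    for index, link in enumerate(links):
        if index > 0:
            sleep(delay_seconds)   # wait before every request except the first
        response = session.get(link, timeout=30)
        response.raise_for_status()
        pages.append(response.text)
    return pages
```

Calling fetch_politely(s, urls) would return the raw HTML of each article, which can then be handed to BeautifulSoup as in the question. The sleep argument exists only so the delay can be swapped out when trying the function without waiting.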
