I need to scrape and pull the raw text of the body paragraphs from a large number (5-10k per day) of news articles. I've written some threaded code, but given how heavily I/O-bound this project is, I'm dipping my toes into asyncio. The snippet below is no faster than a single-threaded version, and much slower than my threaded version. Can anyone tell me what I'm doing wrong? Thanks!
import aiohttp
from bs4 import BeautifulSoup
from unicodedata import normalize

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def scrape_urls(urls):
    results = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            # Download each article, then parse out the body paragraphs.
            html = await fetch(session, url)
            soup = BeautifulSoup(html, 'html.parser')
            body = soup.find('div', attrs={'class': 'entry-content'})
            paras = [normalize('NFKD', para.get_text()) for para in body.find_all('p')]
            results.append(paras)
    return results
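If I understand the model right, the speedup is supposed to come from having all the downloads in flight at once, so I suspect the fix involves scheduling the fetches together rather than awaiting each one in turn. Here's a sketch of what I imagine that looks like with asyncio.gather (reusing fetch from above; parse is just my BeautifulSoup logic factored out into a plain function), though I'm not sure this is the right pattern:

import asyncio
import aiohttp
from bs4 import BeautifulSoup
from unicodedata import normalize

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

def parse(html):
    # Same parsing as above: every <p> inside the entry-content div.
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', attrs={'class': 'entry-content'})
    return [normalize('NFKD', para.get_text()) for para in body.find_all('p')]

async def scrape_urls(urls):
    async with aiohttp.ClientSession() as session:
        # Build all the fetch coroutines up front and let gather run them concurrently.
        htmls = await asyncio.gather(*(fetch(session, url) for url in urls))
    return [parse(html) for html in htmls]

asyncio.run(scrape_urls(urls)) would drive this; with 5-10k URLs I assume I'd also need to cap concurrency somehow (e.g. a semaphore), but I haven't gotten that far.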