Ser*_*ero 5 screen-scraping python-requests dask dask-delayed
I like the simplicity of dask and I like using it to scrape my local supermarket. My multiprocessing.cpu_count() is 4, but this code only achieves a 2x speedup. Why?
from bs4 import BeautifulSoup
import dask, requests, time
import pandas as pd
base_url = 'https://www.lider.cl/supermercado/category/Despensa/?No={}&isNavRequest=Yes&Nrpp=40&page={}'
def scrape(id):
    # one task = one results page: fetch it and parse with BeautifulSoup
    page = id + 1
    start = 40 * page
    bs = BeautifulSoup(requests.get(base_url.format(start, page)).text, 'lxml')
    prods = [prod.text for prod in bs.find_all('span', attrs={'class': 'product-description js-ellipsis'})]
    brands = [b.text for b in bs.find_all('span', attrs={'class': 'product-name'})]
    sdf = pd.DataFrame({'product': prods, 'brand': brands})
    return sdf
data = [dask.delayed(scrape)(id) for id in range(10)]  # build 10 lazy scrape tasks
df = dask.delayed(pd.concat)(data)                     # the concat itself is a single delayed task
df = df.compute()                                      # run the graph on the default (threaded) scheduler
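For reference, a minimal way to quantify the reported speedup is to time the dask graph against a plain serial loop. This is a sketch, not part of the original post; it reuses the scrape function and the imports defined above.

# serial baseline: call scrape() directly, one page after another
t0 = time.time()
serial_df = pd.concat([scrape(i) for i in range(10)])
print('serial :', time.time() - t0)

# dask version: the same ten pages as delayed tasks plus a delayed concat
t0 = time.time()
tasks = [dask.delayed(scrape)(i) for i in range(10)]
dask_df = dask.delayed(pd.concat)(tasks).compute()
print('delayed:', time.time() - t0)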
First, a 2x speedup - hooray!
You will want to start by reading http://dask.pydata.org/en/latest/setup/single-machine.html
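That page covers dask's single-machine schedulers. As a rough sketch (assuming a dask release where compute() accepts a scheduler= keyword; older versions used get= instead), you can rerun the same graph under each scheduler to see which one this workload benefits from, reusing the data list built in the question:

graph = dask.delayed(pd.concat)(data)                  # same graph as in the question

df_threads = graph.compute(scheduler='threads')        # default for dask.delayed; GIL-bound Python code limits it
df_processes = graph.compute(scheduler='processes')    # separate processes avoid the GIL, but results are pickled back
df_serial = graph.compute(scheduler='synchronous')     # single-threaded baseline, useful for timing and debugging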
In short, the following three things may be important here:
- The concat operation happens in a single task, so it cannot be parallelized, and for some data types it may account for a large fraction of the total time. You are also pulling all of the final data into the client process with the .compute().
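To make that last point concrete, one alternative (a sketch reusing the data list from the question, not something prescribed by the answer) is to compute the per-page DataFrames in parallel and concatenate them directly in the client instead of wrapping pd.concat in dask.delayed. The concatenation is still serial, but it no longer sits as a single task at the end of the graph, and it makes explicit that the final data ends up in the client process anyway.

# run the ten scrape tasks in parallel; returns a tuple of pandas DataFrames
parts = dask.compute(*data)
# concatenate in the client process (serial, but usually cheap for small frames)
df = pd.concat(parts, ignore_index=True)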