How can I scrape faster

Tek*_*ath 17 python mongodb web-scraping pymongo

The job here is to scrape an API from a site, from https://xxx.xxx.xxx/xxx/1.json up to https://xxx.xxx.xxx/xxx/1417749.json, and write each response exactly as-is to MongoDB. For that I have the following code:

import json
import time

import pymongo
import requests

client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
db = client["thread1"]
com = db["threadcol"]
start_time = time.time()
write_log = open("logging.log", "a")
min = 1
max = 1417749
for n in range(min, max):
    response = requests.get("https://xx.xxx.xxx/{}.json".format(str(n)))
    if response.status_code == 200:
        parsed = json.loads(response.text)
        inserted = com.insert_one(parsed)
        write_log.write(str(n) + "\t" + str(inserted) + "\n")
        print(str(n) + "\t" + str(inserted) + "\n")
write_log.close()

But it takes a huge amount of time to complete this task. The question is: how can I speed this process up?

kei*_*fly 10

There are a couple of things you can do:

  1. Reuse the connection. According to the benchmark below, it is about 3 times faster (see the Session sketch after this list).
  2. You can scrape with multiple processes in parallel.
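
For point 1, here is a minimal sketch of the question's loop rewritten to reuse a single connection via requests.Session (same placeholder URL and com collection as in the question):

import requests

session = requests.Session()  # keeps the underlying TCP/TLS connection open across requests

for n in range(1, 1417749):
    response = session.get("https://xx.xxx.xxx/{}.json".format(n))
    if response.status_code == 200:
        com.insert_one(response.json())  # response.json() is equivalent to json.loads(response.text)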

Parallel code from here:

import sys
from queue import Queue  # the "Queue" module was renamed "queue" in Python 3
from threading import Thread
import requests

concurrent = 200  # number of worker threads
q = Queue(concurrent * 2)

def doWork():
    while True:
        url = q.get()
        try:
            requests.get(url)  # fetch the URL; process/store the response here
        finally:
            q.task_done()

for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()
try:
    for url in open('urllist.txt'):
        q.put(url.strip())
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
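The snippet above parallelizes with threads. For separate processes (point 2), here is a minimal sketch using multiprocessing.Pool, with a hypothetical fetch helper that downloads one document:

from multiprocessing import Pool
import requests

def fetch(n):
    # hypothetical helper: download one JSON document by its id
    response = requests.get("https://xx.xxx.xxx/{}.json".format(n))
    return n, response.json() if response.status_code == 200 else None

if __name__ == "__main__":
    with Pool(processes=8) as pool:  # 8 worker processes, each its own interpreter
        for n, doc in pool.imap_unordered(fetch, range(1, 1417749)):
            if doc is not None:
                print(n)  # insert into MongoDB here, e.g. com.insert_one(doc)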

Timings for a reusable connection, from this question:

>>> timeit.timeit('_ = requests.get("https://www.wikipedia.org")', 'import requests', number=100)
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
...
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
Starting new HTTPS connection (1): www.wikipedia.org
52.74904417991638
>>> timeit.timeit('_ = session.get("https://www.wikipedia.org")', 'import requests; session = requests.Session()', number=100)
Starting new HTTPS connection (1): www.wikipedia.org
15.770191192626953


alb*_*tro 6

You can improve your code on two aspects:

  • Use a Session, so that a new connection is not set up on every request and the connection is kept open;

  • Use parallelism in your code with asyncio.

Have a look here: https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html

  • Could you add more details? (2)

小智 5

asyncio is also a solution if you don't want to use multithreading:

import time
import pymongo
import json
import asyncio
from aiohttp import ClientSession


async def get_url(url, session):
    async with session.get(url) as response:
        if response.status == 200:
            return await response.text()


async def create_task(sem, url, session):
    async with sem:
        response = await get_url(url, session)
        if response:
            parsed = json.loads(response)
            n = url.rsplit('/', 1)[1]  # e.g. "5.json" - the id part of the URL
            inserted = com.insert_one(parsed)  # pymongo's insert_one is blocking (see the note below)
            write_log.write(str(n) + "\t" + str(inserted) + "\n")
            print(str(n) + "\t" + str(inserted) + "\n")


async def run(minimum, maximum):
    url = 'https://xx.xxx.xxx/{}.json'
    tasks = []
    sem = asyncio.Semaphore(1000)   # Maximize the concurrent sessions to 1000, stay below the max open sockets allowed
    async with ClientSession() as session:
        for n in range(minimum, maximum):
            task = asyncio.ensure_future(create_task(sem, url.format(n), session))
            tasks.append(task)
        responses = asyncio.gather(*tasks)
        await responses


client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
db = client["thread1"]
com = db["threadcol"]
start_time = time.time()
write_log = open("logging.log", "a")
min_item = 1
max_item = 100

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(min_item, max_item))
loop.run_until_complete(future)
write_log.close()
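One caveat about the code above: pymongo's insert_one is a blocking call, so each insert stalls the event loop while it runs. A minimal sketch of offloading the insert to the default thread pool with run_in_executor (same com collection as above):

import asyncio

async def insert_doc(parsed):
    # run the blocking pymongo call in a worker thread instead of on the event loop
    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(None, com.insert_one, parsed)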