Is it possible to keep spaCy in memory to reduce load time?

Lui*_*uez 5 python nlp spacy

I want to use spaCy for NLP in an online service. Each time a user makes a request, I call the script "my_script.py",

which starts with:

from spacy.en import English
nlp = English()

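The problem is that these two lines take more than 10 seconds to run. Is it possible to keep English() in RAM, or is there some other option, to bring this load time down to less than a second?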

Mat*_*ipp 6

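You said that you want to launch a freestanding script (my_script.py) whenever a request comes in, and have it use capabilities from spacy.en without the overhead of loading spacy.en. With this approach, the operating system will always create a new process when your script is launched. So there is only one way to avoid loading spacy.en each time: have a separate process that is already running with spacy.en loaded, and have your script communicate with that process. The code below shows one way to do that. However, as others have said, you would probably benefit from changing your server architecture so that spacy.en is loaded within your web server (e.g., by using a Python-based web server).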

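The most common form of inter-process communication is via TCP/IP sockets. The code below implements a small server that keeps spacy.en loaded and processes requests from the client. It also has a client, which sends requests to that server and receives the results back. It is up to you to decide what to put into those transmissions.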

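There is also a third script. Since both the client and the server need send and receive functions, those functions live in a shared script called comm.py. (Note that the client and the server each load a separate copy of comm.py; they do not communicate through a single module loaded into shared memory.)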

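I assume that both scripts run on the same machine. If not, you will need to put a copy of comm.py on both machines and change comm.server_host to the machine name or IP address of the server.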

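Run nlp_server.py as a background process (or just in a separate terminal window for testing). It waits for requests, processes them, and sends the results back: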

import comm
import socket
from spacy.en import English
nlp = English()

def process_connection(sock):
    print "processing transmission from client..."
    # receive data from the client
    data = comm.receive_data(sock)
    # do something with the data
    result = {"data received": data}
    # send the result back to the client
    comm.send_data(result, sock)
    # close the socket with this particular client
    sock.close()
    print "finished processing transmission from client..."

server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# open socket even if it was used recently (e.g., server restart)
server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server_sock.bind((comm.server_host, comm.server_port))
# queue up to 5 connections
server_sock.listen(5)
print "listening on port {}...".format(comm.server_port)
try:
    while True:
        # accept connections from clients
        (client_sock, address) = server_sock.accept()
        # process this connection 
        # (this could be launched in a separate thread or process)
        process_connection(client_sock)
except KeyboardInterrupt:
    print "Server process terminated."
finally:
    server_sock.close()
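
The comment in the accept loop notes that each connection could be handled in a separate thread. A minimal sketch of that idea follows, written for Python 3 (the listing above is Python 2). The handler here only upper-cases the request as a stand-in for real work with the loaded nlp object, and the names handle_client, serve and demo_threaded_server are hypothetical, not part of the original answer.

```python
import socket
import threading


def handle_client(client_sock):
    """Stand-in for process_connection(): reply in upper case, then close."""
    with client_sock:
        request = client_sock.recv(4096)
        client_sock.sendall(request.upper())


def serve(server_sock, stop_event):
    """Accept connections until stop_event is set, one thread per client."""
    server_sock.settimeout(0.2)  # wake up regularly to check stop_event
    while not stop_event.is_set():
        try:
            client_sock, _address = server_sock.accept()
        except socket.timeout:
            continue
        client_sock.settimeout(None)  # the handler uses blocking I/O
        worker = threading.Thread(target=handle_client, args=(client_sock,))
        worker.daemon = True
        worker.start()


def demo_threaded_server():
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server_sock.listen(5)
    port = server_sock.getsockname()[1]
    stop_event = threading.Event()
    acceptor = threading.Thread(target=serve, args=(server_sock, stop_event))
    acceptor.start()
    try:
        with socket.create_connection(("127.0.0.1", port)) as client:
            client.sendall(b"hello")
            reply = client.recv(4096)
    finally:
        stop_event.set()
        acceptor.join()
        server_sock.close()
    return reply
```

With one thread per connection, a slow request no longer blocks the accept loop, although all threads still share the single loaded model.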

Run my_script.py as a quick-running script that requests a result from the nlp server (e.g., python my_script.py here are some arguments):

import socket, sys
import comm

# data can be whatever you want (even just sys.argv)
data = sys.argv

print "sending to server:"
print data

# send data to the server and receive a result
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# disable Nagle algorithm (probably only needed over a network) 
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, True)
sock.connect((comm.server_host, comm.server_port))
comm.send_data(data, sock)
result = comm.receive_data(sock)
sock.close()

# do something with the result...
print "result from server:"
print result

comm.py contains the code used by both the client and the server:

import sys, struct
import cPickle as pickle

# pick a port that is not used by any other process
server_port = 17001
server_host = '127.0.0.1' # localhost
message_size = 8192
# code to use with struct.pack to convert transmission size (int) 
# to a byte string
header_pack_code = '>I'
# number of bytes used to represent size of each transmission
# (corresponds to header_pack_code)
header_size = 4  

def send_data(data_object, sock):
    # serialize the data so it can be sent through a socket
    data_string = pickle.dumps(data_object, -1)
    data_len = len(data_string)
    # send a header showing the length, packed into 4 bytes
    sock.sendall(struct.pack(header_pack_code, data_len))
    # send the data
    sock.sendall(data_string)

def receive_data(sock):
    """ Receive a transmission via a socket, and convert it back into a binary object. """
    # This runs as a loop because the message may be broken into arbitrary-size chunks.
    # This assumes each transmission starts with a 4-byte binary header showing the size of the transmission.
    # See https://docs.python.org/3/howto/sockets.html
    # and http://code.activestate.com/recipes/408859-socketrecv-three-ways-to-turn-it-into-recvall/

    header_data = ''
    header_done = False
    # set dummy values to start the loop
    received_len = 0
    transmission_size = sys.maxint

    while received_len < transmission_size:
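        sock_data = sock.recv(message_size)
        if not sock_data:
            # recv() returns an empty string once the peer has closed the
            # connection; without this check the loop would never end
            raise RuntimeError("socket closed before the full transmission arrived")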
        if not header_done:
            # still receiving header info
            header_data += sock_data
            if len(header_data) >= header_size:
                header_done = True
                # split the already-received data between header and body
                messages = [header_data[header_size:]]
                received_len = len(messages[0])
                header_data = header_data[:header_size]
                # find actual size of transmission
                transmission_size = struct.unpack(header_pack_code, header_data)[0]
        else:
            # already receiving data
            received_len += len(sock_data)
            messages.append(sock_data)

    # combine messages into a single string
    data_string = ''.join(messages)
    # convert to an object
    data_object = pickle.loads(data_string)
    return data_object
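
The length-prefixed framing used by comm.py can be tried out without starting a server. The sketch below is a Python 3 restatement of the same idea (a 4-byte big-endian size header followed by a pickled payload), exercised over socket.socketpair(). The helper names recv_exactly, send_message, receive_message and round_trip are hypothetical. As with comm.py, pickle should only be used between processes you trust, because unpickling can execute arbitrary code.

```python
import pickle
import socket
import struct

HEADER = struct.Struct(">I")  # 4-byte big-endian unsigned length, as in comm.py


def recv_exactly(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(min(remaining, 8192))
        if not chunk:
            raise ConnectionError("socket closed before full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def send_message(sock, obj):
    """Serialize obj and send it with a length header in front."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    sock.sendall(HEADER.pack(len(payload)) + payload)


def receive_message(sock):
    """Read the header, then exactly as many bytes as it announces."""
    (size,) = HEADER.unpack(recv_exactly(sock, HEADER.size))
    return pickle.loads(recv_exactly(sock, size))


def round_trip(obj):
    """Send obj through a connected socket pair and read it back."""
    left, right = socket.socketpair()
    with left, right:
        send_message(left, obj)
        return receive_message(right)
```

Reading the header and the body with the same fixed-length helper avoids the separate header/body bookkeeping, at the cost of one extra recv() call per message.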

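Note: make sure the result sent back from the server uses only native data structures (dicts, lists, strings, etc.). If the result contains an object defined in spacy.en, the client will automatically import spacy.en when it unpacks the result, in order to provide the object's methods, and that import adds startup cost on the client side.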
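
One way to follow this advice is to copy the needed attributes out of each token into plain dicts on the server before sending. The sketch below assumes tokens expose orth_, lemma_ and pos_ string attributes, as spaCy tokens do; the function name tokens_to_plain is hypothetical, and the namedtuple is only a stand-in so the example runs without spaCy installed.

```python
from collections import namedtuple


def tokens_to_plain(doc):
    """Copy selected token attributes into plain dicts of strings."""
    return [
        {"text": token.orth_, "lemma": token.lemma_, "pos": token.pos_}
        for token in doc
    ]


# Stand-in for a parsed spaCy document (hypothetical data, not spaCy output).
FakeToken = namedtuple("FakeToken", ["orth_", "lemma_", "pos_"])
fake_doc = [FakeToken("Dogs", "dog", "NOUN"), FakeToken("bark", "bark", "VERB")]

plain = tokens_to_plain(fake_doc)
```

The resulting list contains only dicts and strings, so the client can unpack it without importing anything from spaCy.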

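This setup is very similar to the HTTP protocol (the server waits for connections, the client connects, the client sends a request, the server sends a response, and both sides disconnect). So you might be better off using a standard HTTP server and client instead of this custom code. That would be a "RESTful API", which is a popular term these days (with good reason). Using standard HTTP packages would save you the trouble of managing your own client/server code, and you might even be able to call your data-processing server directly from your existing web server instead of launching my_script.py. However, you would have to translate your requests into something compatible with HTTP, e.g., a GET or POST request, or perhaps just a specially formatted URL.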
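
A minimal sketch of the HTTP variant, using only the Python 3 standard library, might look like the following. The analyze function is a hypothetical stand-in for work done with the preloaded nlp object, and the JSON request shape ({"text": ...}) is an assumption made for this example rather than anything the original answer specifies.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def analyze(text):
    """Stand-in for processing with the already-loaded nlp object."""
    return {"tokens": text.split()}


class NLPHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length).decode("utf-8"))
        body = json.dumps(analyze(request["text"])).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def post_json(host, port, payload):
    """Send payload as a JSON POST request and decode the JSON reply."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("POST", "/", body=json.dumps(payload).encode("utf-8"),
                     headers={"Content-Type": "application/json"})
        return json.loads(conn.getresponse().read().decode("utf-8"))
    finally:
        conn.close()


def demo_http():
    server = HTTPServer(("127.0.0.1", 0), NLPHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    try:
        host, port = server.server_address[:2]
        return post_json(host, port, {"text": "here are some arguments"})
    finally:
        server.shutdown()
        server.server_close()
```

In a real deployment the model would be loaded once at module level in the server process, and my_script.py would shrink to the post_json call.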

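Another option is to use a standard inter-process communication package such as PyZMQ, redis, mpi4py, or perhaps zmq_object_exchanger. See this question for some ideas: Efficient Python IPC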
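
The packages listed above are third-party. As an illustration of the same idea with no extra dependencies, the sketch below uses multiprocessing.connection from the Python 3 standard library, which is not one of the packages the answer names. It handles pickling and message framing itself. The AUTHKEY value and the function names are hypothetical, and the server runs in a thread here only to keep the example self-contained.

```python
import threading
from multiprocessing.connection import Client, Listener

AUTHKEY = b"change-me"  # hypothetical shared secret for this example


def serve_one(listener):
    """Accept a single connection, read one request, send one reply."""
    with listener.accept() as conn:
        request = conn.recv()
        conn.send({"data received": request})


def demo_connection():
    # Port 0 asks the OS for a free port; listener.address reports it.
    with Listener(("127.0.0.1", 0), authkey=AUTHKEY) as listener:
        server = threading.Thread(target=serve_one, args=(listener,))
        server.start()
        with Client(listener.address, authkey=AUTHKEY) as conn:
            conn.send(["here", "are", "some", "arguments"])
            reply = conn.recv()
        server.join()
    return reply
```

Compared with the hand-written comm.py, send() and recv() here replace the header packing and the receive loop.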

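Or you may be able to save a copy of the spacy.en object on disk using the dill package (https://pypi.python.org/pypi/dill) and then restore it at the start of my_script.py. That may be faster than importing and rebuilding it every time, and simpler than using inter-process communication.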
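
The save-and-restore pattern can be sketched as follows, using the standard pickle module in place of dill so that the example runs without extra packages (dill provides dump and load functions modeled on pickle's). The build_model function is a hypothetical stand-in for the slow constructor. Whether the real spacy.en object can be serialized this way, and whether restoring it is actually faster, is an assumption that would need to be measured.

```python
import os
import pickle
import tempfile


def build_model():
    """Hypothetical stand-in for the slow English() constructor."""
    return {"vocab": {"hello": 1, "world": 2}}


def save_model(model, path):
    """Write the model to disk once, ahead of time."""
    with open(path, "wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Restore the model at the start of each run."""
    with open(path, "rb") as f:
        return pickle.load(f)


cache_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
save_model(build_model(), cache_path)
restored = load_model(cache_path)
```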


Dhr*_*hak 4

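Your goal should be to initialize the spaCy model only once. Use a class, and make the spaCy model a class attribute. Whenever you use it, it will be the same instance of that attribute.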

from spacy.en import English

class Spacy():
      nlp = English()
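
The load-once behaviour can be checked with a small sketch like the one below, which uses a lazily filled class attribute and a counting stand-in for English(). The names slow_loader and SharedModel are hypothetical. Note that this only helps inside a single long-running process; a script that is started fresh for every request would still pay the load cost each time.

```python
def slow_loader():
    """Hypothetical stand-in for English(); counts how often it runs."""
    slow_loader.calls += 1
    return object()


slow_loader.calls = 0


class SharedModel(object):
    _nlp = None  # filled in on first use, then reused

    @classmethod
    def nlp(cls):
        """Return the shared model, loading it only on the first call."""
        if cls._nlp is None:
            cls._nlp = slow_loader()
        return cls._nlp


first = SharedModel.nlp()
second = SharedModel.nlp()
```

Both calls return the same object, and the loader runs only once.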