I want to use spaCy for NLP in an online service. Each time a user makes a request, I call the script "my_script.py",
which starts with:
from spacy.en import English
nlp = English()
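My problem is that these two lines take more than 10 seconds to run. Is it possible to keep English() in RAM, or is there some other option, so that the load time drops to under a second?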
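You said that you want to launch a freestanding script (my_script.py) whenever a request comes in, and that the script should use capabilities from spacy.en without the overhead of loading spacy.en. With this approach, the operating system always creates a new process when the script is launched. So there is only one way to avoid loading spacy.en each time: keep a separate process running with spacy.en already loaded, and have your script communicate with that process. The code below shows one way to do that. However, as others have said, you would probably benefit from changing your server architecture so that spacy.en is loaded inside the web server itself (e.g., by using a Python-based web server).
The most common form of inter-process communication is via TCP/IP sockets. The code below implements a small server that keeps spacy.en loaded and processes requests from clients. It also includes a client that sends a request to that server and receives the result. It is up to you to decide what to put into those transmissions.
There is also a third script. Since both client and server need send and receive functions, those functions live in a shared script called comm.py. (Note that the client and the server each load a separate copy of comm.py; they do not communicate through a single module loaded into shared memory.)
I assume both scripts run on the same machine. If not, you will need to put a copy of comm.py on both machines and change comm.server_host to the server's machine name or IP address.
Run nlp_server.py as a background process (or just in a different terminal window for testing). It waits for requests, processes them, and sends the results back: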
import comm
import socket
from spacy.en import English
nlp = English()

def process_connection(sock):
    print "processing transmission from client..."
    # receive data from the client
    data = comm.receive_data(sock)
    # do something with the data
    result = {"data received": data}
    # send the result back to the client
    comm.send_data(result, sock)
    # close the socket with this particular client
    sock.close()
    print "finished processing transmission from client..."

server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# open socket even if it was used recently (e.g., server restart)
server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server_sock.bind((comm.server_host, comm.server_port))
# queue up to 5 connections
server_sock.listen(5)
print "listening on port {}...".format(comm.server_port)
try:
    while True:
        # accept connections from clients
        (client_sock, address) = server_sock.accept()
        # process this connection
        # (this could be launched in a separate thread or process)
        process_connection(client_sock)
except KeyboardInterrupt:
    print "Server process terminated."
finally:
    server_sock.close()
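The comment in the accept loop mentions that each connection could be handled in a separate thread. Here is a minimal sketch of that idea, written for Python 3 (unlike the Python 2 listing above) so it runs on a current interpreter. The names handle_client, serve and demo are made up for illustration, and the handler just upper-cases the message instead of calling comm or spaCy. Whether the loaded nlp object can safely be shared between threads is not something this sketch checks.

```python
import socket
import threading

def handle_client(client_sock):
    # Stand-in for process_connection(): send one message back in upper case.
    with client_sock:
        data = client_sock.recv(1024)
        client_sock.sendall(data.upper())

def serve(server_sock, max_clients):
    # Accept max_clients connections and handle each one in its own thread.
    threads = []
    for _ in range(max_clients):
        client_sock, _address = server_sock.accept()
        worker = threading.Thread(target=handle_client, args=(client_sock,))
        worker.start()
        threads.append(worker)
    for worker in threads:
        worker.join()

def demo():
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server_sock.listen(5)
    port = server_sock.getsockname()[1]
    server_thread = threading.Thread(target=serve, args=(server_sock, 2))
    server_thread.start()
    replies = []
    for word in (b"hello", b"world"):
        with socket.create_connection(("127.0.0.1", port)) as sock:
            sock.sendall(word)
            replies.append(sock.recv(1024))
    server_thread.join()
    server_sock.close()
    return replies
```

A real server would loop forever instead of stopping after a fixed number of clients; the limit is only there so the demo terminates.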
Use my_script.py as a quick-running script that requests a result from the nlp server (e.g., python my_script.py here are some arguments):
import socket, sys
import comm
# data can be whatever you want (even just sys.argv)
data = sys.argv
print "sending to server:"
print data
# send data to the server and receive a result
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# disable Nagle algorithm (probably only needed over a network)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, True)
sock.connect((comm.server_host, comm.server_port))
comm.send_data(data, sock)
result = comm.receive_data(sock)
sock.close()
# do something with the result...
print "result from server:"
print result
comm.py contains the code used by both the client and the server:
import sys, struct
import cPickle as pickle

# pick a port that is not used by any other process
server_port = 17001
server_host = '127.0.0.1' # localhost
message_size = 8192
# code to use with struct.pack to convert transmission size (int)
# to a byte string
header_pack_code = '>I'
# number of bytes used to represent size of each transmission
# (corresponds to header_pack_code)
header_size = 4

def send_data(data_object, sock):
    # serialize the data so it can be sent through a socket
    data_string = pickle.dumps(data_object, -1)
    data_len = len(data_string)
    # send a header showing the length, packed into 4 bytes
    sock.sendall(struct.pack(header_pack_code, data_len))
    # send the data
    sock.sendall(data_string)

def receive_data(sock):
    """ Receive a transmission via a socket, and convert it back into a binary object. """
    # This runs as a loop because the message may be broken into arbitrary-size chunks.
    # This assumes each transmission starts with a 4-byte binary header showing the size of the transmission.
    # See https://docs.python.org/3/howto/sockets.html
    # and http://code.activestate.com/recipes/408859-socketrecv-three-ways-to-turn-it-into-recvall/
    header_data = ''
    header_done = False
    # set dummy values to start the loop
    received_len = 0
    transmission_size = sys.maxint
    while received_len < transmission_size:
        sock_data = sock.recv(message_size)
        if not header_done:
            # still receiving header info
            header_data += sock_data
            if len(header_data) >= header_size:
                header_done = True
                # split the already-received data between header and body
                messages = [header_data[header_size:]]
                received_len = len(messages[0])
                header_data = header_data[:header_size]
                # find actual size of transmission
                transmission_size = struct.unpack(header_pack_code, header_data)[0]
        else:
            # already receiving data
            received_len += len(sock_data)
            messages.append(sock_data)
    # combine messages into a single string
    data_string = ''.join(messages)
    # convert to an object
    data_object = pickle.loads(data_string)
    return data_object
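To see the framing scheme from comm.py in isolation (a 4-byte big-endian length header followed by the pickled payload), here is a small round trip over a connected socket pair. It is a Python 3 sketch, so it uses bytes and the pickle module rather than cPickle. The function names send_framed, recv_exactly and recv_framed are my own, and it also raises an error if the peer closes the connection early, which the listing above does not do.

```python
import pickle
import socket
import struct

HEADER_CODE = ">I"                          # big-endian unsigned 32-bit length
HEADER_SIZE = struct.calcsize(HEADER_CODE)  # 4 bytes

def send_framed(sock, obj):
    # Serialize the object, then send its length followed by the payload.
    payload = pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
    sock.sendall(struct.pack(HEADER_CODE, len(payload)) + payload)

def recv_exactly(sock, n):
    # recv() may return fewer bytes than requested, so loop until n bytes arrive.
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed before the full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def recv_framed(sock):
    # Read the fixed-size header first, then exactly as many bytes as it announces.
    (size,) = struct.unpack(HEADER_CODE, recv_exactly(sock, HEADER_SIZE))
    return pickle.loads(recv_exactly(sock, size))

def demo():
    left, right = socket.socketpair()
    try:
        send_framed(left, {"args": ["here", "are", "some", "arguments"]})
        return recv_framed(right)
    finally:
        left.close()
        right.close()
```

Reading the header and the body with the same recv_exactly helper avoids the bookkeeping needed to split the first chunk between header and body.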
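Note: you should make sure the result sent from the server uses only native data structures (dicts, lists, strings, etc.). If the result includes an object defined in spacy.en, the client will automatically import spacy.en when it unpacks the result in order to provide the object's methods, which brings back the load time you were trying to avoid.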
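As a rough illustration of that point, the server could copy the fields it needs into plain dicts before sending. This sketch uses a hypothetical FakeToken stand-in that mimics only the text and pos_ attributes of a spaCy token, so it runs without spaCy installed; tokens_to_native is also a made-up helper name.

```python
import pickle
from collections import namedtuple

# Hypothetical stand-in for a spaCy token; only text and pos_ are mimicked.
FakeToken = namedtuple("FakeToken", ["text", "pos_"])

def tokens_to_native(tokens):
    # Copy the needed attributes into plain dicts of strings.
    return [{"text": tok.text, "pos": tok.pos_} for tok in tokens]

doc = [FakeToken("spaCy", "PROPN"), FakeToken("loads", "VERB")]
result = tokens_to_native(doc)

# The result contains only built-in types, so unpickling it on the client
# does not require importing the module that defined the tokens.
round_tripped = pickle.loads(pickle.dumps(result))
```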
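This setup closely resembles the HTTP protocol (the server waits for connections, the client connects, the client sends a request, the server sends a response, and both sides disconnect). So you might do better to use a standard HTTP server and client instead of this custom code. That would be a "RESTful API", which is a popular term these days (with good reason). Using standard HTTP packages would save you the trouble of managing your own client/server code, and you might even be able to call your data-processing server directly from your existing web server instead of launching my_script.py. However, you would have to translate your request into something compatible with HTTP, e.g., a GET or POST request, or maybe just a specially formatted URL.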
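A minimal sketch of the HTTP variant, using only the Python 3 standard library (http.server and http.client) and JSON instead of pickle. The handler name NLPHandler, the process function, and the {"text": ...} request shape are assumptions for illustration; process is where a call to the already-loaded nlp object would go.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def process(text):
    # Placeholder for the real work, e.g. running the loaded model on the text.
    return {"data received": text, "n_words": len(text.split())}

class NLPHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, process it, and reply with JSON.
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length).decode("utf-8"))
        body = json.dumps(process(request["text"])).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet

def ask_server(host, port, text):
    # Client side: POST the text and decode the JSON response.
    conn = http.client.HTTPConnection(host, port)
    try:
        payload = json.dumps({"text": text}).encode("utf-8")
        conn.request("POST", "/", body=payload,
                     headers={"Content-Type": "application/json"})
        return json.loads(conn.getresponse().read().decode("utf-8"))
    finally:
        conn.close()

def demo():
    server = HTTPServer(("127.0.0.1", 0), NLPHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever)
    thread.start()
    try:
        return ask_server("127.0.0.1", server.server_address[1],
                          "here are some arguments")
    finally:
        server.shutdown()
        thread.join()
        server.server_close()
```

Since the payload is JSON, the client never needs the classes used on the server, which fits the advice to send only native data structures.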
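Another option would be to use a standard inter-process communication package, such as PyZMQ, redis, mpi4py, or zmq_object_exchanger. See this question for some ideas: Efficient Python IPC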
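Or you may be able to save a copy of the spacy.en object on disk using the dill package (https://pypi.python.org/pypi/dill) and then restore it at the start of my_script.py. That may be faster than importing and rebuilding it each time, and simpler than using inter-process communication.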
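The save-and-restore idea could look roughly like this. The sketch uses the standard pickle module so it runs without extra packages; dill provides dump and load functions with the same call shape, so swapping it in should be a one-line change. build_expensive_object is a hypothetical stand-in for English(), and I have not verified that the spaCy object actually serializes this way or that restoring it is faster.

```python
import os
import pickle
import tempfile

def build_expensive_object():
    # Hypothetical stand-in for the slow English() constructor.
    return {"vocab": ["the", "a", "an"], "version": 1}

def load_or_build(cache_path):
    # Restore the object from disk if a saved copy exists, else build and save it.
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f), "loaded from disk"
    obj = build_expensive_object()
    with open(cache_path, "wb") as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
    return obj, "built and saved"

cache_path = os.path.join(tempfile.mkdtemp(), "nlp_cache.pkl")
first = load_or_build(cache_path)   # no cache yet, so the object is built
second = load_or_build(cache_path)  # cache exists, so it is read back
```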
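Your goal should be to initialize the spaCy model only once. Use a class, and make the spaCy model a class attribute. Whenever you use it, it will be the same instance of that attribute.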
from spacy.en import English

class Spacy():
    nlp = English()
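A lazy variant of the same idea, sketched with a hypothetical load_model stand-in so it runs without spaCy: the model is built on first use and every later call returns the same instance. Note that this only shares the instance within one running process; a script that is started fresh for every request would still build it each time.

```python
class SpacyHolder(object):
    _nlp = None      # shared by every user of the class
    load_count = 0   # how many times the model was actually built

    @staticmethod
    def load_model():
        # Hypothetical stand-in for the slow English() call.
        SpacyHolder.load_count += 1
        return object()

    @classmethod
    def get(cls):
        # Build the model on first use, then keep returning the same instance.
        if cls._nlp is None:
            cls._nlp = cls.load_model()
        return cls._nlp

first = SpacyHolder.get()
second = SpacyHolder.get()
```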