Python Gensim: how to make WMD similarity run faster with multiprocessing

jxn*_*jxn 7 python multithreading multiprocessing gensim

I want to run gensim WMD similarity faster. Normally, this is what the docs show. Example corpus:

    my_corpus = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]

    from gensim.models import Word2Vec
    from gensim.similarities import WmdSimilarity

    my_query = 'Human and artificial intelligence software programs'
    my_tokenized_query = ['human', 'artificial', 'intelligence', 'software', 'programs']

    # model is a Word2Vec model trained on about 100,000 documents similar to my_corpus
    model = Word2Vec.load(word2vec_model)

    def init_instance(my_corpus, model, num_best):
        instance = WmdSimilarity(my_corpus, model, num_best=num_best)
        return instance

    instance = init_instance(my_corpus, model, 1)
    instance[my_tokenized_query]

The best-matching document is "Human machine interface for lab abc computer applications", which is great.

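However, querying instance as above takes a very long time. So I thought of splitting the corpus into N parts, running WMD on each part with num_best = 1, and then at the end taking the part with the highest score as the most similar.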

    import operator
    from multiprocessing import Process, Queue, Manager

    import gensim

    def main(my_query, global_jobs, process_tmp):
        process_query = gensim.utils.simple_preprocess(my_query)

        def worker(num, process_query, return_dict):
            instance = init_instance(
                my_corpus[num*chunk+1:num*chunk+chunk], model, 1)
            x = instance[process_query][0][0]  # index of the best match within this chunk
            y = instance[process_query][0][1]  # its similarity score
            return_dict[x] = y

        manager = Manager()
        return_dict = manager.dict()

        for num in range(num_workers):
            process_tmp = Process(target=worker, args=(num, process_query, return_dict))
            global_jobs.append(process_tmp)
            process_tmp.start()
        for proc in global_jobs:
            proc.join()

        return_dict = dict(return_dict)
        ind = max(return_dict.items(), key=operator.itemgetter(1))[0]
        print(my_corpus[ind])
        # output: "Graph minors A survey"

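The problem I have is that, although it outputs something, it does not give a good match for the query from my corpus, even though it takes the maximum similarity across all parts.

Am I doing something wrong?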

sto*_*vfl 5

Comment: chunk is a static variable: e.g. chunk = 600 ...

If you define chunk statically, you have to compute num_workers from it.

    10001 / 600 = 16.6683333333, rounded up = 17 num_workers

It is common practice to use no more processes than you have cores.
If you have 17 cores, that's fine.
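
The worker count above is a ceiling division. A minimal sketch, assuming the 10001-document corpus and chunk = 600 from the arithmetic above; workers_for_chunk is a hypothetical helper name, not part of gensim or multiprocessing:

```python
import math
import os

def workers_for_chunk(n_docs, chunk):
    """Number of chunks (and therefore workers) needed to cover n_docs documents."""
    return math.ceil(n_docs / chunk)

n_workers = workers_for_chunk(10001, 600)

# os.cpu_count() may return None, so fall back to 1 in that case.
cpu = os.cpu_count() or 1

# True when a fixed chunk size asks for more processes than there are cores.
oversubscribed = n_workers > cpu
```

If oversubscribed is true, a fixed chunk size is starting more processes than the machine can run in parallel.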

The number of cores is fixed, so you should instead compute:

num_workers = os.cpu_count()
chunk = chunksize(my_corpus, num_workers)

  1. The results are not the same. Change the query to:

    #process_query = gensim.utils.simple_preprocess(my_query)
    process_query = my_tokenized_query
    
  2. All workers return indices 0..n.
    So return_dict[x] can be overwritten by a later worker that has the same index but a lower value. The index in return_dict should be the same as the index in my_corpus. Change to:

    #return_dict[x] = y
    return_dict[ (num * chunk)+x ] = y
    
  3. Using +1 in the chunk slice calculation skips the first document of each chunk.
    I don't know how you compute chunk; consider this example:

    def chunksize(iterable, num_workers):
        c_size, extra = divmod(len(iterable), num_workers)
        if extra:
            c_size += 1
        if len(iterable) == 0:
            c_size = 0
        return c_size
    
    #Usage
    chunk = chunksize(my_corpus, num_workers)
    ...
    #my_corpus_chunk = my_corpus[num*chunk+1:num*chunk+chunk]
    my_corpus_chunk = my_corpus[num * chunk:(num+1) * chunk]
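
The index and slicing fixes (items 2 and 3) can be checked together without gensim. The following is a minimal sketch, assuming a toy token-overlap score as a stand-in for WMD similarity and running the "workers" sequentially rather than in separate processes; toy_score, best_in_chunk and best_overall are hypothetical names, and the corpus is assumed to be non-empty.

```python
def chunksize(iterable, num_workers):
    # Ceiling division, as in the helper above.
    c_size, extra = divmod(len(iterable), num_workers)
    if extra:
        c_size += 1
    return c_size

def toy_score(query_tokens, doc):
    # Stand-in for WMD similarity (assumption): Jaccard overlap of lowercase tokens.
    q, d = set(query_tokens), set(doc.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0

def best_in_chunk(query_tokens, docs):
    # Mimics num_best=1: (local index, score) of the best document in docs.
    scores = [toy_score(query_tokens, doc) for doc in docs]
    local = max(range(len(docs)), key=scores.__getitem__)
    return local, scores[local]

def best_overall(query_tokens, corpus, num_workers):
    chunk = chunksize(corpus, num_workers)
    results = {}
    for num in range(num_workers):
        # Non-overlapping slices with no +1, so no document is skipped.
        docs = corpus[num * chunk:(num + 1) * chunk]
        if not docs:
            continue
        local, score = best_in_chunk(query_tokens, docs)
        # Global index as key, so workers cannot overwrite each other.
        results[num * chunk + local] = score
    return max(results.items(), key=lambda kv: kv[1])[0]

corpus = ["Human machine interface for lab abc computer applications",
          "A survey of user opinion of computer system response time",
          "The EPS user interface management system",
          "System and human system engineering testing of EPS",
          "Relation of user perceived response time to error measurement",
          "The generation of random binary unordered trees",
          "The intersection graph of paths in trees",
          "Graph minors IV Widths of trees and well quasi ordering",
          "Graph minors A survey"]
query = ['human', 'artificial', 'intelligence', 'software', 'programs']

unchunked = best_overall(query, corpus, 1)
chunked = best_overall(query, corpus, 2)
```

With these two fixes, chunked and unchunked select the same document index for the toy score. Real WMD scores depend on the trained model, so this only checks the bookkeeping, not the similarity values.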
    

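Result: 10 runs, tuple = (index from worker 0, index from worker 1)

With multiprocessing, chunk=5:
02,09:(3,8), 01,03:(3,5):
System and human system engineering testing of EPS
04,06,07:(0,8), 05,08:(0,5), 10:(0,7):
Human machine interface for lab abc computer applications

Without multiprocessing, chunk=5:
01:(3,6), 02:(3,5), 05,08,10:(3,7), 07,09:(3,8):
System and human system engineering testing of EPS
03,04,06:(0,5):
Human machine interface for lab abc computer applications

Without multiprocessing, without chunking:
01,02,03,04,06,07,08:(3,-1):
System and human system engineering testing of EPS
05,09,10:(0,-1):
Human machine interface for lab abc computer applications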

Tested with Python 3.4.2