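Posts by aja*_*ahu

Is there a limit on the result set of Gensim's Doc2Vec most_similar?

I have been experimenting with the doc2vec module for a while. I can train my model and have the trained model return documents similar to a given document, as follows: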

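import re
from gensim.models.doc2vec import Doc2Vec

# Load the previously trained model from disk
modelloaded = Doc2Vec.load("model_all_doc_dm_1")

st = 'long description of a document as string'
# Keep letters only, lowercase, and split into tokens
doc = re.sub('[^a-zA-Z]', ' ', st).lower().split()

# Infer a vector for the unseen document and look up its nearest neighbours
new_doc_vec = modelloaded.infer_vector(doc)

modelloaded.docvecs.most_similar([new_doc_vec])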

This works well and gives me 10 results. Is there a way to get more than 10 results, or is that the limit?
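For context, gensim's `most_similar` takes a `topn` argument that defaults to 10, so a call such as `modelloaded.docvecs.most_similar([new_doc_vec], topn=50)` should return 50 results instead. The sketch below only illustrates what such a parameter does, using plain Python and made-up toy vectors. The names `cosine` and `top_n_similar` are hypothetical helpers for this example, not gensim's implementation.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n_similar(query, doc_vectors, topn=10):
    """Return the topn (tag, similarity) pairs, most similar first.

    Hypothetical stand-in for illustration only.
    """
    scored = ((tag, cosine(query, vec)) for tag, vec in doc_vectors.items())
    return heapq.nlargest(topn, scored, key=lambda pair: pair[1])

# Toy data: 25 two-dimensional "document vectors" (invented for the example).
doc_vectors = {"doc%d" % i: [1.0, float(i)] for i in range(25)}
query = [1.0, 0.0]

default_hits = top_n_similar(query, doc_vectors)        # default topn=10
more_hits = top_n_similar(query, doc_vectors, topn=20)  # ask for 20 instead
```

The default of 10 is just a default here, not a cap: passing a larger `topn` returns more results, up to the number of stored vectors.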

python-3.x gensim

aja*_*ahu

lucky-day

5
votes

1
answer

667
views

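How can I speed up cosine similarity between a numpy array and a very large matrix?

I have a problem that requires computing cosine similarities between a numpy array of shape (1, 300) and a matrix of shape (5000000, 300). I have tried several different styles of code, and now I am wondering whether there is a way to reduce the runtime substantially:

Version 1: I split my large matrix into 5 smaller matrices of 1 million rows each: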

import concurrent.futures
from scipy import spatial

def cos_matrix_multiplication(vector, matrix_1):
    # Cosine distance between every row of matrix_1 and the query vector
    v = vector.reshape(1, -1)
    scores1 = spatial.distance.cdist(matrix_1, v, 'cosine')

    # Note: [:1] keeps only the first row of the distance column
    return scores1[:1]

# The five 1M-row slices of the large matrix (defined elsewhere, as is vec)
URLS = [mat_small1, mat_small2, mat_small3, mat_small4, mat_small5]

neighbors = []
with concurrent.futures.ThreadPoolExecutor(max_workers=30) as executor:
    # Submit one task per slice and remember which slice each future belongs to
    future_to_url = {executor.submit(cos_matrix_multiplication, vec, mat_col): mat_col for mat_col in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        data = future.result()
        neighbors.append(data)

Runtime: 2.48 seconds

Version …
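One direction that is sometimes worth trying for this kind of problem is to replace `cdist` with a single matrix-vector product and to compute the row norms of the large matrix only once. The following is a minimal sketch under stated assumptions: the data is random, the matrix is much smaller than the 5,000,000 rows in the question, and the function name `cosine_similarities` is hypothetical. No timing is claimed; whether it is faster would have to be measured on the real data.

```python
import numpy as np

def cosine_similarities(vector, matrix, matrix_norms=None):
    """Cosine similarity of one vector against every row of matrix.

    matrix_norms can be passed in when it has been precomputed, so that
    repeated queries do not recompute it. Rows with zero norm would cause
    a division by zero and are assumed not to occur.
    """
    v = np.asarray(vector, dtype=matrix.dtype).ravel()
    if matrix_norms is None:
        matrix_norms = np.linalg.norm(matrix, axis=1)
    return (matrix @ v) / (matrix_norms * np.linalg.norm(v))

# Small random stand-ins for the (1, 300) array and the large matrix.
rng = np.random.default_rng(0)
matrix = rng.standard_normal((10000, 300)).astype(np.float32)
vec = rng.standard_normal((1, 300)).astype(np.float32)

row_norms = np.linalg.norm(matrix, axis=1)  # computed once, reused per query
sims = cosine_similarities(vec, matrix, row_norms)

# Indices of the five most similar rows, best first.
top5 = np.argsort(-sims)[:5]
```

Note that this returns similarities, whereas `cdist(..., 'cosine')` returns distances (one minus the similarity).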

python cuda gpu cosine-similarity numba

aja*_*ahu

2017 12-06

5
votes

1
answer

1828
views

How can I find the most frequently occurring word pairs in a file using Python?

I have a dataset like the following:

"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen", "Narnia", "Statistics", "Estimation"
"633","MachineLearning","AI","I like Cars, but I also like bikes"
"717","I like Sheen","MachineLearning", "regression", "AI"
"136","MachineLearning","AI","TopGear"

and so on.

I want to find the most frequently occurring word pairs, for example:

(Statistics,Estimation:2)
(Statistics,Narnia:2)
(Narnia,Statistics)
(MachineLearning,AI:3)

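The two words can be in any order and at any distance from each other.

Can someone suggest a possible solution in Python? This is a very large dataset.

Any suggestions are much appreciated.

So here is what I tried after @275365's suggestion.

@275365 I tried reading the input from a file as follows: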

    def collect_pairs(file):
        pair_counter = Counter()
        for line in open(file):
            unique_tokens = sorted(set(line))  
            combos = combinations(unique_tokens, 2)
            pair_counter += Counter(combos)
            print pair_counter

    file = ('myfileComb.txt')
    p=collect_pairs(file)

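The text file has the same number of lines as the original, but each line contains only its unique tokens. I don't know what I am doing wrong: when I run this, the output is made up of individual letters rather than the expected combinations of words.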
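A likely cause of the split letters is that `set(line)` is applied to a string, and iterating over a string yields its characters, so the pairs end up being pairs of letters. Parsing each line into fields first avoids that. The sketch below is one possible approach, not the accepted answer: it is written for Python 3, uses the `csv` module so that quoted fields containing commas stay intact, and assumes (this is an assumption, not stated in the question) that the first column is an ID that should be ignored. The sample rows are the ones shown in the question.

```python
import csv
import io
from collections import Counter
from itertools import combinations

def collect_pairs(lines):
    """Count unordered pairs of fields that occur together on a line."""
    pair_counter = Counter()
    # skipinitialspace handles the blank that sometimes follows a comma.
    reader = csv.reader(lines, skipinitialspace=True)
    for row in reader:
        # Assumption: column 0 is an ID, so only the remaining fields count.
        # Sorting makes (A, B) and (B, A) the same key.
        unique_tokens = sorted(set(row[1:]))
        pair_counter.update(combinations(unique_tokens, 2))
    return pair_counter

# The sample rows from the question, standing in for the real file.
sample = io.StringIO('''"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen", "Narnia", "Statistics", "Estimation"
"633","MachineLearning","AI","I like Cars, but I also like bikes"
"717","I like Sheen","MachineLearning", "regression", "AI"
"136","MachineLearning","AI","TopGear"
''')

pairs = collect_pairs(sample)
most_common_pair = pairs.most_common(1)
```

For a real file, an open file object can be passed in place of `sample`, for example `with open('myfileComb.txt', newline='') as f: pairs = collect_pairs(f)`. Because the file is read line by line, only the counter itself has to fit in memory.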

python word-count python-2.7

aja*_*ahu

2014 02-13

4
votes

1
answer

4504
views