An efficient way to parse strings from a file into lists of floats in Python

Vic*_*zzi 0 python performance parsing

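Each line of the file has one word followed by thousands of floats. I want to convert it into a dictionary with the word as the key and the list of all its floats as the value. This is what I'm doing, but because of the size of the file (about 20k lines, each with about 10k values), the process takes far too long. I couldn't find a more efficient way to parse it, only some alternatives that aren't guaranteed to reduce the run time.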

with open("googlenews.word2vec.300d.txt") as g_file:
  i = 0;
  #dict of words: [lots of floats]
  google_words = {}

  for line in g_file:
    google_words[line.split()[0]] = [float(line.split()[i]) for i in range(1, len(line.split()))]

And*_*zko 5

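In your solution, the slow line.split() call is repeated over and over: once for the word, once for the length, and again for every number on the line. Consider the following modification, which splits each line only once: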

with open("googlenews.word2vec.300d.txt") as g_file:
    # dict of words: [lots of floats]
    google_words = {}

    for line in g_file:
        word, *numbers = line.split()
        google_words[word] = [float(number) for number in numbers]
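For experimenting without the real file, the same idea can be wrapped in a function and run on a small in-memory sample. This is only a sketch: the name load_vectors, the blank-line check, and the two sample lines are assumptions added for illustration, not part of the original code.

```python
import io


def load_vectors(lines):
    """Map the first token of each line to the list of floats that follows it."""
    vectors = {}
    for line in lines:
        # Split once per line and reuse the result.
        tokens = line.split()
        if not tokens:
            # Assumption: blank lines carry no data and can be skipped.
            continue
        word, *numbers = tokens
        # Convert every remaining token to float without re-splitting the line.
        vectors[word] = list(map(float, numbers))
    return vectors


# Hypothetical two-line sample standing in for the real file.
sample = io.StringIO("cat 0.1 0.2 0.3\ndog 1.5 -2.0 4\n")
vectors = load_vectors(sample)
```

With the real data, the open file object can be passed in directly, for example google_words = load_vectors(g_file) inside the with block. Whether map(float, ...) beats the list comprehension is worth measuring on your own data; the main saving comes from splitting each line only once.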

One advanced concept I've used here is "unpacking": word, *numbers = line.split()

Python allows an iterable to be unpacked into multiple variables:

a, b, c = [1, 2, 3]
# This is practically equivalent to
a = 1
b = 2
c = 3

The * is a shortcut for "take the leftovers, put them in a list, and assign that list to the name":

a, *rest = [1, 2, 3, 4]
# results in
a == 1
rest == [2, 3, 4]
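Applied to a line shaped like the ones in the question, starred unpacking behaves as sketched below. The sample strings are made up for illustration.

```python
# A made-up line in the same "word float float ..." shape as the file.
line = "cat 0.1 0.2 0.3"
word, *numbers = line.split()

# The starred name always receives a list, even when nothing is left over.
only_word, *nothing = "cat".split()

# The starred name does not have to come last.
first, *middle, last = [1, 2, 3, 4]
```

Note that numbers still holds strings at this point, which is why the solution passes each one through float().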