Embedding unseen words with TensorFlow and a pre-trained FastText model

Mun*_*ong 7 embedding tensorflow fasttext

I am using a pre-trained fastText model (https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md).

I load the fastText model with Gensim. It can output a vector for any word, whether seen or unseen (out-of-vocabulary):

from gensim.models.wrappers import FastText
en_model = FastText.load_fasttext_format('../wiki.en/wiki.en')
print(en_model['car'])
print(en_model['carcaryou'])

In TensorFlow, I know I can use the code below to get a trainable embedding for seen words:

# Embedding layer
embeddings = tf.get_variable('embedding_matrix', [vocab_size, state_size], trainable=True)
rnn_inputs = tf.nn.embedding_lookup(embeddings, x)

Indices for known words are easy to obtain. For unseen words, however, FastText "predicts" their latent vectors from subword patterns, so unseen words have no index.

In this situation, how should I use TensorFlow to handle both known words and the unseen words that fastText can embed?
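One alternative worth considering (my suggestion, not part of the original question or answer): since fastText can produce a vector for *any* string, you can scan your corpus first, assign an index to every distinct token (unseen words included), and precompute the embedding matrix row by row. The sketch below uses a deterministic toy `fake_vector` function as a stand-in for `en_model[word]`, so it runs without the multi-gigabyte wiki.en model; in practice you would pass the gensim lookup instead.

```python
import zlib
import numpy as np

def build_embedding_matrix(vocab, get_vector, dim):
    """Give every corpus word (OOV included) an index, and fill each
    matrix row with its fastText vector via get_vector(word)."""
    word_to_idx = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)
    for word, i in word_to_idx.items():
        matrix[i] = get_vector(word)
    return word_to_idx, matrix

# Toy deterministic stand-in for en_model[word], for illustration only.
def fake_vector(word, dim=4):
    rng = np.random.RandomState(zlib.crc32(word.encode()) % (2 ** 32))
    return rng.rand(dim).astype(np.float32)

word_to_idx, matrix = build_embedding_matrix(
    ["car", "carcaryou"], fake_vector, dim=4
)
```

The resulting matrix could then seed the graph, e.g. `tf.get_variable('embedding_matrix', initializer=matrix, trainable=True)` followed by the usual `tf.nn.embedding_lookup`, at the cost of fixing the token set before training.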

ted*_*ted 3

I found a workaround using tf.py_func:

def lookup(arr):
    global model
    global decode

    decoded_arr = decode(arr)
    new_arr = np.zeros((*arr.shape, 300))
    for s, sent in enumerate(decoded_arr):
        for w, word in enumerate(sent):
            try:
                new_arr[s, w] = model.wv[word]
            except Exception as e:
                print(e)
                new_arr[s, w] = np.zeros(300)
    return new_arr.astype(np.float32)

z = tf.py_func(lookup, [x], tf.float32, stateful=True, name=None)
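The numpy side of this workaround can be exercised without TensorFlow. The sketch below mimics the `lookup` body with a small dict standing in for `model.wv` (hypothetical stand-in, chosen so the example is self-contained): bytes come in with shape (batch, words), float32 vectors come out with shape (batch, words, DIM), and any failed lookup falls back to zeros.

```python
import numpy as np

DIM = 3
vectors = {"chat": np.ones(DIM, dtype=np.float32)}  # toy stand-in for model.wv

decode = np.vectorize(lambda b: b.decode("utf-8"))

def lookup(arr):
    """Same shape logic as the tf.py_func body: (batch, words) bytes in,
    (batch, words, DIM) float32 out, zeros on lookup failure."""
    decoded = decode(arr)
    out = np.zeros((*arr.shape, DIM), dtype=np.float32)
    for s, sent in enumerate(decoded):
        for w, word in enumerate(sent):
            try:
                out[s, w] = vectors[word]
            except KeyError:
                out[s, w] = np.zeros(DIM)
    return out

batch = np.array([[b"chat", b"<pad>"]])
z = lookup(batch)
```

Note that with the real gensim model an unseen word usually still returns a subword-based vector; the zero fallback only triggers when fastText itself raises (e.g. no matching n-grams).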

This code works (it uses French, sorry, but that does not matter):

import tensorflow as tf
import numpy as np
from gensim.models.wrappers import FastText

model = FastText.load_fasttext_format("../../Tracfin/dev/han/data/embeddings/cc.fr.300.bin")
decode = np.vectorize(lambda x: x.decode("utf-8"))

def lookup(arr):
    global model
    global decode

    decoded_arr = decode(arr)
    new_arr = np.zeros((*arr.shape, 300))
    for s, sent in enumerate(decoded_arr):
        for w, word in enumerate(sent):
            try:
                new_arr[s, w] = model.wv[word]
            except Exception as e:
                print(e)
                new_arr[s, w] = np.zeros(300)
    return new_arr.astype(np.float32)

def extract_words(token):
    # Split on spaces
    out = tf.string_split([token], delimiter=" ")
    # Convert to dense tensor, filling with the default value
    out = tf.reshape(tf.sparse_tensor_to_dense(out, default_value="<pad>"), [-1])
    return out


textfile = "text.txt"
words = [
    "ceci est un texte hexabromocyclododécanes intéressant qui mentionne des",
    "mots connus et des mots inconnus commeceluici ou celui-là polybromobiphényle",
]

with open(textfile, "w") as f:
    f.write("\n".join(words))

tf.reset_default_graph()
padded_shapes = tf.TensorShape([None])
padding_values = "<pad>"

dataset = tf.data.TextLineDataset(textfile)
dataset = dataset.map(extract_words, 2)
dataset = dataset.shuffle(10000, reshuffle_each_iteration=True)
dataset = dataset.repeat()
dataset = dataset.padded_batch(3, padded_shapes, padding_values)
iterator = tf.data.Iterator.from_structure(
    dataset.output_types, dataset.output_shapes
)
dataset_init_op = iterator.make_initializer(dataset, name="dataset_init_op")
x = iterator.get_next()
z = tf.py_func(lookup, [x], tf.float32, stateful=True, name=None)
sess = tf.InteractiveSession()
sess.run(dataset_init_op)
y, w = sess.run([x, z])
y = decode(y)

print(
    "\nWords out of vocabulary: ",
    np.sum(1 for word in y.reshape(-1) if word not in model.wv.vocab),
)
print("Lookup worked: ", all(model.wv[y[0][0][0]] == w[0][0][0]))

It prints:

Words out of vocabulary:  6
Lookup worked:  True

I have not tried to optimize anything, particularly the lookup loop; comments are welcome.
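One easy optimization of that loop (my suggestion, not part of the answer): cache vectors per distinct word, so fastText's subword computation runs once per unique token rather than once per occurrence. A hedged sketch, with `get_vector` standing in for `model.wv.__getitem__`:

```python
import numpy as np

def cached_lookup(arr, get_vector, dim=300):
    """Compute each distinct word's vector once, then index into the table.
    get_vector stands in for model.wv.__getitem__ (hypothetical here)."""
    words, inverse = np.unique(arr, return_inverse=True)
    table = np.zeros((len(words), dim), dtype=np.float32)
    for i, word in enumerate(words):
        try:
            table[i] = get_vector(word)
        except KeyError:
            pass  # leave zeros, matching the original fallback
    return table[inverse].reshape(*arr.shape, dim)

# Toy stand-in vocabulary: "a" is known, "b" is not.
vecs = {"a": np.ones(2, dtype=np.float32)}
out = cached_lookup(np.array([["a", "b"], ["a", "a"]]), vecs.__getitem__, dim=2)
```

For batches with many repeated tokens (padding included), this trades one `np.unique` pass for a much smaller number of model lookups.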
