What is the UNK token in word vector representations?

Question

# Step 2: Build the dictionary and replace rare words with UNK token.
import collections

vocabulary_size = 50000


def build_dataset(words, n_words):
  """Process raw inputs into a dataset."""
  # Slot 0 is reserved for UNK; -1 is a placeholder count, filled in below.
  count = [['UNK', -1]]
  # Keep only the (n_words - 1) most frequent words.
  count.extend(collections.Counter(words).most_common(n_words - 1))
  # Map each kept word to an integer id; 'UNK' gets id 0.
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count += 1
    data.append(index)
  # Replace the placeholder with the real number of unknown words.
  count[0][1] = unk_count
  reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reversed_dictionary

# `vocabulary` is the word list loaded in an earlier step (not shown here).
data, count, dictionary, reverse_dictionary = build_dataset(vocabulary,
                                                            vocabulary_size)

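I am working through the basic TensorFlow example on vector representations of words.

Step 2 is titled "Build the dictionary and replace rare words with UNK token", but "UNK" is never defined beforehand.

To be specific:

0) What does UNK generally refer to in NLP?

1) What does count = [['UNK', -1]] mean? I know that brackets [] denote a list in Python, but why is 'UNK' paired with -1?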

Answer 1

Pey*_*man 8

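As already mentioned in the comments, in tokenization and NLP the token UNK most likely stands for an unknown word.

For example, suppose you want to predict a missing word in a sentence. How would you feed the data to the model? You need some token to mark where the missing word is. So if "house" is the missing word, the sentence would look like this after tokenization: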

'my house is big'->['my', 'UNK', 'is', 'big']
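
A minimal sketch of that replacement, assuming a tiny hand-written vocabulary. The names `vocab` and `replace_unknown` are made up for illustration and are not part of the TensorFlow example:

```python
def replace_unknown(tokens, vocab, unk_token="UNK"):
    """Return tokens with every out-of-vocabulary word replaced by unk_token."""
    return [tok if tok in vocab else unk_token for tok in tokens]


# Hypothetical vocabulary that does not contain "house".
vocab = {"my", "is", "big"}
tokens = "my house is big".split()

print(replace_unknown(tokens, vocab))  # ['my', 'UNK', 'is', 'big']
```

Any word outside the vocabulary is mapped to the same UNK token, so the model only ever sees a fixed, known set of symbols.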

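PS: count = [['UNK', -1]] just initializes count in the form [['word', number_of_occurrences]], as Ivan Aksamentov already said. The -1 is only a placeholder: once the words have been processed, count[0][1] = unk_count overwrites it with the actual number of unknown words.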
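
To see what happens to the -1, here is a small self-contained sketch that repeats the logic of build_dataset from the question on a toy word list. The word list and n_words=3 are assumptions chosen for illustration, and the reversed dictionary is left out for brevity:

```python
import collections


def build_dataset(words, n_words):
    # Same logic as in the question: slot 0 is UNK with a placeholder count.
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = {}
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = []
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # id of 'UNK'
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count  # the placeholder -1 is replaced here
    return data, count, dictionary


# Toy corpus (illustrative only): 'the' x3, 'cat' x2, everything else once.
words = "the cat sat on the mat the cat".split()
data, count, dictionary = build_dataset(words, n_words=3)

print(dictionary)  # {'UNK': 0, 'the': 1, 'cat': 2}
print(data)        # [1, 2, 0, 0, 1, 0, 1, 2]
print(count)       # [['UNK', 3], ('the', 3), ('cat', 2)]
```

With n_words=3 only the two most frequent words keep their own ids; 'sat', 'on' and 'mat' all become id 0, and the UNK entry in count ends up as 3 instead of -1.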
