How do I form feature vectors for a classifier targeted at named entity recognition?

Question


Leg*_*end 2 language-agnostic nlp machine-learning

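I have a set of tags (different from the conventional Name, Place, Object, etc.). In my case they are domain-specific, and I call them Entity, Action, and Incident. I want to use them as seeds for extracting more named entities.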


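I came across this paper: "Efficient Support Vector Classifiers for Named Entity Recognition" by Isozaki et al. I like the idea of using support vector machines for named entity recognition, but I am stuck on how to encode the feature vector. In their paper, this is what they say: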

For instance, the words in "President George Herbert Bush said Clinton is . . ." are classified as follows: "President" = OTHER, "George" = PERSON-BEGIN, "Herbert" = PERSON-MIDDLE, "Bush" = PERSON-END, "said" = OTHER, "Clinton" = PERSON-SINGLE, "is" = OTHER. In this way, the first word of a person's name is labeled as PERSON-BEGIN. The last word is labeled as PERSON-END. Other words in the name are PERSON-MIDDLE. If a person's name is expressed by a single word, it is labeled as PERSON-SINGLE. If a word does not belong to any named entities, it is labeled as OTHER. Since IREX defines eight NE classes, words are classified into 33 categories.

Each sample is represented by 15 features because each word has three features (part-of-speech tag, character type, and the word itself), and two preceding words and two succeeding words are also used for context dependence. Although infrequent features are usually removed to prevent overfitting, we use all features because SVMs are robust. Each sample is represented by a long binary vector, i.e., a sequence of 0 (false) and 1 (true). For instance, "Bush" in the above example is represented by a vector x = x[1] ... x[D] described below. Only 15 elements are 1.

x[1] = 0 // Current word is not 'Alice'
x[2] = 1 // Current word is 'Bush'
x[3] = 0 // Current word is not 'Charlie'

x[15029] = 1 // Current POS is a proper noun
x[15030] = 0 // Current POS is not a verb

x[39181] = 0 // Previous word is not 'Henry'
x[39182] = 1 // Previous word is 'Herbert'


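I don't really understand how the binary vector here is being constructed. I know I am missing a subtle point, but can someone help me understand this?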
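For reference, the labeling scheme in the quoted passage (the classifier's targets, as opposed to its input features) can be sketched roughly as follows. This is only an illustration: `label_tokens`, `num_categories`, and the `(start, end, class)` span format are hypothetical names and conventions, not something taken from the paper.

```python
def label_tokens(tokens, spans):
    """Assign BEGIN/MIDDLE/END/SINGLE/OTHER labels to each token.

    spans is a list of (start, end_exclusive, entity_class) tuples;
    this input format is an assumption made for the sketch.
    """
    labels = ["OTHER"] * len(tokens)
    for start, end, cls in spans:
        if end - start == 1:
            # A one-word entity gets the SINGLE label.
            labels[start] = f"{cls}-SINGLE"
            continue
        labels[start] = f"{cls}-BEGIN"
        for i in range(start + 1, end - 1):
            labels[i] = f"{cls}-MIDDLE"
        labels[end - 1] = f"{cls}-END"
    return labels


def num_categories(num_entity_classes):
    # Four positional labels per entity class, plus the shared OTHER label.
    return 4 * num_entity_classes + 1


tokens = ["President", "George", "Herbert", "Bush", "said", "Clinton", "is"]
spans = [(1, 4, "PERSON"), (5, 6, "PERSON")]
labels = label_tokens(tokens, spans)
for token, label in zip(tokens, labels):
    print(token, label)
```

With eight entity classes, `num_categories(8)` gives the 33 categories mentioned in the quote; with the three domain-specific tags above it would give 13.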


Answer 1

Rob*_*aus 5

There is a bag-of-words lexicon-building step that they omit.

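Basically, you build a map from the (non-rare) words in the training set to indices. Say you have 20k unique words in your training set. You then have a mapping from every word in the training set to an index in [0, 20000), i.e. 0 through 19,999.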
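The lexicon-building step could look something like the sketch below. `build_index` and its `min_count` threshold are hypothetical names chosen for illustration, and the toy word list stands in for a real training set.

```python
from collections import Counter


def build_index(items, min_count=1):
    """Map each item seen at least min_count times to a consecutive index."""
    counts = Counter(items)
    kept = sorted(item for item, count in counts.items() if count >= min_count)
    return {item: index for index, item in enumerate(kept)}


# Toy stand-in for the words of a training set.
training_words = ["President", "George", "Herbert", "Bush", "said",
                  "Clinton", "is", "said"]
word_index_mapping = build_index(training_words)
print(word_index_mapping)
```

The same helper can be reused for the POS tags and character types. Words that are dropped as rare, or never seen in training, still need some fallback at encoding time (for example a shared "unknown" entry), which this sketch leaves out.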

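Then the feature vector is basically a concatenation of a few very sparse vectors: one with a 1 at the index of the particular word and 19,999 0s, then one with a 1 for the particular POS and, say, 50 other 0s for the inactive POS tags, and so on. This is generally called a one-hot encoding. http://en.wikipedia.org/wiki/One-hot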

def encode_word_feature(word, POStag, char_type, word_index_mapping, POS_index_mapping, char_type_index_mapping):
  # A sparsely encoded vector makes far more sense than a dense list, but it's clearer this way.
  ret = [0] * (len(word_index_mapping) + len(POS_index_mapping) + len(char_type_index_mapping))
  so_far = 0  # offset of the current feature group within the vector
  ret[word_index_mapping[word] + so_far] = 1
  so_far += len(word_index_mapping)
  ret[POS_index_mapping[POStag] + so_far] = 1
  so_far += len(POS_index_mapping)
  ret[char_type_index_mapping[char_type] + so_far] = 1
  return ret

def encode_context(context):
  # The three *_index_mapping dicts are assumed to have been built beforehand and to be visible here.
  # "+" on lists concatenates the per-word blocks into one long vector.
  return (encode_word_feature(context.two_words_ago, context.two_pos_ago, context.two_char_types_ago,
              word_index_mapping, POS_index_mapping, char_type_index_mapping) +
          encode_word_feature(context.one_word_ago, context.one_pos_ago, context.one_char_types_ago,
              word_index_mapping, POS_index_mapping, char_type_index_mapping)
          # ... pattern is obvious
         )


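So your feature vector is of size about 100k (five word positions times the 20k-word vocabulary), with a little extra for the POS and char-type tags, and it is almost entirely 0s, except for 15 1s in positions picked according to your feature-to-index mappings.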
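Since only 15 entries are ever 1, it is usually enough to store just their positions. Here is a minimal sketch of that idea, assuming a five-word window and three tiny made-up mappings; `active_indices`, the Penn Treebank-style tag names, and the character-type names are all illustrative assumptions.

```python
def active_indices(window, mappings):
    """Return the positions of the 1s in the concatenated vector, and its length.

    window:   (word, pos, char_type) tuples for offsets -2, -1, 0, +1, +2.
    mappings: (word_index, pos_index, char_type_index) dicts.
    """
    block = sum(len(mapping) for mapping in mappings)  # size of one word's block
    indices = []
    for position, features in enumerate(window):
        offset = position * block
        for value, mapping in zip(features, mappings):
            indices.append(offset + mapping[value])
            offset += len(mapping)
    return indices, block * len(window)


# Toy mappings; a real lexicon would have thousands of entries.
word_index = {"President": 0, "George": 1, "Herbert": 2, "Bush": 3,
              "said": 4, "Clinton": 5, "is": 6}
pos_index = {"NNP": 0, "VBD": 1, "VBZ": 2}
char_index = {"capitalized": 0, "lowercase": 1}

# Window centered on "Bush".
window = [("George", "NNP", "capitalized"),
          ("Herbert", "NNP", "capitalized"),
          ("Bush", "NNP", "capitalized"),
          ("said", "VBD", "lowercase"),
          ("Clinton", "NNP", "capitalized")]

indices, dim = active_indices(window, (word_index, pos_index, char_index))
print(len(indices), "active features out of", dim)
```

A list of indices like this can be handed to a sparse matrix type or to an SVM library's sparse input format, rather than materializing a 100k-element dense list per sample.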
