Keras text preprocessing with one_hot

Question

I came across this code while learning Keras online.

from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence

text = 'One hot encoding in Keras'
tokens = text_to_word_sequence(text)
length = len(tokens)
one_hot(text, length)

This returns integers like these:

[3,1,1,2,3]

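I don't understand why and how unique words end up with repeated numbers. For example, 3 and 1 are both repeated even though every word in the text is unique.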

Answer 1

Kri*_*R89 7

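The documentation for one_hot describes it as a wrapper around hashing_trick:

This is a wrapper to the hashing_trick function using hash as the hashing function; unicity of word to index mapping non-guaranteed.

And from the documentation of hashing_trick:

Two or more words may be assigned to the same index, due to possible collisions by the hashing function. The probability of a collision is in relation to the dimension of the hashing space and the number of distinct objects.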

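Because hashing is used, different words can be hashed to the same index. How likely that is depends on the vocabulary size you choose: the smaller the hash space relative to the number of distinct words, the more likely a collision. In your code the vocabulary size equals the number of words, which leaves no headroom at all. Jason Brownlee suggests using a vocabulary size about 25% larger than the number of words, to make unique hashes more likely.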
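To make the collision argument concrete, here is a minimal pure-Python sketch of the hashing trick. It is not the Keras source: stable_hash and hashing_trick_sketch are made-up names, MD5 stands in for Python's built-in hash only so the result is the same on every run, and the index range 1..n-1 (index 0 left unused) is an assumption based on how I understand hashing_trick to behave.

```python
import hashlib


def stable_hash(word: str) -> int:
    """Deterministic stand-in for the built-in hash (assumption, for reproducibility)."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)


def hashing_trick_sketch(words, n):
    """Map each word to an index in 1..n-1; index 0 is assumed to stay unused."""
    return [stable_hash(w) % (n - 1) + 1 for w in words]


words = "one hot encoding in keras".split()

# Same setup as the question: vocabulary size equal to the number of words.
indices = hashing_trick_sketch(words, n=len(words))
print(indices)

# Five words share only four possible indices (1..4),
# so by the pigeonhole principle at least two of them must collide.
has_collision = len(set(indices)) < len(words)
print(has_collision)
```

Under that assumption about the index range, repeated integers in the question's setup are not bad luck: they cannot be avoided, whatever hash function is used.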

Following Jason Brownlee's suggestion in your case gives:

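from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from tensorflow.random import set_random_seed  # TF 1.x name; TF 2.x renamed it to tf.random.set_seed
import math

# Note: this seeds TensorFlow's RNG only. one_hot relies on Python's built-in hash,
# which for strings varies between interpreter runs unless PYTHONHASHSEED is set.
set_random_seed(1)
text = 'One hot encoding in Keras'
tokens = text_to_word_sequence(text)
length = len(tokens)
# Use a vocabulary size 25% larger than the number of words
print(one_hot(text, math.ceil(length*1.25)))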

which returns the integers

[3,4,5,1,6]
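A larger vocabulary size only makes collisions less likely; it does not rule them out. As a rough sketch, assuming the hash spreads words uniformly over the n - 1 usable indices (prob_all_unique is a made-up helper, and the uniformity and index range are my assumptions, not something the Keras docs promise), the chance that all words get distinct indices is a birthday-problem product:

```python
def prob_all_unique(num_words: int, n: int) -> float:
    """Probability that num_words distinct words all get distinct indices,
    assuming a uniform hash over the n - 1 usable indices (1..n-1)."""
    buckets = n - 1
    if num_words > buckets:
        return 0.0  # more words than indices: a collision is certain
    p = 1.0
    for i in range(num_words):
        # The (i+1)-th word must avoid the i indices already taken.
        p *= (buckets - i) / buckets
    return p


print(prob_all_unique(5, 5))            # 0.0 (the question's setup)
print(round(prob_all_unique(5, 7), 4))  # 0.0926 (25% headroom)
print(round(prob_all_unique(5, 50), 4)) # a much larger hash space
```

So under these assumptions a fully unique result with 25% headroom, like the one above, happens in only about 9% of runs. If you need a guaranteed one-to-one mapping, build an explicit word-to-index dictionary instead of hashing.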
