Passing tokens to CountVectorizer

Question


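I have a text classification problem with two types of features:

n-gram features (extracted by CountVectorizer)
other textual features (e.g., the presence of a word from a given lexicon). These features differ from the n-grams because they should be part of any n-gram extracted from the text.

Both types of features are extracted from the text's tokens. I want to run tokenization only once and then pass the tokens both to CountVectorizer and to the other presence-feature extractor. So I want to pass a list of tokens to CountVectorizer, but it only accepts a string as the representation of a sample. Is there a way to pass an array of tokens?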

Answer 1

vla*_*kha 6

Summarizing the answers from @user126350 and @miroli, and this link:

from sklearn.feature_extraction.text import CountVectorizer

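def dummy(doc):
    # Identity function: the document is already a list of tokens
    return doc

cv = CountVectorizer(
    tokenizer=dummy,
    preprocessor=dummy,
    token_pattern=None,  # unused when a tokenizer is given; None avoids a warning in recent scikit-learn versions
)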

docs = [
    ['hello', 'world', '.'],
    ['hello', 'world'],
    ['again', 'hello', 'world']
]

cv.fit(docs)
cv.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0; removed in 1.2
# array(['.', 'again', 'hello', 'world'], dtype=object)


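One thing to keep in mind: before calling transform(), wrap a new tokenized document in a list so that it is treated as a single document, rather than having each token interpreted as a separate document: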

new_doc = ['again', 'hello', 'world', '.']
v_1 = cv.transform(new_doc)
v_2 = cv.transform([new_doc])

v_1.shape
# (4, 4)

v_2.shape
# (1, 4)
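
To connect this back to the question, the sketch below tokenizes each text once and reuses the same token lists for both the n-gram counts and a lexicon-presence feature, then stacks the two feature blocks side by side. The whitespace tokenizer (`str.split`), the `lexicon` set, and the single binary presence column are all hypothetical choices made for illustration, not something the original answers specify.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer


def identity(tokens):
    """Return the pre-tokenized document unchanged."""
    return tokens


texts = ["hello world .", "again hello world"]

# Tokenize exactly once (whitespace split is an assumption for this sketch).
tokenized = [text.split() for text in texts]

# Hypothetical lexicon for the presence feature.
lexicon = {"again", "goodbye"}

# Feature block 1: token counts from the pre-tokenized documents.
cv = CountVectorizer(tokenizer=identity, preprocessor=identity, token_pattern=None)
X_counts = cv.fit_transform(tokenized)

# Feature block 2: 1 if any token of the document is in the lexicon, else 0.
presence = np.array(
    [[int(any(tok in lexicon for tok in doc))] for doc in tokenized]
)

# Combine both blocks into one sparse feature matrix.
X = hstack([X_counts, csr_matrix(presence)]).tocsr()
print(X.shape)
```

The last column of `X` holds the presence feature; the preceding columns follow the order of `cv.get_feature_names_out()`.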


Answer 2

use*_*350 3

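In general, you can pass a custom tokenizer parameter to CountVectorizer. The tokenizer should be a function that takes a string and returns an array of its tokens. However, if you already have your tokens in arrays, you can simply build a dictionary of the token arrays under some arbitrary keys and have your tokenizer look them up in that dictionary. Then, when you run CountVectorizer, just pass the dictionary keys. For example,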

from sklearn.feature_extraction.text import CountVectorizer

# arbitrary token arrays and their keys
custom_tokens = {"hello world": ["here", "is", "world"],
                 "it is possible": ["yes it", "is"]}

CV = CountVectorizer(
    # so we can pass it strings
    input='content',
    # turn off preprocessing of strings to avoid corrupting our keys
    lowercase=False,
    preprocessor=lambda x: x,
    # use our token dictionary
    tokenizer=lambda key: custom_tokens[key])

CV.fit(custom_tokens.keys())

