Using sklearn TfidfVectorizer with already tokenized input?

Question

gre*_*123 8 scikit-learn tfidfvectorizer

I have a list of tokenized sentences and would like to fit a tfidf vectorizer on it. I tried the following:

from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one'], ['this', 'is', 'another']]

def identity_tokenizer(text):
    return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english')
tfidf.fit_transform(tokenized_list_of_sentences)

which fails with

AttributeError: 'list' object has no attribute 'lower'

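Is there a way to do this? I have a billion sentences and do not want to tokenize them again. They were already tokenized in an earlier stage of the pipeline.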

Answer 1

pml*_*mlk 10

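Try initializing the TfidfVectorizer object with the parameter lowercase=False (assuming that is actually what you want, because the tokens were already lowercased in the earlier stage). By default the vectorizer lowercases each document as if it were a string, and that is the step that fails on a list.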

from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one', 'basketball'], ['this', 'is', 'a', 'football']]

def identity_tokenizer(text):
    return text

# lowercase=False: each document is a list of tokens, not a string
tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)
tfidf.fit_transform(tokenized_list_of_sentences)

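Note that I changed the sentences, because the original ones apparently contained only stop words, which caused a different error due to an empty vocabulary.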
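
To make both points concrete, here is a minimal sketch of the same approach, assuming a reasonably recent scikit-learn. The helper name fit_pretokenized is hypothetical and only used for illustration, and token_pattern=None is an addition of mine to signal that the default regex is unused when a custom tokenizer is supplied. The second call reuses the sentences from the question to show the empty-vocabulary case.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def identity_tokenizer(tokens):
    # Each "document" is already a list of tokens, so return it unchanged.
    return tokens


def fit_pretokenized(docs):
    # Hypothetical helper: fits a TfidfVectorizer on pre-tokenized documents.
    # lowercase=False skips the lowercasing step that fails on lists;
    # token_pattern=None marks the default regex as unused (assumption:
    # accepted by the installed scikit-learn version).
    tfidf = TfidfVectorizer(
        tokenizer=identity_tokenizer,
        lowercase=False,
        token_pattern=None,
        stop_words='english',
    )
    matrix = tfidf.fit_transform(docs)
    return tfidf, matrix


tfidf, matrix = fit_pretokenized([
    ['this', 'is', 'one', 'basketball'],
    ['this', 'is', 'a', 'football'],
])
print(sorted(tfidf.vocabulary_))  # terms that survived stop-word removal
print(matrix.shape)               # (number of documents, number of terms)

# The sentences from the question contain only English stop words,
# so fitting on them leaves nothing in the vocabulary.
try:
    fit_pretokenized([['this', 'is', 'one'], ['this', 'is', 'another']])
except ValueError as err:
    print('ValueError:', err)
```

With the two example sentences, only 'basketball' and 'football' should remain after stop-word removal. If the tokens were not lowercased upstream, lowercase them in the earlier stage rather than here, since lowercase=False leaves them exactly as given.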

Archived:	7 years, 9 months ago
Views:	5104
Last activity:	7 years, 3 months ago