Python, sklearn, tf-idf: how to split on "####" instead of the default whitespace

Yao*_*ian 2 python split tf-idf scikit-learn

Using sklearn's tf-idf, the text is split on whitespace by default:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?'
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

However, I want to use input in this form:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    'This####is####the####first####document.',
    'This####is####the####second####second####document.'
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
word = vectorizer.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0
weight = tfidf.toarray()

How can I do this?

sna*_*ile 5

Use a custom tokenizer:

def four_pounds_tokenizer(s):
    return s.split('####')

# Passing a tokenizer overrides the default whitespace/token-pattern splitting
vectorizer = CountVectorizer(tokenizer=four_pounds_tokenizer)
X = vectorizer.fit_transform(corpus)