Lia*_*ris 5 python machine-learning scikit-learn
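I am looking at sklearn's TfidfVectorizer, specifically its preprocessor input parameter, which is documented as follows:
"Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps."
I am trying to figure out what exactly the preprocessing stage does (if anything) when I do not override it.
I ran an experiment in which I look at the number of stored elements in the resulting sparse matrix, using the following code: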
vectorizer = TfidfVectorizer(stop_words=words, preprocessor=process, ngram_range=(1,1), strip_accents='unicode')
vect = vectorizer.fit_transform(twenty_train.data)
items_stored = vect.nnz
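With a preprocessor that applies
s = re.sub("[^a-zA-Z]", " ", s)
the resulting matrix stores 1331597 elements. Clearly this differs from the default sklearn result with no custom preprocessing, and I am trying to replicate the default preprocessing step. I am struggling to find documentation on what exactly the preprocessor does by default.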
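I also checked TfidfVectorizer itself, but I could not work out from there what the preprocessor is doing either.
Does anyone know what code sklearn's default preprocessor runs, or which preprocessing steps it applies?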
Is this what you are looking for?
def build_preprocessor(self):
    """Return a function to preprocess the text before tokenization"""
    if self.preprocessor is not None:
        return self.preprocessor

    # unfortunately python functools package does not have an efficient
    # `compose` function that would have allowed us to chain a dynamic
    # number of functions. However the cost of a lambda call is a few
    # hundreds of nanoseconds which is negligible when compared to the
    # cost of tokenizing a string of 1000 chars for instance.
    noop = lambda x: x

    # accent stripping
    if not self.strip_accents:
        strip_accents = noop
    elif callable(self.strip_accents):
        strip_accents = self.strip_accents
    elif self.strip_accents == 'ascii':
        strip_accents = strip_accents_ascii
    elif self.strip_accents == 'unicode':
        strip_accents = strip_accents_unicode
    else:
        raise ValueError('Invalid value for "strip_accents": %s' %
                         self.strip_accents)

    if self.lowercase:
        return lambda x: strip_accents(x.lower())
    else:
        return strip_accents
From here: https://github.com/scikit-learn/scikit-learn/blob/bac89c2/sklearn/feature_extraction/text.py#L230
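Reading that source, the default preprocessor only lowercases (when lowercase is true) and then strips accents (when strip_accents is set). Because a user-supplied preprocessor is returned right away, passing preprocessor=process appears to bypass both of those steps, which may explain part of the difference in stored elements. Below is a minimal standard-library sketch of the same logic. The names default_like_preprocess and process are hypothetical, and the accent stripping is an approximation built on unicodedata rather than sklearn's own helper.

```python
import re
import unicodedata


def strip_accents_unicode(s):
    """Approximate accent stripping: NFKD-normalize, then drop combining marks."""
    normalized = unicodedata.normalize("NFKD", s)
    return "".join(c for c in normalized if not unicodedata.combining(c))


def default_like_preprocess(s, lowercase=True, strip_accents=None):
    """Hypothetical stand-in for the default preprocessor (not sklearn code).

    Mirrors the order in build_preprocessor: lowercase first, then strip accents.
    """
    if lowercase:
        s = s.lower()
    if strip_accents == "unicode":
        s = strip_accents_unicode(s)
    return s


def process(s):
    """Hypothetical custom preprocessor that re-applies the default steps
    before replacing every non-letter character with a space."""
    s = default_like_preprocess(s, lowercase=True, strip_accents="unicode")
    return re.sub("[^a-zA-Z]", " ", s)


print(default_like_preprocess("Café No. 5"))
print(process("Café No. 5").split())
```

If the goal is to add the regex on top of the defaults, a custom function along these lines keeps lowercasing and accent stripping in the pipeline, since the vectorizer will not apply them itself once preprocessor is set.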