What does the default sklearn TfidfVectorizer preprocessor do?

Question

Lia*_*ris 5 python machine-learning scikit-learn

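I am looking at sklearn's TfidfVectorizer, specifically the preprocessor input parameter, which is documented as follows:

"Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps."

I am trying to figure out what, if anything, the preprocessing stage does when I do not override it.

I ran an experiment in which I look at the number of elements stored in the resulting sparse matrix, using the following code: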

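from sklearn.feature_extraction.text import TfidfVectorizer

# words (stop-word list), process (custom preprocessor) and twenty_train (dataset) are defined elsewhere
vectorizer = TfidfVectorizer(stop_words=words, preprocessor=process, ngram_range=(1,1), strip_accents='unicode')
vect = vectorizer.fit_transform(twenty_train.data)
items_stored = vect.nnz  # number of explicitly stored (non-zero) entries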


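When I do not override the preprocessor, the resulting matrix stores 1278323 elements.
When I override the preprocessor with an empty method, the resulting matrix stores 1441372 elements.
When I override the preprocessor with a method containing s = re.sub("[^a-zA-Z]", " ", s), the resulting matrix stores 1331597 elements.
I have not been able to affect the size of the sparse matrix (or the accuracy when it is used for classification) with any other processing step.

Clearly there are differences between the default sklearn result, no preprocessing at all, and my attempts to replicate the preprocessing step. I am struggling to find documentation of what exactly the preprocessor does by default.

I also looked at TfidfVectorizer itself, but I could not work out from there what the preprocessor is doing either.

Does anyone know what code sklearn's default preprocessor runs, or which preprocessing steps it takes?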

Answer 1

m33*_*33n 1

Is this what you are looking for?

def build_preprocessor(self):
    """Return a function to preprocess the text before tokenization"""
    if self.preprocessor is not None:
        return self.preprocessor

    # unfortunately python functools package does not have an efficient
    # `compose` function that would have allowed us to chain a dynamic
    # number of functions. However the cost of a lambda call is a few
    # hundreds of nanoseconds which is negligible when compared to the
    # cost of tokenizing a string of 1000 chars for instance.
    noop = lambda x: x

    # accent stripping
    if not self.strip_accents:
        strip_accents = noop
    elif callable(self.strip_accents):
        strip_accents = self.strip_accents
    elif self.strip_accents == 'ascii':
        strip_accents = strip_accents_ascii
    elif self.strip_accents == 'unicode':
        strip_accents = strip_accents_unicode
    else:
        raise ValueError('Invalid value for "strip_accents": %s' %
                         self.strip_accents)

    if self.lowercase:
        return lambda x: strip_accents(x.lower())
    else:
        return strip_accents


From here: https://github.com/scikit-learn/scikit-learn/blob/bac89c2/sklearn/feature_extraction/text.py#L230
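
Reading the quoted build_preprocessor, the default preprocessor appears to do two things: lowercase the text (when lowercase is true) and then strip accents (when strip_accents is set). It also returns a user-supplied preprocessor unchanged, so passing preprocessor=process would seem to bypass both steps, including strip_accents='unicode'. That would plausibly explain why an empty preprocessor stores more elements: case variants such as "The" and "the" stay separate terms. Below is a minimal standard-library sketch of that behaviour, not scikit-learn's own code. The names default_like_preprocess and vocabulary are hypothetical, and the token regex is only modelled on scikit-learn's default token_pattern.

```python
import re
import unicodedata


def strip_accents_unicode(text):
    """Decompose accented characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def default_like_preprocess(text, lowercase=True, strip_accents=None):
    """Hypothetical stand-in for the callable that build_preprocessor() returns."""
    if lowercase:
        text = text.lower()  # lowercasing happens first, as in the quoted code
    if strip_accents == "unicode":
        text = strip_accents_unicode(text)
    return text


def vocabulary(docs, preprocess):
    """Collect the distinct tokens (two or more word characters) after preprocessing."""
    token_pattern = re.compile(r"\b\w\w+\b")
    return {tok for doc in docs for tok in token_pattern.findall(preprocess(doc))}


docs = ["The caf\u00e9 is open.", "the CAFE is closed."]

# Default-like behaviour: lowercase, then strip accents.
default_vocab = vocabulary(docs, lambda s: default_like_preprocess(s, strip_accents="unicode"))
# An "empty" preprocessor: the text reaches the tokenizer untouched.
noop_vocab = vocabulary(docs, lambda s: s)

print(sorted(default_vocab))  # ['cafe', 'closed', 'is', 'open', 'the']
print(len(default_vocab), len(noop_vocab))  # 5 7
```

To reproduce the default behaviour inside a custom preprocessor, it should be enough to lowercase (and, if wanted, strip accents) before applying any other substitutions. With scikit-learn installed, calling build_preprocessor() on a vectorizer instance returns the actual callable, which can be applied to a sample string for comparison.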
