SMOTE initialisation expects n_neighbors <= n_samples, but n_samples < n_neighbors

Question

Dbe*_*les 2 tf-idf knn scikit-learn oversampling imblearn

I have already pre-cleaned the data; the first few rows look like this:

     [IN] df.head()

    [OUT]   Year    cleaned
         0  1909    acquaint hous receiv follow letter clerk crown...
         1  1909    ask secretari state war whether issu statement...
         2  1909    i beg present petit sign upward motor car driv...
         3  1909    i desir ask secretari state war second lieuten...
         4  1909    ask secretari state war whether would introduc...

I called train_test_split() as follows:

     [IN] X_train, X_test, y_train, y_test = train_test_split(df['cleaned'], df['Year'], random_state=2)
   [Note*] `X_train` and `y_train` are now Pandas.core.series.Series of shape (1785,) and `X_test` and `y_test` are also Pandas.core.series.Series of shape (595,)

I then vectorised the X training and testing data using the following TfidfVectorizer and fit/transform procedure:

     [IN] v = TfidfVectorizer(decode_error='replace', encoding='utf-8', stop_words='english', ngram_range=(1, 1), sublinear_tf=True)
          X_train = v.fit_transform(X_train)
          X_test = v.transform(X_test)

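I'm now at the stage where I would normally apply a classifier (if this were a balanced data set). Instead, I initialise imblearn's SMOTE() class to perform over-sampling...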

     [IN] smote_pipeline = make_pipeline_imb(SMOTE(), classifier(random_state=42))
          smote_model = smote_pipeline.fit(X_train, y_train)
          smote_prediction = smote_model.predict(X_test)

...but this results in:

     [OUT] ValueError: "Expected n_neighbors <= n_samples, but n_samples = 5, n_neighbors = 6.

I have tried reducing n_neighbors, but to no avail. Any tips or advice would be much appreciated. Thanks for reading.

-------------------------------------------------- -------------------------------------------------- --------------------------------

Edit:

Full traceback

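The dataset/dataframe (df) contains 2380 rows across two columns, as shown by df.head() above. X_train contains 1785 of these rows as lists of strings (df['cleaned']), and y_train contains the matching 1785 rows as strings (df['Year']).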

After vectorisation with TfidfVectorizer(), X_train and X_test are converted from pandas.core.series.Series of shape '(1785,)' and '(595,)' respectively to scipy.sparse.csr.csr_matrix of shape '(1785, 126459)' and '(595, 126459)' respectively.

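As for the number of classes: using Counter(), I calculated that there are 199 classes (years). Each instance of a class is attached to one element of the df['cleaned'] data described above, which contains a list of strings extracted from a text corpus.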
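The per-class counts mentioned above are what decide whether SMOTE can run. A minimal sketch of that check, assuming SMOTE's default `k_neighbors=5`, which means each class needs at least 6 samples (the `n_neighbors = 6` in the error). The helper name `classes_too_small_for_smote` and the toy labels are hypothetical, not taken from the real data:

```python
from collections import Counter


def classes_too_small_for_smote(labels, k_neighbors=5):
    """Return {class: count} for classes with fewer than k_neighbors + 1 samples."""
    counts = Counter(labels)
    needed = k_neighbors + 1  # each sample needs k neighbours besides itself
    return {cls: n for cls, n in counts.items() if n < needed}


# Toy labels standing in for y_train (hypothetical values).
toy_years = [1909] * 8 + [1910] * 5 + [1911] * 2
print(classes_too_small_for_smote(toy_years))  # {1910: 5, 1911: 2}
```

On the real data this would be called with `y_train`; any class it reports is one that could trigger the ValueError.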

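The objective of this process is to automatically determine/guess the year, decade or century (any degree of classification will do!) of input text data based on the vocabulary present.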

Answer 1

小智 8

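Since there are approximately 200 classes and 1800 samples in the training set, you have on average 9 samples per class. The reason for the error message is that a) the data are probably not perfectly balanced, so some classes have fewer than 6 samples, and b) the number of neighbors is 6. A few solutions for your problem: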

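1. Calculate the minimum number of samples (n_samples) among the 199 classes and set the n_neighbors parameter of the SMOTE class to a value less than or equal to n_samples.
2. Exclude the classes with n_samples < n_neighbors from oversampling by using the ratio parameter of the SMOTE class.
3. Use the RandomOverSampler class, which has no similar restriction.
4. Combine solutions 2 and 3: create a pipeline that uses SMOTE and RandomOverSampler so that SMOTE is applied to the classes satisfying n_neighbors <= n_samples, and random oversampling is used for the classes that do not.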
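The first two options can be sketched with the standard library alone. This is a rough illustration, not the answerer's own code: the helper names `largest_safe_k` and `smote_targets` and the toy labels are hypothetical. It assumes the `n_neighbors` in the error equals SMOTE's `k_neighbors` plus one, so a class needs more than `k_neighbors` samples.

```python
from collections import Counter


def largest_safe_k(labels):
    """Option 1: the largest k_neighbors every class can support."""
    # SMOTE needs k_neighbors + 1 <= n_samples in each class it resamples.
    return min(Counter(labels).values()) - 1


def smote_targets(labels, k_neighbors=5):
    """Option 2: target counts for the classes big enough for SMOTE.

    Classes with k_neighbors samples or fewer are left out of the dict,
    which is meant to exclude them from oversampling.
    """
    counts = Counter(labels)
    majority = max(counts.values())
    return {cls: majority for cls, n in counts.items() if n > k_neighbors}


# Toy labels standing in for y_train (hypothetical values).
toy_years = [1909] * 8 + [1910] * 6 + [1911] * 2
print(largest_safe_k(toy_years))   # 1
print(smote_targets(toy_years))    # {1909: 8, 1910: 8}
```

In recent imbalanced-learn releases these values would be passed as `SMOTE(k_neighbors=k)` or `SMOTE(sampling_strategy=targets)`; older releases called the second parameter `ratio`. A result of 0 from `largest_safe_k` means some class has a single sample, which SMOTE cannot interpolate from at all.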

@Dbercules: Hi, could you guide me on how you built the pipeline? I tried `sm = SMOTE(random_state=42)` `rm = RandomOverSampler(random_state=42)` `my_pipe = make_pipeline(sm, rm)` `X_res, Y_res = my_pipe.fit_resample(X, y)` but got an error, the same one as in the question title (2 upvotes)

Answer 2

Rem*_*ish 5

Try the following code for SMOTE:

oversampler=SMOTE(kind='regular',k_neighbors=2)

This worked for me.

I get this error: `TypeError: __init__() got an unexpected keyword argument 'kind'` (2 upvotes)

Archived:	8 years, 2 months ago
Viewed:	4977 times
Last active:	6 years, 9 months ago
