"max_df corresponds to < documents than min_df" error in Ridge classifier

ath*_*_nn 6 machine-learning tf-idf mongodb

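I trained a Ridge classifier on a large amount of data, vectorized with the tfidf vectorizer, and it used to work fine. But now I am getting this error:

'max_df corresponds to < documents than min_df'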

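The data is stored in MongoDB.
I tried various solutions, and in the end, when I dropped a collection in MongoDB that contained only 1 document (1 record), it worked and the training completed as usual.

But I need a solution that does not require deleting the record, because I need that record.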

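Also, I don't understand the error, because it does not occur everywhere: even with that record present in the database, the script runs fine on my own machine, and it runs fine on other systems as well.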

Can anyone help?

Lui*_*ano 7

That error is telling you that your max_df value is less than the min_df value. For example:

max_df = 0.7 # Removes terms whose DF is higher than 70% of the documents

min_df = 5 # Terms must have DF >= 5 to be considered

Now suppose that the total number of documents in your corpus is 7. Then max_df becomes 0.7*7 = 4.9 while min_df is still 5, so max_df < min_df. That should never happen, because it means that 0 terms will be considered: no term can have a DF that is at most 4.9 and at least 5 at the same time.
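To make the arithmetic above concrete, here is a minimal sketch in plain Python. It does not call scikit-learn; it only mirrors the rule as I understand it (integers are absolute document counts, floats are a fraction of the corpus). The function names `doc_count_bounds` and `bounds_are_valid` are made up for this illustration.

```python
from numbers import Integral


def doc_count_bounds(min_df, max_df, n_docs):
    """Convert min_df / max_df into absolute document counts.

    Assumption: an int is an absolute count, a float is a fraction of n_docs.
    """
    min_count = min_df if isinstance(min_df, Integral) else min_df * n_docs
    max_count = max_df if isinstance(max_df, Integral) else max_df * n_docs
    return min_count, max_count


def bounds_are_valid(min_df, max_df, n_docs):
    """Return True when at least one document frequency can satisfy both bounds."""
    min_count, max_count = doc_count_bounds(min_df, max_df, n_docs)
    return max_count >= min_count


# 7 documents: the upper bound (about 4.9) falls below the lower bound (5).
print(bounds_are_valid(min_df=5, max_df=0.7, n_docs=7))

# 8 documents: the upper bound (about 5.6) is above the lower bound (5).
print(bounds_are_valid(min_df=5, max_df=0.7, n_docs=8))
```

The same settings can therefore pass or fail depending only on how many documents reach the vectorizer, which may be why the error shows up for some data and not for other data.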

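  • Check whether you are passing an integer value as the parameter; integer values are interpreted as absolute counts. If you want to avoid terms that appear in 100% of the documents, you must pass 1.0 (a float) as the parameter. (2 upvotes)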
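The int-versus-float point in the comment above can be sketched like this, again in plain Python and without calling scikit-learn. `as_doc_count` and `pick_df_params` are hypothetical helpers, and the 1-document case is only a guess at what might happen with a very small collection, since the question does not show the actual parameters. The fallback pair (1, 1.0) is, as far as I know, the vectorizer's default.

```python
from numbers import Integral


def as_doc_count(df, n_docs):
    # Assumption: int -> absolute number of documents, float -> fraction of the corpus.
    return df if isinstance(df, Integral) else df * n_docs


# With 50 documents, max_df=1 (int) caps terms at 1 document,
# while max_df=1.0 (float) caps them at all 50 documents.
n_docs = 50
int_limit = as_doc_count(1, n_docs)
float_limit = as_doc_count(1.0, n_docs)
print(int_limit, float_limit)


def pick_df_params(n_docs, min_df, max_df):
    """Hypothetical guard: fall back to (1, 1.0) when the bounds cross."""
    if as_doc_count(max_df, n_docs) < as_doc_count(min_df, n_docs):
        return 1, 1.0
    return min_df, max_df


# A corpus of a single document with min_df=1 and max_df=0.9 has crossed bounds.
print(pick_df_params(n_docs=1, min_df=1, max_df=0.9))

# A larger corpus keeps the requested settings.
print(pick_df_params(n_docs=100, min_df=5, max_df=0.7))
```

A guard like this would let the record stay in the database, at the cost of silently using looser filtering for tiny corpora, so it is worth logging whenever the fallback is taken.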