How to avoid overfitting in a genetic algorithm

use*_*037 6 genetic-algorithm

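I am facing the following problem. I have a system that produces a ranking of operations according to their anomaly scores. To improve performance, I implemented a genetic algorithm to perform feature selection, so that the most anomalous operations appear in the first positions. What I am doing is not exactly feature selection, because I am not using binary variables but floating-point variables between 0 and 1 whose sum equals 1.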

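Currently, I have a population of 200 individuals and run 50 generations. I use the system itself as the evaluation function, and I assess the quality of a solution with the true positive rate, counting how many anomalous operations appear in the first N positions (where N is the number of anomalous operations). As operators I use uniform crossover, and for mutation I change the value of one cell of the individual. Of course, each time I check that the individual's values sum to 1. Finally, I use elitism to preserve the best solution found so far.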
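For readers who want something concrete, the setup described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the real system's scoring is not shown in the question, so a plain weighted sum of feature values stands in for it, and all function names (`normalize`, `top_n_true_positive_rate`, `uniform_crossover`, `mutate_one_cell`) are hypothetical.

```python
import random


def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total == 0:
        # Degenerate case: fall back to equal weights.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def top_n_true_positive_rate(weights, features, is_anomalous):
    """Fraction of anomalous operations ranked in the first N positions,
    where N is the number of anomalous operations.

    ASSUMPTION: the anomaly score is a weighted sum of the features;
    the real system in the question may score operations differently.
    """
    scores = [sum(w * x for w, x in zip(weights, row)) for row in features]
    n = sum(1 for flag in is_anomalous if flag)
    if n == 0:
        return 0.0
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:n] if is_anomalous[i])
    return hits / n


def uniform_crossover(parent_a, parent_b, rng):
    """Take each cell from either parent with equal probability, then renormalize."""
    child = [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]
    return normalize(child)


def mutate_one_cell(individual, rng):
    """Replace one randomly chosen cell with a new value in [0, 1), then renormalize."""
    mutated = list(individual)
    mutated[rng.randrange(len(mutated))] = rng.random()
    return normalize(mutated)
```

Renormalizing after every operator is one simple way to keep the sum-to-1 constraint; it is not necessarily what the asker does.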

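I observed that one feature gets a very high value. That feature is often important, but not always, and this leaves the other features with very low values. I suspect my GA is overfitting. Can you help me find a good stopping criterion?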

zeg*_*jan 13

Overfitting in genetic algorithms and genetic programming is a big issue and currently a research focus of the GP community, including myself. Most of the research is aimed at genetic programming and the evolution of classification/regression models, but it might also relate to your problem. There are some papers which might help you (and which I am working with too):

  • Gonçalves, Ivo, and Sara Silva. "Experiments on controlling overfitting in genetic programming." Proceedings of the 15th Portuguese Conference on Artificial Intelligence: Progress in Artificial Intelligence, EPIA. Vol. 84. 2011.
  • Langdon, W. B. "Minimising testing in genetic programming." RN 11.10 (2011): 1.
  • Gonçalves, Ivo, et al. "Random sampling technique for overfitting control in genetic programming." Genetic Programming. Springer Berlin Heidelberg, 2012. 218-229.
  • Gonçalves, Ivo, and Sara Silva. "Balancing learning and overfitting in genetic programming with interleaved sampling of training data." Springer Berlin Heidelberg, 2013.

You can find the papers (the first two directly in pdf) by searching for their titles in scholar.google.com.

Basically, all these papers work with the idea of using only a subset of the training data to direct the evolution and (randomly) changing this subset every generation (using the same subset for all individuals in one generation). Interestingly, experiments show that the smaller this subset is, the less overfitting occurs, up to the extreme of using only a single-element subset. The papers build on this idea and extend it with some tweaks (like switching between the full dataset and a subset). But as I said in the beginning, all this is aimed at symbolic regression (more or less) and not feature selection.
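A minimal sketch of that per-generation resampling, assuming a generic evolutionary loop: `fitness(individual, subset)` and `breed(scored, rng)` are hypothetical callables you would supply yourself, and this is not the exact procedure from any of the papers.

```python
import random


def evolve_with_random_subsets(population, training_data, fitness, breed,
                               generations=50, subset_size=1, seed=0):
    """Evolve a population, scoring it on a fresh random subset each generation.

    fitness(individual, subset) -> float   (hypothetical, user-supplied)
    breed(scored, rng) -> new population   (hypothetical, user-supplied;
        `scored` is a list of (fitness_value, individual) pairs)
    """
    rng = random.Random(seed)
    data = list(training_data)
    for _ in range(generations):
        # One subset per generation, shared by every individual in it.
        subset = rng.sample(data, subset_size)
        scored = [(fitness(ind, subset), ind) for ind in population]
        population = breed(scored, rng)
    return population
```

With `subset_size=1` this corresponds to the single-element extreme mentioned above. Because fitness values from different generations are measured on different subsets, they are not directly comparable across generations.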

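Personally, I once tried another approach (again for symbolic regression via genetic programming): using a subset of the training data (e.g. a half) to drive the evolution (i.e. for fitness), while determining the "best-so-far" solution from the results on the remaining training data. The overfitting was much less significant.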
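One way this split could look in code, as a rough sketch only: `split_half` and `update_best_so_far` are hypothetical helper names, and `fitness(individual, data)` is assumed to return a score where higher is better.

```python
import random


def split_half(data, rng):
    """Shuffle the training data and split it into two halves:
    one to drive the evolution, one to pick the best-so-far solution."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def update_best_so_far(best, population, fitness, held_out_part):
    """Keep the individual that scores highest on the held-out part.

    `best` is None or a (score, individual) pair from earlier generations.
    The held-out part is never used for selection inside the GA itself.
    """
    for individual in population:
        score = fitness(individual, held_out_part)
        if best is None or score > best[0]:
            best = (score, individual)
    return best
```

In a loop, the GA would compute fitness on the first half each generation and call `update_best_so_far` with the second half, returning `best[1]` at the end instead of the individual with the highest training fitness.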

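  • With all due respect, Honza, I helped direct your attention even though you refused to be more specific about any small detail of your problem/approach. I respect your decision not to reveal any further details about the GP/EP algorithm, the fitness function, or the constrained/unconstrained type of generation, **but it seems rather impolite and unfair to ask for additional help and advice from people to whom you previously refused to reveal more about your problem and earlier solutions** (2 upvotes)
  • @zegkljan Thanks for the great answer! Have you developed the approach described in the last part of your answer any further? Is there a publication about it? (2 upvotes)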