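I am facing the following problem. I have a system that produces a ranking of operations according to their anomaly score. To improve performance, I implemented a genetic algorithm to perform feature selection, so that the most anomalous operations appear in the first positions. Strictly speaking, what I am doing is not feature selection, because I am not using binary variables but float variables between 0 and 1 whose sum equals 1.
Currently, I run 50 generations with a population of 200 individuals. I use the system itself as the evaluation function, and I measure the quality of a solution with the true positive rate, counting how many anomalous operations appear in the first N positions (where N is the number of anomalous operations). As the crossover operator I use uniform crossover, and for mutation I change the value of one cell of an individual. Of course, each time I check that the individual's values sum to 1. Finally, I use elitism to preserve the best-so-far solution.
I observed that one feature gets a very high value. That feature is often important, but not always, and its high value forces the other features to very low values. I suspect that my GA is overfitting. Can you help me find a good stopping criterion?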
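For readers who want something concrete, the setup described above can be sketched roughly as follows. This is only an illustration, not the actual implementation: the function names (`normalize`, `top_n_hit_rate`, `uniform_crossover`, `mutate`) and the mutation rate are assumptions, and the anomaly scores are taken as given, since the scoring system itself is not described.

```python
import random

def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total == 0:
        # Degenerate case: fall back to uniform weights.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]

def top_n_hit_rate(scores, labels):
    """Fraction of true anomalies ranked in the top N, where N is the
    number of anomalies (labels are 1 for anomalous, 0 otherwise)."""
    n = sum(labels)
    if n == 0:
        return 0.0
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in ranked[:n]) / n

def uniform_crossover(a, b, rng):
    """Take each gene from either parent with equal probability, then renormalize."""
    child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
    return normalize(child)

def mutate(individual, rng, rate=0.1):
    """Resample each gene with probability `rate` (assumed value), then renormalize."""
    mutated = [rng.random() if rng.random() < rate else w for w in individual]
    return normalize(mutated)
```

Renormalizing after every operator is one simple way to keep the sum-to-1 constraint; other repair strategies are possible.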
zeg*_*jan 13
Overfitting in genetic algorithms and genetic programming is a big issue and is currently a research focus of the GP community, including myself. Most of the research is aimed at genetic programming and the evolution of classification/regression models, but it might also relate to your problem. There are some papers which might help you (and which I am working with too):
You can find the papers (the first two directly in pdf) by searching for their titles in scholar.google.com.
Basically, what all the papers work with is the idea of using only a subset of the training data to direct the evolution and (randomly) changing this subset every generation, with the same subset used for all individuals within one generation. Interestingly, experiments show that the smaller this subset is, the less overfitting occurs, up to the extreme of using only a single-element subset. The papers build on this idea and extend it with some tweaks (like switching between the full dataset and a subset). But as I said in the beginning, all of this is aimed (more or less) at symbolic regression and not at feature selection.
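A minimal sketch of the per-generation subset idea, assuming a hypothetical `fitness(individual, samples)` callable and training data held in a list; neither name comes from the papers:

```python
import random

def evaluate_generation(population, data, fitness, subset_size, rng):
    """Score every individual on one shared random subset of the training data.

    Call this once per generation: the subset is redrawn on each call, but
    all individuals within that generation are scored on the same samples.
    """
    subset = rng.sample(data, subset_size)
    scores = [fitness(individual, subset) for individual in population]
    return scores, subset
```

Because the subset changes between generations, fitness values from different generations are not directly comparable, which matters if elitism carries an individual over together with its old score.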
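I have personally tried another approach (again, for symbolic regression via genetic programming): use a subset of the training data (e.g. half) to drive the evolution (i.e. for the fitness), but pick the "best-so-far" solution according to its results on the remaining training data. The overfitting was much less significant.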
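One way this could be turned into a stopping criterion is sketched below. It is an assumption layered on top of the approach above, not part of it: the `patience` parameter and the names `generation_champions` and `holdout_fitness` are hypothetical. The evolution is driven by one half of the data elsewhere; here each generation's champion is re-scored on the held-out half, and the run stops once the held-out score has not improved for `patience` generations.

```python
def track_best_on_holdout(generation_champions, holdout_fitness, patience=10):
    """Keep the champion that scores best on held-out data.

    generation_champions: iterable yielding the best individual of each
        generation (as judged on the training half).
    holdout_fitness: callable scoring an individual on the held-out half.
    Returns (best_individual, best_score, last_generation_index).
    """
    best, best_score = None, float("-inf")
    stale = 0   # generations since the held-out score last improved
    gen = -1    # stays -1 if no generations were produced
    for gen, champion in enumerate(generation_champions):
        score = holdout_fitness(champion)
        if score > best_score:
            best, best_score, stale = champion, score, 0
        else:
            stale += 1
            if stale >= patience:
                break  # held-out score has plateaued: stop evolving
    return best, best_score, gen
```

If `generation_champions` is a generator that runs one GA generation per item, breaking out of the loop also stops the evolution itself.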