Weka's PCA takes too long to run

Question

ami*_*mit 9 java algorithm machine-learning weka

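I am trying to use Weka for feature selection with the PCA algorithm.

My original feature space contains ~9000 attributes, across 2700 samples.
I tried to reduce the dimensionality of the data with the following code: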

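import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.PrincipalComponents;
import weka.attributeSelection.Ranker;
import weka.core.Instances;

// PCA as the attribute evaluator, with Ranker ordering the resulting components.
AttributeSelection selector = new AttributeSelection();
PrincipalComponents pca = new PrincipalComponents();
Ranker ranker = new Ranker();
selector.setEvaluator(pca);
selector.setSearch(ranker);
// SamplesManager is my own helper for converting to and from Weka instances.
Instances instances = SamplesManager.asWekaInstances(trainSet);
try {
    // This is the call that never returns.
    selector.SelectAttributes(instances);
    return SamplesManager.asSamplesList(selector.reduceDimensionality(instances));
} catch (Exception e) {
    ...
}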

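However, it did not finish within 12 hours. It is stuck in the call to selector.SelectAttributes(instances);.

My questions are: Is such a long computation time expected for Weka's PCA? Or am I using PCA incorrectly?

If the long running time is expected:
How can I tune the PCA algorithm to run much faster? Can you suggest an alternative (with example code showing how to use it)?

If it is not:
What am I doing wrong? How should I invoke PCA with Weka and reduce the dimensionality?

Update: The comments confirm my suspicion that it is taking much longer than expected.
I would like to know: how can I get PCA in Java, using Weka or an alternative library?
Added a bounty for this.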

Answer 1

ami*_*mit 10

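After digging into the WEKA code, I found that the bottleneck is building the covariance matrix and then computing the eigenvectors of that matrix. Even switching to a sparse matrix implementation (I used COLT's SparseDoubleMatrix2D) did not help.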
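
To get a feel for why those two steps dominate, here is a rough back-of-the-envelope estimate for the sizes in the question (2700 samples, ~9000 attributes). It is only a sketch: the class and method names are made up for illustration, the operation counts ignore constant factors, and it says nothing about how WEKA actually implements either step.

```java
public class PcaCostEstimate {
    /** Number of entries in a dense d x d covariance matrix. */
    public static long covarianceEntries(long d) {
        return d * d;
    }

    /** Bytes needed to hold that matrix as 8-byte doubles. */
    public static long covarianceBytes(long d) {
        return covarianceEntries(d) * Double.BYTES;
    }

    /** Order of magnitude of multiply-adds to build the covariance matrix from n samples. */
    public static long covarianceOps(long n, long d) {
        return n * d * d;
    }

    /** Order of magnitude of a dense symmetric eigendecomposition (cubic in d, constants ignored). */
    public static long eigenOps(long d) {
        return d * d * d;
    }

    public static void main(String[] args) {
        long n = 2700;  // samples, from the question
        long d = 9000;  // attributes, from the question
        System.out.println("covariance entries: " + covarianceEntries(d));
        System.out.println("covariance bytes:   " + covarianceBytes(d));
        System.out.println("covariance ops:     " + covarianceOps(n, d));
        System.out.println("eigen ops:          " + eigenOps(d));
    }
}
```

For d = 9000 this gives 81,000,000 covariance entries (648,000,000 bytes as dense doubles) and on the order of 7.29e11 operations for the eigendecomposition. Because that last term is cubic in d, halving the number of attributes before PCA cuts it by roughly a factor of eight.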

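The solution I came up with is to first reduce the dimensionality with a fast first-pass method (I used an information gain ranker, plus filtering based on document frequency), and then run PCA on the reduced set of attributes to cut the dimensionality further.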

The actual code is more complex, but it basically boils down to:

// Stage 1: rank attributes by information gain and keep the top FIRST_SIZE_REDUCTION.
Ranker ranker = new Ranker();
InfoGainAttributeEval ig = new InfoGainAttributeEval();
Instances instances = SamplesManager.asWekaInstances(trainSet);
ig.buildEvaluator(instances);
int[] firstAttributes = ranker.search(ig, instances);
int[] candidates = Arrays.copyOfRange(firstAttributes, 0, FIRST_SIZE_REDUCTION);
// reduceDimenstions is my own helper (not shown); it keeps only the candidate attributes.
instances = reduceDimenstions(instances, candidates);

// Stage 2: run PCA on the already reduced attribute set.
PrincipalComponents pca = new PrincipalComponents();
pca.setVarianceCovered(var);
ranker = new Ranker();
ranker.setNumToSelect(numFeatures);
AttributeSelection selection = new AttributeSelection();
selection.setEvaluator(pca);
selection.setSearch(ranker);
selection.SelectAttributes(instances);
// Project the reduced instances onto the selected principal components.
instances = selection.reduceDimensionality(instances);

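The document-frequency filter mentioned above is not part of the snippet. A minimal sketch of what such a filter could look like on a plain double[][] matrix follows. It assumes that "document frequency" means the number of samples in which an attribute is non-zero; the class name, method names and the minDocFreq threshold are hypothetical and are not taken from the original code.

```java
import java.util.ArrayList;
import java.util.List;

public class DocumentFrequencyFilter {
    /** Returns the indices of columns that are non-zero in at least minDocFreq rows. */
    public static int[] selectColumns(double[][] data, int minDocFreq) {
        if (data.length == 0) {
            return new int[0];
        }
        int numCols = data[0].length;
        int[] docFreq = new int[numCols];
        // Count, per column, how many rows have a non-zero value.
        for (double[] row : data) {
            for (int j = 0; j < numCols; j++) {
                if (row[j] != 0.0) {
                    docFreq[j]++;
                }
            }
        }
        List<Integer> kept = new ArrayList<>();
        for (int j = 0; j < numCols; j++) {
            if (docFreq[j] >= minDocFreq) {
                kept.add(j);
            }
        }
        return kept.stream().mapToInt(Integer::intValue).toArray();
    }

    /** Copies only the selected columns of data into a new, narrower matrix. */
    public static double[][] project(double[][] data, int[] columns) {
        double[][] reduced = new double[data.length][columns.length];
        for (int i = 0; i < data.length; i++) {
            for (int c = 0; c < columns.length; c++) {
                reduced[i][c] = data[i][columns[c]];
            }
        }
        return reduced;
    }
}
```

A pass like this is linear in the size of the data, so it is cheap to run before the information gain ranking and the PCA step.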

However, when I cross-validated the estimated accuracy, this approach scored worse than simply using greedy information gain with a ranker.
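
The cross-validation code is not shown here. As a generic illustration only, the sketch below splits sample indices into k folds in plain Java. The class name, method name and seed are hypothetical. One thing to keep in mind when comparing the two pipelines this way: the information gain ranking and the PCA should be fitted on the training part of each fold only, otherwise the held-out samples leak into the feature selection.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class KFoldSplit {
    /**
     * Shuffles the indices 0..numSamples-1 and partitions them into k test folds
     * whose sizes differ by at most one. The training set of a fold is every
     * index that is not in that fold.
     */
    public static int[][] testFolds(int numSamples, int k, long seed) {
        if (k < 2 || k > numSamples) {
            throw new IllegalArgumentException("k must be between 2 and numSamples");
        }
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < numSamples; i++) {
            order.add(i);
        }
        Collections.shuffle(order, new Random(seed));

        int[][] folds = new int[k][];
        int start = 0;
        for (int f = 0; f < k; f++) {
            // The first (numSamples % k) folds get one extra sample.
            int size = numSamples / k + (f < numSamples % k ? 1 : 0);
            folds[f] = new int[size];
            for (int i = 0; i < size; i++) {
                folds[f][i] = order.get(start + i);
            }
            start += size;
        }
        return folds;
    }
}
```

With the 2700 samples from the question and k = 10, each fold holds 270 test indices and the remaining 2430 are used to fit the feature selection and the classifier for that fold.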

Could I see the full code, including how you cross-validated the estimated accuracy? (2 upvotes)

Archived:	14 years ago
Views:	6146
Last activity:	10 years, 6 months ago