Reproducibility in scientific programming

Question

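Along with producing incorrect results, one of the worst fears in scientific programming is being unable to reproduce the results you have generated. What best practices help ensure that your analysis is reproducible?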

Answer 1

Publish the original raw data online and make it freely available for download.
Make the code base open source and available online for download.
If randomization is used in optimization, either repeat the optimization several times and keep the best result, or use a fixed random seed so that the same results are repeated.
Before performing your analysis, split the data into a "training/analysis" dataset and a "testing/validation" dataset. Perform your analysis on the "training" dataset, and check that the results still hold on the "validation" dataset, to ensure that your analysis actually generalizes and isn't simply memorizing peculiarities of the dataset in question.
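The split in the last point can itself be made repeatable. Below is a minimal sketch, assuming the records fit in memory; the function name `split_dataset`, the 20% validation fraction, and the seed value are illustrative choices, not anything prescribed above.

```python
import random

def split_dataset(records, validation_fraction=0.2, seed=12345):
    """Shuffle a copy of `records` reproducibly and split it in two.

    Returns (training, validation). The seed is an arbitrary example value;
    what matters is that it is fixed and recorded with the results.
    """
    rng = random.Random(seed)          # private generator, global state untouched
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

training, validation = split_dataset(range(100))
print(len(training), len(validation))
```

Calling `split_dataset` again with the same seed returns the same two lists, so a reader with the raw data can rebuild exactly the partition that was analysed.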

The first two points are incredibly important, because making the dataset available allows others to perform their own analyses on the same data, which increases the level of confidence in the validity of your own analyses. Additionally, making the dataset available online, especially if you use linked data formats, makes it possible for crawlers to aggregate your dataset with other datasets, thereby enabling analyses on larger data sets. In many types of research the sample size is too small to be really confident about the results, but sharing datasets makes it possible to construct very large ones. Or someone could use your dataset to validate an analysis that they performed on some other dataset.

Additionally, making your code open source makes it possible for the code and procedure to be reviewed by your peers. Such reviews often lead to the discovery of flaws or of opportunities for further optimization and improvement. Most importantly, it allows other researchers to improve on your methods without having to reimplement from scratch everything that you have already done. It greatly accelerates the pace of research when researchers can focus on improvements and not on reinventing the wheel.

As for randomization: many algorithms rely on it to achieve their results. Stochastic and Monte Carlo methods are quite common, and while they have been proven to converge in certain cases, separate runs can still give different results. One way to get consistent results is to have a loop in your code that invokes the computation some fixed number of times and keeps the best result. If you use enough repetitions, you can expect to find global or near-global optima instead of getting stuck in local optima. Another possibility is to use a predetermined seed, although that is not, IMHO, as good an approach, since you could pick a seed that leaves you stuck in a local optimum. In addition, there is no guarantee that random number generators on different platforms will generate the same results for that seed value.
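The two approaches can be combined: repeat the search a fixed number of times, and give each repetition its own recorded seed. The sketch below is only an illustration under stated assumptions; the objective `bumpy`, the random hill-climbing `local_search`, and every numeric setting are hypothetical stand-ins for a real optimization.

```python
import math
import random

def bumpy(x):
    """Hypothetical objective with several local minima."""
    return x * x + 10.0 * math.sin(3.0 * x)

def local_search(objective, rng, steps=2000, step_size=0.1):
    """Random hill climbing from a random start; may stop in a local minimum."""
    x = rng.uniform(-10.0, 10.0)
    best = objective(x)
    for _ in range(steps):
        candidate = x + rng.gauss(0.0, step_size)
        value = objective(candidate)
        if value < best:               # accept only improvements
            x, best = candidate, value
    return x, best

def multi_start(objective, base_seed=2024, restarts=20):
    """Run `restarts` seeded searches and keep the lowest objective value."""
    results = []
    for i in range(restarts):
        rng = random.Random(base_seed + i)   # one recorded seed per restart
        results.append(local_search(objective, rng))
    return min(results, key=lambda pair: pair[1])

best_x, best_value = multi_start(bumpy)
print(best_x, best_value)
```

On one Python installation this prints the same pair on every run. As noted above, the same seed is not guaranteed to give the same stream on every platform or library version, so it is worth recording the interpreter version next to the seed.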

Answer 2

Hig*_*ark 9

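I'm a software engineer embedded in a team of research geophysicists, and we're currently (as always) working to improve our ability to reproduce results on demand. Here are a few pointers gleaned from our experience: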

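Put everything under version control: source code, input data sets, makefiles, etc.
When building executables: we embed compiler directives in the executables themselves, we tag the build log with a UUID and tag the executable with the same UUID, we compute checksums for executables, and we autobuild everything and automatically update a database (OK, it's really just a flat file) with the build details, etc.
We use Subversion's keywords to include revision numbers (etc.) in every piece of source code, and these are written into any output files that are generated.
We do a lot of (semi-)automated regression testing to ensure that new versions of the code, or new build variants, produce the same (or similar enough) results, and I'm working on a set of programs to quantify the changes that do occur.
My geophysicist colleagues analyse the programs' sensitivity to changes in the inputs. I analyse their (the codes', not the geophysicists') sensitivity to compiler settings, platform, and so on.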
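The build-tagging step might be sketched as follows. This is not the team's actual tooling, which the answer does not show; `build_record`, `append_to_build_log`, the field names, and the one-JSON-object-per-line flat file are all assumptions made for illustration.

```python
import hashlib
import json
import uuid

def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large executables need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_record(executable_path, revision, compiler_flags):
    """Describe one build: a fresh UUID, the checksum, and how it was built."""
    return {
        "build_id": str(uuid.uuid4()),   # same ID would be stamped on the build log
        "executable": executable_path,
        "sha256": sha256_of(executable_path),
        "revision": revision,            # e.g. a version-control revision string
        "compiler_flags": compiler_flags,
    }

def append_to_build_log(record, log_path):
    """Append the record to a flat file, one JSON object per line."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record, sort_keys=True) + "\n")
```

A later run can recompute the checksum of the executable it is about to use and look it up in the flat file to find out exactly which build produced a given result.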

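We're currently developing a workflow system that will record the details of every job run: input datasets (including versions), output datasets, the program used (including version and variant), parameters, and so on. This is what is commonly called provenance. Once it is up and running, the only way to publish results will be through the workflow system. Every output dataset will contain details of its own provenance, though we haven't done the detailed design of this yet.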
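One possible shape for an output file that carries its own provenance is sketched below. The answer says this part has not been designed yet, so everything here is hypothetical: the `# provenance:` header convention, the function names, and the example field values.

```python
import csv
import json

PREFIX = "# provenance: "

def write_with_provenance(output_path, header, rows, provenance):
    """Write a CSV file whose first line is a JSON provenance comment."""
    with open(output_path, "w", encoding="utf-8", newline="") as out:
        out.write(PREFIX + json.dumps(provenance, sort_keys=True) + "\n")
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)

def read_provenance(output_path):
    """Recover the provenance record from a file written as above."""
    with open(output_path, encoding="utf-8") as handle:
        first_line = handle.readline()
    if not first_line.startswith(PREFIX):
        raise ValueError("no provenance header found")
    return json.loads(first_line[len(PREFIX):])

# Illustrative record; every name and value here is made up.
example_provenance = {
    "inputs": [{"path": "survey.dat", "version": "r1234"}],
    "program": {"name": "migrate", "version": "2.1", "variant": "O2"},
    "parameters": {"grid_spacing": 25.0},
}
```

Keeping the record inside the file, rather than beside it, means the two cannot be separated when the output is copied or mailed around.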

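We are (perhaps too) relaxed about reproducing numerical results down to the least significant digit. The science underlying our work, and the errors inherent in the measurements in our underlying datasets, do not support the validity of any of our numerical results beyond 2 or 3 significant figures.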
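A regression check in this spirit compares results to a few significant figures instead of bit for bit. The sketch below uses a relative tolerance of 0.5 × 10^(1−n) as a rough stand-in for "agrees to n significant figures"; that mapping and the function names are assumptions, not the team's method.

```python
import math

def agree_to(a, b, figures=3, abs_tol=0.0):
    """True if a and b agree to roughly `figures` significant figures.

    Uses a relative tolerance instead of rounding both values, so that two
    close values on either side of a rounding boundary still compare equal.
    """
    rel_tol = 0.5 * 10 ** (1 - figures)
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)

def regressions(reference, candidate, figures=3):
    """Indices at which a new build's results drift from the reference."""
    if len(reference) != len(candidate):
        raise ValueError("result sets differ in length")
    return [i for i, (r, c) in enumerate(zip(reference, candidate))
            if not agree_to(r, c, figures)]

old_build = [1234.0, 0.05021, 98.6]
new_build = [1234.4, 0.05019, 91.0]
print(regressions(old_build, new_build))
```

Here only the last pair differs by more than the tolerance, so only its index is reported. Values expected to be exactly zero need a non-zero `abs_tol`, since a purely relative test never passes against zero.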

We certainly won't be publishing either code or data for peer review; we're in the oil business.

Answer 3

dmc*_*kee 8

There are a lot of good suggestions already. I'll add two more (both from bitter experience, and before publication, thankfully!):

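1) Check your results for stability:

Try several different subsets of the data
Rebin the input
Rebin the output
Tweak the grid spacing
Try several random seeds (if applicable)

If it's not stable, you're not done.

Publish the results of the above testing (or at least keep the evidence and mention that you did it).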
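Two of these checks, random subsets and random seeds, are easy to automate. The sketch below is a minimal illustration: `estimate_mean` is a hypothetical stand-in for a real analysis, and the synthetic data, subset fraction, and seed range are arbitrary.

```python
import random
import statistics

def estimate_mean(data, seed, subset_fraction=0.8):
    """Hypothetical analysis: the mean of a seeded random subset of the data."""
    rng = random.Random(seed)
    subset = rng.sample(data, int(len(data) * subset_fraction))
    return statistics.fmean(subset)

def stability_report(analysis, data, seeds):
    """Run the analysis once per seed and summarise how much the answer moves."""
    results = [analysis(data, seed) for seed in seeds]
    return {
        "min": min(results),
        "max": max(results),
        "mean": statistics.fmean(results),
        "stdev": statistics.stdev(results),
    }

data = [float(i % 17) for i in range(1000)]   # synthetic example data
report = stability_report(estimate_mean, data, seeds=range(10))
print(report)
```

If the spread between `min` and `max` is comparable to the effect being claimed, the result is not stable yet. Saving the report next to the results is one way to keep the evidence mentioned above.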

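2) Spot-check the intermediate results

Yes, you're probably going to develop the method on a small sample and then grind through the whole mess. Peek into the middle a few times while that grinding is going on. Better yet, where possible, collect statistics on the intermediate steps and look for signs of anomalies in them.

Again, if there are any surprises, you have to go back and do it again.

And, again, retain and/or publish this.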
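Collecting statistics on an intermediate step need not mean storing every value. The sketch below keeps a running count, mean, variance, and range using Welford's online update, and counts NaN or infinite values separately; the class name `RunningStats` and the choice of what to track are illustrative assumptions.

```python
import math

class RunningStats:
    """Streaming summary of one intermediate quantity (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0                 # sum of squared deviations from the mean
        self.minimum = math.inf
        self.maximum = -math.inf
        self.non_finite = 0            # NaN or infinity: usually a warning sign

    def add(self, value):
        if not math.isfinite(value):
            self.non_finite += 1
            return
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two finite values are seen."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

stats = RunningStats()
for value in [1.0, 2.0, 3.0, 4.0, float("nan")]:
    stats.add(value)
print(stats.count, stats.mean, stats.variance, stats.non_finite)
```

A non-zero `non_finite` count, or a range far outside what the small-sample development runs produced, is the kind of surprise that means going back and doing it again.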

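Things already mentioned by others that I like include:

Source control. You need it for yourself anyway.
Logging the build environment. Publishing the log is nice too.
Planning to make the code and data available.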

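Another one that no one has mentioned:

3) Document the code

Yes, you're busy writing it, and probably busy designing it as you go along. But I don't mean a detailed document so much as a good explanation of all the surprises. You're going to need to write those up anyway, so think of it as getting a head start on the paper. And you can keep the documentation in source control, so that you can freely throw away chunks that no longer apply; they'll be there if you need them back.

It wouldn't hurt to write a small README with build instructions and a "how to run" blurb. If you're going to make the code available, people are going to ask about this stuff... Plus, for me, checking back against it helps me stay on track.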

Answer 4

Any*_*orn 6

Publish the program code and make it available for review.

This is not directed at you by any means, but here is my rant:

If you do work sponsored by taxpayer money and publish the results in a peer-reviewed journal, provide the source code, under an open source license or in the public domain. I am tired of reading about some great algorithm that somebody came up with, which they claim does x, while they provide no way to verify or check the source code. If I cannot see the code, I cannot verify your results, because implementations of the same algorithm can differ drastically.

In my opinion it is not moral to keep work paid for by taxpayers out of reach of fellow researchers. It's against science to push out papers yet provide no tangible benefit to the public in terms of usable work.

Answer 5

Jac*_*tte 5

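I think a lot of the previous answers missed the "scientific computing" part of your question, and answered with very general advice that applies to any science (make the data and method public), merely specialized to CS.

What they're missing is that you have to be even more specific: you have to state which version of the compiler you used, which switches were used when compiling, which version of the operating system you used, which versions of all the libraries you linked against, what hardware you used, what else was running on your machine at the same time, and so on. There are published papers out there in which every one of these factors influenced the results in a non-trivial way.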
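Part of that list can be captured automatically at the start of every run. The sketch below uses only the Python standard library; the function name `environment_snapshot` and the returned field names are assumptions, and it does not record compiler switches for native libraries or what else was running on the machine, which still have to be logged by other means.

```python
import json
import platform
import sys
from importlib import metadata

def environment_snapshot(packages=()):
    """Collect interpreter, OS, hardware, and package versions for a run log."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None      # requested but not installed
    return {
        "python": sys.version,
        "implementation": platform.python_implementation(),
        "compiler": platform.python_compiler(),   # compiler that built Python
        "os": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }

snapshot = environment_snapshot(["numpy"])
print(json.dumps(snapshot, indent=2, sort_keys=True))
```

Writing this snapshot into the same log or output file as the results keeps the environment description attached to the numbers it produced.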

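For example, on Intel hardware you could be using a library that uses the FPU's 80-bit floats; you do an O/S upgrade, and now that library might use only 64-bit doubles, and your results can change drastically if your problem is in the least bit ill-conditioned.

A compiler upgrade might change the default rounding behaviour, or a single optimization might swap the order in which two instructions are executed, and again, for ill-conditioned systems, boom, different results.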
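Why the order of two operations matters can be seen with ordinary 64-bit doubles. The sketch below does the reordering by hand, since Python itself does not reorder arithmetic; the helper `add_in_order` and the chosen values are illustrative only.

```python
import math

# Floating-point addition is not associative.
left_first = (0.1 + 0.2) + 0.3     # 0.6000000000000001
right_first = 0.1 + (0.2 + 0.3)    # 0.6
print(left_first == right_first)   # False

def add_in_order(values):
    """Plain left-to-right accumulation, with no compensation."""
    total = 0.0
    for value in values:
        total += value
    return total

# An ill-conditioned sum: large terms cancel and small terms carry the answer.
values = [1e16, 1.0, -1e16, 1.0]
print(add_in_order(values))          # 1.0
print(add_in_order(sorted(values)))  # 0.0
print(math.fsum(values))             # 2.0, the correctly rounded sum
```

The same four numbers give 1.0, 0.0, or 2.0 depending only on the order and method of accumulation, which is the kind of change a different optimization setting can introduce without any edit to the source.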

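Heck, there are some funky stories of sub-optimal algorithms coming out "best" in practical tests because they were tested on a laptop that automatically slowed down the CPU when it overheated (which is what the optimal algorithm caused).

None of these things are visible from the source code or the data.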
