我可以确定性地对任意排列的浮点数的向量求和吗?

Ric*_*ard 6 algorithm math floating-point deterministic

假设我有一个(可能很大)的浮点数向量,这些向量是由一些黑盒过程产生的。是否可以计算这些数字的按位可重现的总和?

如果黑盒过程总是以相同的顺序生成数字,那么按位可重现的求和很容易:只需从左到右求和即可。

但是,如果数字以随机顺序生成,也许是因为它们是从异步进程返回和收集的,那么就更难了:我必须对它们进行数字排序。

但是,如果有更多的数字,可能分布在不同的机器上,因此移动它们是不可取的怎么办?

还有一种方法可以确定地对它们求和吗?

Ric*_*ard 5

概述:多遍数据变异方法

\n

(请参阅此处以获得更好的答案。)

\n

是的,有办法。

\n

Demmel 和 Nguyen (2013) 的论文“快速再现浮点求和”描述了一种实现此目的的方法。我在下面提供了它的串行和并行实现。

\n

这里我们将比较三种算法:

\n
    \n
  1. 传统的从左到右求和:对于输入顺序的排列而言,这种求和速度快、不准确且不可重现。
  2. \n
  3. Kahan 求和:速度较慢,非常准确(在输入大小上基本上是O(1)),并且虽然在输入顺序的排列方面不可再现,但更接近狭义上的再现性。
  4. \n
  5. 确定性求和:这比 Kahan 求和稍慢,但相当准确,并且可以按位重现。
  6. \n
\n

传统的求和是不准确的,因为随着总和的增加,我们添加的数字的最低有效位会默默地被丢弃。卡汉求和通过保持运行的“补偿”总和来克服这个问题,该总和保留在最低有效位上。确定性求和使用类似的策略来保持准确性,但在执行求和之前,将输入数字“预舍入”到公共基数,以便可以准确地计算它们的总和,而没有任何舍入误差。

\n

测试

\n

为了研究算法,我们将对 100 万个数字进行测试,每个数字都是从 [-1000, 1000] 范围内均匀抽取的。为了展示算法的答案范围及其确定性,输入将被打乱并求和 100 次。并行算法使用 12 个线程运行。

\n

“正确”求和(通过在长双精度模式下使用 Kahan 求和并选择最常出现的值来确定)是

\n
310844.700699143717685046795\n
Run Code Online (Sandbox Code Playgroud)\n

我要断言的结果是,对于所研究的每种数据类型,确定性算法对于 100 个数据排序中的每一个都会产生相同的结果,并且这些结果对于串行和并行变体都是相同的算法的。

\n

各种算法的结果如下:

\n
Serial Float\n=========================================================\nSimple summation accumulation type   = float\nAverage deterministic summation time = 0.00462s\nAverage simple summation time        = 0.00144s (3.20x faster than deterministic)\nAverage Kahan summation time         = 0.00290s (1.59x faster than deterministic)\nDeterministic value                  = 256430592.000000000 (very different from actual)\nDistinct Kahan values                = 3  (with single-precision accumulator)\nDistinct Simple values               = 93 (with single-precision accumulator)\nDistinct Simple values               = 1  (with double-precision accumulator)\n\nParallel Float\n=========================================================\nSimple summation accumulation type   = float\nAverage deterministic summation time = 0.00576s\nAverage simple summation time        = 0.00206s (2.79x faster than deterministic)\nAverage Kahan summation time         = 0.00599s (0.96x faster than deterministic)\nDeterministic value                  = 256430592.000000000 (very different from actual)\nDistinct Kahan values                = 3  (with single-precision accumulator)\nDistinct Simple values               = 89 (with single-precision accumulator)\nDistinct Simple values               = 1  (with double-precision accumulator)\n\nSerial Double\n=========================================================\nAverage deterministic summation time = 0.00600s\nAverage simple summation time        = 0.00171s (3.49x faster than deterministic)\nAverage Kahan summation time         = 0.00378s (1.58x faster than deterministic)\nDeterministic value                  = 310844.70069914375199005 (epsilon difference from actual value)\nDistinct Kahan values                = 4\nDistinct Simple values               = 88\n\nParallel Double\n=========================================================\nAverage deterministic summation time = 0.01074s\nAverage simple summation time        = 0.00289s (3.71x faster than deterministic)\nAverage Kahan summation time         = 0.00648s (1.65x faster than deterministic)\nDeterministic value                  = 310844.70069914375199005 (epsilon difference from actual value)\nDistinct Kahan values                = 2\nDistinct Simple values               = 83\n\nSerial Long Double\n=========================================================\nAverage deterministic summation time = 0.01072s\nAverage simple summation time        = 0.00215s (4.96x faster than deterministic)\nAverage Kahan summation time         = 0.00448s (2.39x faster than deterministic)\nDeterministic value                  = 310844.700699143717685046795 (no discernable difference from actual)\nDistinct Kahan values                = 3\nDistinct Simple values               = 94\n\nParallel Long Double\n=========================================================\nAverage deterministic summation time = 0.01854s\nAverage simple summation time        = 0.00466s (3.97x faster than deterministic)\nAverage Kahan summation time         = 0.00635s (2.91x faster than deterministic)\nDeterministic value                  = 310844.700699143717685046795 (no discernable difference from actual)\nDistinct Kahan values                = 1\nDistinct Simple values               = 82\n
Run Code Online (Sandbox Code Playgroud)\n

讨论

\n

所以我们学了什么?从单精度结果中我们看到,使用双长度累加器使得传统的求和算法对于该数据集是可重现的,尽管对于任意数据集几乎肯定不会出现这种情况。尽管如此,这仍然是一种在有效时提高准确性的廉价方法。

\n

当传统求和使用的累加器与要相加的数字大小相同时,它的工作效果非常差:我们运行了 100 次测试,并得到了与传统求和几乎一样多的不同结果。

\n

另一方面,卡汉求和产生的不同结果非常少(因此它“更具”确定性)并且花费的时间大约是传统求和的两倍。

\n

确定性算法比简单求和长约 4 倍,比 Kahan 求和长约 1.5-3 倍,但对于 double 和 long double 类型,除了按位确定性之外,得到的答案大致相同(无论输入如何,它总是得到相同的答案)数据已排序)。

\n

然而,单精度浮点得到的答案非常糟糕,这与单精度常规求和和卡汉求和不同,后者与实际答案非常接近。为什么是这样?

\n

我们一直在研究的论文确定,如果输入中有n 个数字,并且我们使用k轮折叠(论文建议使用 2 轮,这就是我们在这里使用的),那么确定性和常规和将具有相似的误差范围假如

\n
n^k * e^(k-1) << 1\n
Run Code Online (Sandbox Code Playgroud)\n

其中e是数据类型的浮点 epsilon。这些 epsilon 值为:

\n
Single-precision epsilon      = 0.000000119\nDouble-precision epsilon      = 0.00000000000000022\nLong double-precision epsilon = 0.000000000000000000108\n
Run Code Online (Sandbox Code Playgroud)\n

将它们与n=1M一起代入方程,我们得到:

\n
Single condition      = 119000\nDouble condition      = 0.00022\nLong double condition = 0.000000108\n
Run Code Online (Sandbox Code Playgroud)\n

因此我们看到,对于这种大小的输入,单精度的性能会很差。

\n

另一点需要注意的是,虽然long double在我的机器上占用 16 个字节,但这仅用于对齐目的:用于计算的真实长度仅为 80 位。因此,两个长双精度数在数值上相等,但它们未使用的位是任意的。为了真正的按位再现性,需要确定未使用的位并将其设置为已知值。

\n

代码

\n
//Compile with:\n//g++ -g -O3 repro_vector.cpp -fopenmp\n\n//NOTE: Always comile with `-g`. It doesn\'t slow down your code, but does make\n//it debuggable and improves ease of profiling\n\n#include <algorithm>\n#include <bitset>               //Used for showing bitwise representations\n#include <cfenv>                //Used for setting floating-point rounding modes\n#include <chrono>               //Used for timing algorithms\n#include <climits>              //Used for showing bitwise representations\n#include <iostream>\n#include <omp.h>                //OpenMP\n#include <random>\n#include <stdexcept>\n#include <string>\n#include <typeinfo>\n#include <unordered_map>\n#include <vector>\n\nconstexpr int ROUNDING_MODE = FE_UPWARD;\nconstexpr int N = 1\'000\'000;\nconstexpr int TESTS = 100;\n\n// Simple timer class for tracking cumulative run time of the different\n// algorithms\nstruct Timer {\n  double total = 0;\n  std::chrono::high_resolution_clock::time_point start_time;\n\n  Timer() = default;\n\n  void start() {\n    start_time = std::chrono::high_resolution_clock::now();\n  }\n\n  void stop() {\n    const auto now = std::chrono::high_resolution_clock::now();\n    const auto time_span = std::chrono::duration_cast<std::chrono::duration<double>>(now - start_time);\n    total += time_span.count();\n  }\n};\n\n//Simple class to enable directed rounding in floating-point math and to reset\n//the rounding mode afterwards, when it goes out of scope\nstruct SetRoundingMode {\n  const int old_rounding_mode;\n\n  SetRoundingMode(const int mode) : old_rounding_mode(fegetround()) {\n    if(std::fesetround(mode)!=0){\n      throw std::runtime_error("Failed to set directed rounding mode!");\n    }\n  }\n\n  ~SetRoundingMode(){\n    if(std::fesetround(old_rounding_mode)!=0){\n      throw std::runtime_error("Failed to reset rounding mode to original value!");\n    }\n  }\n\n  static std::string get_rounding_mode_string() {\n    switch (fegetround()) {\n      case FE_DOWNWARD:   return "downward";\n      case FE_TONEAREST:  return "to-nearest";\n      case FE_TOWARDZERO: return "toward-zero";\n      case FE_UPWARD:     return "upward";\n      default:            return "unknown";\n    }\n  }\n};\n\n// Used to make showing bitwise representations somewhat more intuitive\ntemplate<class T>\nstruct binrep {\n  const T val;\n  binrep(const T val0) : val(val0) {}\n};\n\n// Display the bitwise representation\ntemplate<class T>\nstd::ostream& operator<<(std::ostream& out, const binrep<T> a){\n  const char* beg = reinterpret_cast<const char*>(&a.val);\n  const char *const end = beg + sizeof(a.val);\n  while(beg != end){\n    out << std::bitset<CHAR_BIT>(*beg++);\n    if(beg < end)\n      out << \' \';\n  }\n  return out;\n}\n\n\n\n//Simple serial summation algorithm with an accumulation type we can specify\n//to more fully explore its behaviour\ntemplate<class FloatType, class SimpleAccumType>\nFloatType serial_simple_summation(const std::vector<FloatType> &vec){\n  SimpleAccumType sum = 0;\n  for(const auto &x: vec){\n    sum += x;\n  }\n  return sum;\n}\n\n//Parallel variant of the simple summation algorithm above\ntemplate<class FloatType, class SimpleAccumType>\nFloatType parallel_simple_summation(const std::vector<FloatType> &vec){\n  SimpleAccumType sum = 0;\n  #pragma omp parallel for default(none) reduction(+:sum) shared(vec)\n  for(size_t i=0;i<vec.size();i++){\n    sum += vec[i];\n  }\n  return sum;\n}\n\n\n\n//Kahan\'s compensated summation algorithm for accurately calculating sums of\n//many numbers with O(1) error\ntemplate<class FloatType>\nFloatType serial_kahan_summation(const std::vector<FloatType> &vec){\n  FloatType sum = 0.0f;\n  FloatType c = 0.0f;\n  for (const auto &num: vec) {\n    const auto y = num - c;\n    const auto t = sum + y;\n    c = (t - sum) - y;\n    sum = t;\n  }\n  return sum;\n}\n\n//Parallel version of Kahan\'s compensated summation algorithm (could be improved\n//by better accounting for the compsenation during the reduction phase)\ntemplate<class FloatType>\nFloatType parallel_kahan_summation(const std::vector<FloatType> &vec){\n  //Parallel phase\n  std::vector<FloatType> sum(omp_get_max_threads(), 0);\n  FloatType c = 0;\n  #pragma omp parallel for default(none) firstprivate(c) shared(sum,vec)\n  for (size_t i=0;i<vec.size();i++) {\n    const auto tid = omp_get_thread_num();\n    const auto y = vec[i] - c;\n    const auto t = sum.at(tid) + y;\n    c = (t - sum[tid]) - y;\n    sum[tid] = t;\n  }\n\n  //Serial reduction phase\n\n  //This could be more accurate if it took the remaining compensation values\n  //from above into account\n  FloatType total_sum = 0.0f;\n  FloatType total_c = 0.0f;\n  for(const auto &num: sum){\n    const auto y = num - total_c;\n    const auto t = total_sum + y;\n    total_c = (t - total_sum) - y;\n    total_sum = t;\n  }\n\n  return total_sum;\n}\n\n\n\n// Error-free vector transformation. Algorithm 4 from Demmel and Nguyen (2013)\ntemplate<class FloatType>\nFloatType ExtractVectorNew2(\n  const FloatType M,\n  const typename std::vector<FloatType>::iterator &begin,\n  const typename std::vector<FloatType>::iterator &end\n){\n  // Should use the directed rounding mode of the parent thread\n\n  auto Mold = M;\n  for(auto v=begin;v!=end;v++){\n    auto Mnew = Mold + (*v);\n    auto q = Mnew - Mold;\n    (*v) -= q;\n    Mold = Mnew;\n  }\n\n  //This is the exact sum of high order parts q_i\n  //v is now the vector of low order parts r_i\n  return Mold - M;\n}\n\ntemplate<class FloatType>\nFloatType mf_from_deltaf(const FloatType delta_f){\n  const int power = std::ceil(std::log2(delta_f));\n  return static_cast<FloatType>(3.0) * std::pow(2, power);\n}\n\n//Implements the error bound discussed near Equation 6 of\n//Demmel and Nguyen (2013).\ntemplate<class FloatType>\nbool is_error_bound_appropriate(const size_t N, const int k){\n  const auto eps = std::numeric_limits<FloatType>::epsilon();\n  const auto ratio = std::pow(N, k) * std::pow(eps, k-1);\n  //If ratio << 1, then the conventional non-reproducible sum and the\n  //deterministic sum will have error bounds of the same order. We arbitrarily\n  //choose 1e-4 to represent this\n  return ratio < 1e-3;\n}\n\n//Serial bitwise deterministic summation.\n//Algorithm 8 from Demmel and Nguyen (2013).\ntemplate<class FloatType>\nFloatType serial_bitwise_deterministic_summation(\n  std::vector<FloatType> vec,  // Note that we\'re making a copy!\n  const int k\n){\n  constexpr FloatType eps = std::numeric_limits<FloatType>::epsilon();\n  const auto n = vec.size();\n  const auto adr = SetRoundingMode(ROUNDING_MODE);\n\n  if(n==0){\n    return 0;\n  }\n\n  if(!is_error_bound_appropriate<FloatType>(vec.size(), k)){\n    std::cout<<"WARNING! Error bounds of deterministic sum are large relative to conventional summation!"<<std::endl;\n  }\n\n  FloatType m = std::abs(vec.front());\n  for(const auto &x: vec){\n    m = std::max(m, std::abs(x));\n  }\n\n  FloatType delta_f = n * m / (1 - 4 * (n + 1) * eps);\n  FloatType Mf = mf_from_deltaf(delta_f);\n\n  std::vector<FloatType> Tf(k);\n  for(int f=0;f<k-1;f++){\n    Tf[f] = ExtractVectorNew2<FloatType>(Mf, vec.begin(), vec.end());\n    delta_f = n * (4 * eps * Mf / 3) / (1 - 4 * (n + 1) * eps);\n    Mf = mf_from_deltaf(delta_f);\n  }\n\n  FloatType M = Mf;\n  for(const FloatType &v: vec){\n    M += v;\n  }\n  Tf[k-1] = M - Mf;\n\n  FloatType T = 0;\n  for(const FloatType &tf: Tf){\n    T += tf;\n  }\n\n  return T;\n}\n\n//Parallel bitwise deterministic summation.\n//Algorithm 9 from Demmel and Nguyen (2013).\ntemplate<class FloatType>\nFloatType parallel_bitwise_deterministic_summation(\n  std::vector<FloatType> vec,  // Note that we\'re making a copy!\n  const int k\n){\n  constexpr FloatType eps = std::numeric_limits<FloatType>::epsilon();\n  const auto n = vec.size();\n  const auto adr = SetRoundingMode(ROUNDING_MODE);\n\n  if(n==0){\n    return 0;\n  }\n\n  if(!is_error_bound_appropriate<FloatType>(vec.size(), k)){\n    std::cout<<"WARNING! Error bounds of deterministic sum are large relative to conventional summation!"<<std::endl;\n  }\n\n  std::vector<FloatType> Tf(k);\n\n  // Note that this reduction would require communication if we had\n  // implemented this to work on multiple nodes\n  FloatType m = std::abs(vec.front());\n  #pragma omp parallel for default(none) reduction(max:m) shared(vec)\n  for(size_t i=0;i<vec.size();i++){\n    m = std::max(m, std::abs(vec[i]));\n  }\n\n  // Note that this reduction would require communication if we had\n  // implemented this to work on multiple nodes\n  #pragma omp declare reduction(vec_plus : std::vector<FloatType> : \\\n    std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), std::plus<FloatType>())) \\\n    initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))\n\n  #pragma omp parallel default(none) reduction(vec_plus:Tf) shared(k,m,n,vec,std::cout)\n  {\n    const auto adr = SetRoundingMode(ROUNDING_MODE);\n    const auto threads = omp_get_num_threads();\n    const auto tid = omp_get_thread_num();\n    const auto values_per_thread = n / threads;\n    const auto nlow = tid * values_per_thread;\n    const auto nhigh = (tid<threads-1) ? ((tid+1) * values_per_thread) : n;\n\n    FloatType delta_f = n * m / (1 - 4 * (n + 1) * eps);\n    FloatType Mf = mf_from_deltaf(delta_f);\n\n    for(int f=0;f<k-1;f++){\n      Tf[f] = ExtractVectorNew2<FloatType>(Mf, vec.begin() + nlow, vec.begin() + nhigh);\n      delta_f = n * (4 * eps * Mf / 3) / (1 - 4 * (n + 1) * eps);\n      Mf = mf_from_deltaf(delta_f);\n    }\n\n    FloatType M = Mf;\n    for(size_t i=nlow;i<nhigh;i++){\n      M += vec[i];\n    }\n    Tf[k-1] = M - Mf;\n  }\n\n  FloatType T = 0;\n  for(const FloatType &tf: Tf){\n    T += tf;\n  }\n\n  return T;\n}\n\n\n\n//Convenience wrappers\ntemplate<bool Parallel, class FloatType>\nFloatType bitwise_deterministic_summation(\n  const std::vector<FloatType> &vec,  // Note that we\'re making a copy!\n  const int k\n){\n  if(Parallel){\n    return parallel_bitwise_deterministic_summation<FloatType>(vec, k);\n  } else {\n    return serial_bitwise_deterministic_summation<FloatType>(vec, k);\n  }\n}\n\ntemplate<bool Parallel, class FloatType, class SimpleAccumType>\nFloatType simple_summation(const std::vector<FloatType> &vec){\n  if(Parallel){\n    return parallel_simple_summation<FloatType, SimpleAccumType>(vec);\n  } else {\n    return serial_simple_summation<FloatType, SimpleAccumType>(vec);\n  }\n}\n\ntemplate<bool Parallel, class FloatType>\nFloatType kahan_summation(const std::vector<FloatType> &vec){\n  if(Parallel){\n    return serial_kahan_summation<FloatType>(vec);\n  } else {\n    return parallel_kahan_summation<FloatType>(vec);\n  }\n}\n\n\n\n// Timing tests for the summation algorithms\ntemplate<bool Parallel, class FloatType, class SimpleAccumType>\nFloatType PerformTestsOnData(\n  const int TESTS,\n  std::vector<FloatType> floats, //Make a copy so we use the same data for each test\n  std::mt19937 gen               //Make a copy so we use the same data for each test\n){\n  Timer time_deterministic;\n  Timer time_kahan;\n  Timer time_simple;\n\n  //Very precise output\n  std::cout.precision(std::numeric_limits<FloatType>::max_digits10);\n  std::cout<<std::fixed;\n\n  std::cout<<"Parallel? "<<Parallel<<std::endl;\n  if(Parallel){\n    std::cout<<"Max threads = "<<omp_get_max_threads()<<std::endl;\n  }\n  std::cout<<"Floating type                        = "<<typeid(FloatType).name()<<std::endl;\n  std::cout<<"Floating type epsilon                = "<<std::numeric_limits<FloatType>::epsilon()<<std::endl;\n  std::cout<<"Simple summation accumulation type   = "<<typeid(SimpleAccumType).name()<<std::endl;\n  std::cout<<"Number of tests                      = "<<TESTS<<std::endl;\n  std::cout<<"Input sample = "<<std::endl;\n  for(size_t i=0;i<10;i++){\n    std::cout<<"\\t"<<floats[i]<<std::endl;\n  }\n\n  //Get a reference value\n  std::unordered_map<FloatType, uint32_t> simple_sums;\n  std::unordered_map<FloatType, uint32_t> kahan_sums;\n  const auto ref_val = bitwise_deterministic_summation<Parallel, FloatType>(floats, 2);\n  for(int test=0;test<TESTS;test++){\n    std::shuffle(floats.begin(), floats.end(), gen);\n\n    time_deterministic.start();\n    const auto my_val = bitwise_deterministic_summation<Parallel, FloatType>(floats, 2);\n    time_deterministic.stop();\n    if(ref_val!=my_val){\n      std::cout<<"ERROR: UNEQUAL VALUES ON TEST #"<<test<<"!"<<std::endl;\n      std::cout<<"Reference      = "<<ref_val                   <<std::endl;\n      std::cout<<"Current        = "<<my_val                    <<std::endl;\n      std::cout<<"Reference bits = "<<binrep<FloatType>(ref_val)<<std::endl;\n      std::cout<<"Current   bits = "<<binrep<FloatType>(my_val) <<std::endl;\n      throw std::runtime_error("Values were not equal!");\n    }\n\n    time_kahan.start();\n    const auto kahan_sum = kahan_summation<Parallel, FloatType>(floats);\n    kahan_sums[kahan_sum]++;\n    time_kahan.stop();\n\n    time_simple.start();\n    const auto simple_sum = simple_summation<Parallel, FloatType, SimpleAccumType>(floats);\n    simple_sums[simple_sum]++;\n    time_simple.stop();\n  }\n\n  std::cout<<"Average deterministic summation time = "<<(time_deterministic.total/TESTS)<<std::endl;\n  std::cout<<"Average simple summation time        = "<<(time_simple.total/TESTS)<<std::en

  • 这里使用的数字分布不清楚,可能无法模拟实际应用;`std::uniform_real_distribution` 未指定。该标准将其指定为实数上的均匀分布(概率密度在一个区间内是恒定的),但忽略了浮点数在该区间内不是均匀密集的事实。生成“均匀”分布的常见方法(生成整数并缩放到区间)忽略了更精细(较低指数)的浮点数,因此可能无法模拟实际结果。 (2认同)

Ric*_*ard 1

概述:单遍数据保存方法

\n

是的!

\n

Ahrens、Demmel 和 Nguyen (2020) 的论文“高效可再现浮点求和算法”中描述了一种实现此目的的方法。我在下面包含了它的实现。

\n

这里我们将比较三种算法:

\n
    \n
  1. 传统的从左到右求和:对于输入顺序的排列而言,这种求和速度快、不准确且不可重现。
  2. \n
  3. Kahan 求和:速度较慢,非常准确(在输入大小上基本上是O(1)),并且虽然在输入顺序的排列方面不可再现,但更接近狭义上的再现性。
  4. \n
  5. 可重现的求和:这比 Kahan 求和稍微慢一些,可以相当准确,并且可以按位重现。
  6. \n
\n

传统的求和是不准确的,因为随着总和的增加,我们添加的数字的最低有效位会默默地被丢弃。卡汉求和通过保持运行的“补偿”总和来克服这个问题,该总和保留在最低有效位上。可再现求和使用类似的策略来保持准确性,但维护与不同补偿水平相对应的多个仓。这些箱还适当地截断输入数据的重要性以获得准确、可重复的总和。

\n

第一次测试

\n

为了研究算法,我们将对 100 万个数字进行测试,每个数字都是从 [-1000, 1000] 范围内均匀抽取的。为了展示算法的答案范围及其再现性,输入将被打乱并求和 100 次。

\n

我将断言的结果是,对于所研究的每种数据类型,确定性算法对于 100 个数据排序中的每一个都会产生相同的结果,下面的代码将严格演示这一结果。

\n

我的实现包括一个类,可将累加器值传递给它,但有多种方法可以传递输入数据。我们将尝试其中三个:

\n
    \n
  • 1ata测试一种模式,在该模式中,每次将一个数字传递到累加器,而无需事先了解输入的最大幅度
  • \n
  • 许多测试一种模式,其中许多数字被传递到累加器,但事先不知道输入的最大幅度
  • \n
  • Manyc测试一种模式,其中许多数字被传递到累加器,并且我们事先知道输入的最大幅度
  • \n
\n

各种算法的结果如下:

\n
Single-Precision Floats\n==================================================================\nAverage Reproducible sum 1ata time   = 0.0115s (11.5x simple; 2.6x Kahan)\nAverage Reproducible sum many time   = 0.0047s (4.7x simple; 1.1x Kahan)\nAverage Reproducible sum manyc time  = 0.0036s (3.6x simple; 0.8x Kahan)\nAverage Kahan summation time         = 0.0044s\nAverage simple summation time        = 0.0010s\nReproducible Value                   = 767376.500000000 (0% diff vs Kahan)\nKahan long double accumulator value  = 767376.500000000\nError bound on RV                    = 15.542840004\nDistinct Kahan (single accum) values = 1\nDistinct Simple values               = 92\n\n\nDouble-Precision Floats\n==================================================================\nAverage Reproducible sum 1ata time   = 0.0112s (10.5x simple; 2.6x Kahan)\nAverage Reproducible sum many time   = 0.0051s (4.7x simple; 1.2x Kahan)\nAverage Reproducible sum manyc time  = 0.0037s (3.4x simple; 0.8x Kahan)\nAverage simple summation time        = 0.0011s\nAverage Kahan summation time         = 0.0044s\nReproducible Value                   = 310844.70069914369378239 (0% diff vs Kahan)\nKahan long double accumulator value  = 310844.70069914369378239\nError bound on RV                    = 0.00000000048315059\nDistinct Kahan (double accum) values = 2\nDistinct Simple values               = 96\n
Run Code Online (Sandbox Code Playgroud)\n

第二次测试

\n

让我们尝试比均匀随机更具挑战性的测试!相反,我们在 [0, 2\xcf\x80) 范围内生成 100 万个数据点,并使用它们生成正弦波的 y 值。和以前一样,我们将这些数据点打乱 100 次,看看我们得到什么类型的输出。计时值与上面的非常相似,因此我们将其省略。这给了我们:

\n
Single-Precision Floats\n==================================================================\nReproducible Value                   = 0.000000000\nKahan long double accumulator value  = -0.000000000\nError bound                          = 14.901161194\nDistinct Kahan (single) values       = 96\nDistinct Simple values               = 100\nKahan lower value                    = -0.000024229\nKahan upper value                    = 0.000025988\nSimple lower value                   = -0.017674208\nSimple upper value                   = 0.018800795\nKahan range                          = 0.000050217\nSimple range                         = 0.036475003\nSimple range / Kahan range           = 726\n\nDouble-Precision Floats\n==================================================================\nReproducible Value                   = 0.00000000000002185\nKahan long double accumulator value  = 0.00000000000002185\nError bound                          = 0.00000000000000083\nDistinct Kahan (double) values       = 93\nDistinct Simple values               = 100\nKahan lower value                    = 0.00000000000000017\nKahan upper value                    = 0.00000000000006306\nSimple lower value                   = -0.00000000004814216\nSimple upper value                   = 0.00000000002462563\nKahan range                          = 0.00000000000006289\nSimple range                         = 0.00000000007276779\nSimple range / Kahan range           = 1157\n
Run Code Online (Sandbox Code Playgroud)\n

在这里,我们看到可再现值与我们用作事实来源的长双卡汉求和值完全匹配。

\n

与我们之前更简单的测试结果相比,有许多不同的 Kahan 求和值(其中 Kahan 求和使用与数据相同的浮点类型);使用更困难的数据集破坏了我们在更简单的测试中在卡汉求和中观察到的“部分决定论”,尽管卡汉求和产生的值范围比通过简单求和获得的值小几个数量级。

\n

因此,在这种情况下,可再现求和比 Kahan 求和具有更好的精度,并且产生单个按位可再现结果,而不是约 100 个不同的结果。

\n

成本

\n

时间。可重复求和的时间比简单求和长约 3.5 倍,与 Kahan 求和的时间大致相同。

\n

记忆。使用 3 倍合并需要6*sizeof(floating_type)累加器的空间。即单精度为 24 字节,双精度为 48 字节。

\n

此外,还需要一个静态查找表,该表可以在给定数据类型和折叠的所有累加器之间共享。对于 3 倍装箱,该表为 41 个浮点型(164 字节)或 103 个双精度型(824 字节)。

\n

讨论

\n

所以我们学了什么?

\n

标准求和给出了广泛的结果。我们运行了 100 次测试,得到了 96 个双精度类型的独特结果。这显然是不可重现的,并且很好地证明了我们正在尝试解决的问题。

\n

另一方面,卡汉求和产生的不同结果非常少(因此它“更具”确定性)并且花费的时间大约是传统求和的 4 倍。

\n

如果我们对要相加的数字一无所知,那么可重复的算法需要比简单求和长约 10 倍的时间;然而,如果我们知道被加数的最大幅度,那么可重现的算法只需要比简单求和长约 3.5 倍的时间。

\n

与卡汉求和相比,可重现的算法有时需要更少的时间(!),同时确保我们得到按位相同的答案。此外,与此答案中详细介绍的方法不同,在我们的简单测试中,可重现的结果与单精度和双精度的 Kahan 求和结果相匹配。我们还得到了可重复性和卡汉结果相对误差非常低的有力保证。

\n

什么折叠级别最好?3、如下图所示。

\n

时间和错误与折叠

\n

Fold-1 非常糟糕。对于此测试,2 对于 double 具有良好的精度,但对于 float 则有 10% 的相对误差。Fold-3 为 double 和 float 提供了良好的精度,但时间增加较小。因此,我们只希望对双打进行一次一次可重复求和的fold-2。高于 Fold-3 只能带来边际效益。

\n

代码

\n

完整代码太长,无法在此包含;但是,您可以在这里找到 C++ 版本;和这里的ReproBLAS 。

\n