如何快速将一个大整数分成一个单词？

Question

如何快速将一个大整数分成一个单词？

Pav*_*Dev 5 c++ algorithm biginteger integer-division c++20

我目前正在开发一个类来处理大的无符号整数。但是，我需要不完整的功能，即：

\n

bi_uint+=bi_uint- 已经实施。没什么好抱怨的。
bi_uint*=std::uint_fast64_t- 已经实施。没什么好抱怨的。
bi_uint/=std::uint_fast64_t- 已实现，但工作速度非常慢，还需要两倍宽的类型uint_fast64_t。在测试用例中，除法比乘法慢35倍

\n

接下来，我将给出我的除法实现，它基于一个简单的长除法算法：

\n

#include <climits>\n#include <cstdint>\n#include <limits>\n#include <vector>\n\nclass bi_uint\n{\npublic:\n    using u64_t = std::uint_fast64_t;\n    constexpr static std::size_t u64_bits = CHAR_BIT * sizeof(u64_t);\n    using u128_t = unsigned __int128;\n    static_assert(sizeof(u128_t) >= sizeof(u64_t) * 2);\n\n    //little-endian\n    std::vector<u64_t> data;\n\n    //User should guarantee data.size()>0 and val>0\n    void self_div(const u64_t val)\n    {\n        auto it = data.rbegin();\n\n        if(data.size() == 1) {\n            *it /= val;\n            return;\n        }    \n        \n        u128_t rem = 0;\n        if(*it < val) {\n            rem = *it++;\n            data.pop_back();\n        }\n\n        u128_t r = rem % val;\n        while(it != data.rend()) {\n            rem = (r << u64_bits) + *it;\n            const u128_t q = rem / val;\n            r = rem % val;\n            *it++ = static_cast<u64_t>(q);\n        }\n    }\n};\n

Run Code Online (Sandbox Code Playgroud)\n

您可以看到使用了该unsigned __int128类型，因此，此选项不可移植，并且与单个编译器 - GCC 绑定，并且还需要 x64 平台。

\n

阅读有关除法算法的页面后，我觉得合适的算法是“牛顿-拉夫森除法”。然而，“Newton\xe2\x80\x93Raphson 除法”算法对我来说似乎很复杂。我想有一个更简单的算法来划分类型“big_uint/uint”，它具有几乎相同的性能。

\n

问：如何快速将a除以bi_uinta u64_t？

\n

我有大约 10^6 次迭代，每次迭代都使用列出的所有操作

\n

如果这很容易实现，那么我希望具有可移植性而不是使用unsigned __int128. 否则，我宁愿放弃可移植性，而选择更简单的方法。

\n

编辑1： \n这是一个学术项目，我无法使用第三方库。

\n

Answer 1

Art*_*oul 5

第 1 部分（参见下文第 2 部分）

我成功地使用Barrett Reduction在我的旧笔记本电脑上将除法代码加速了5 倍（甚至在 GodBolt 服务器上加速了7.5倍），这是一种允许用多次乘法和加法代替单个除法的技术。今天刚刚从 sctracth 实现了整个代码。

如果您愿意，可以直接跳转到我的文章末尾的代码位置，而无需阅读长篇描述，因为代码完全可以运行，无需任何知识或依赖性。

下面的代码仅适用于 Intel x64，因为我仅使用 Intel指令及其 64 位变体。当然，它也可以为 x32 和其他处理器重写，因为 Barrett 算法是通用的。

为了用简短的伪代码解释整个 Barrett 约简，我将用 Python 编写它，因为它是适合可理解伪代码的最简单语言：

# https://www.nayuki.io/page/barrett-reduction-algorithm

def BarrettR(n, s):
    return (1 << s) // n

def BarrettDivMod(x, n, r, s):
    q = (x * r) >> s
    t = x - q * n
    return (q, t) if t < n else (q + 1, t - n)

Run Code Online (Sandbox Code Playgroud)

基本上，在上面的伪代码中，BarrettR()对于相同的除数仅执行一次（您对整个大整数除法使用相同的单字除数）。BarrettDivMod()每次当你想要进行除法或模数运算时，基本上给定输入x和除数，n它返回 tuple (x / n, x % n)，没有别的，但它比常规除法指令更快。

在下面的 C++ 代码中，我实现了 Barrett 的两个相同函数，但做了一些 C++ 特定的优化以使其更快。优化是可能的，因为除数n始终是 64 位，x是 128 位，但上半部分总是小于n（最后一个假设发生是因为大整数除法中的上半部分始终是余数模数n）。

Barrett 算法适用于n不是 2 的幂的除数，因此0, 1, 2, 4, 8, 16, ...不允许使用类似的除数。这种简单的除数情况可以通过对大整数进行右位移来解决，因为除以 2 的幂只是一个位移。任何其他除数都是允许的，甚至包括不是 2 的幂的除数。

另外值得注意的是， myBarrettDivMod()只接受x严格小于的股息divisor * 2^64，换句话说，128 位股息的上半部分x应该小于除数。对于大整数除法函数来说，这始终是正确的，因为上半部分始终是余数模除数。该规则应该由您检查，它仅在我的调试断言中x检查，并在发布中删除。BarrettDivMod()

您可以注意到BarrettDivMod()有两个大分支，这是同一算法的两个变体，第一个仅使用 CLang/GCC 类型unsigned __int128，第二个仅使用 64 位指令，因此适合 MSVC。

我尝试针对三个编译器CLang / GCC / MSVC，但有些情况下，Barrett 的 MSVC 版本仅快了 2 倍，而 CLang/GCC 则快了 5 倍。也许我在 MSVC 代码中犯了一些错误。

您可以看到我使用您的类bi_uint来测量两个版本的代码的时间 - 使用常规除法和 Barrett。重要的是要注意，我对您的代码进行了相当大的更改，首先不使用u128（以便 MSVC 版本编译没有 u128），其次不修改data向量，因此它只读取除法并且不存储最终结果（此读 -我只需要非常快地运行速度测试，而无需data在每次测试迭代时复制向量）。所以你的代码在我的代码片段中被破坏了，它不能复制粘贴以立即使用，我只使用你的代码进行速度测量。

Barrett 归约的速度更快，不仅因为除法比乘法慢，而且因为现代 CPU 上的乘法和加法都可以很好地流水线化，现代 CPU 可以在一个周期内执行多个 mul 或 add 指令，但前提是这几个 mul/add 指令不执行不依赖于彼此的结果，换句话说CPU可以在一个周期内并行运行多条指令。据我所知，除法不能并行运行，因为CPU内只有单个模块可以进行除法，但它仍然有点流水线化，因为在第一个除法完成50％之后，第二个除法可以并行开始在CPU管道的开始处。

在某些计算机上，您可能会注意到常规除法版本有时会慢得多，发生这种情况是因为 CLang/GCC 会回退到基于库的除法软实现，即使对于 128 位除法也是如此。在这种情况下，我的 Barrett 可能会显示出偶7-10x数倍的加速，因为它不使用库函数。

为了克服上述问题，关于软除法，最好直接使用指令添加汇编代码DIV，或者在编译器中找到一些实现此功能的内部函数（我认为 CLang/GCC 有这样的内部函数）。如果需要的话我也可以编写这个程序集实现，只需在评论中告诉我即可。

更新。正如所承诺的，为 CLang/GCC 实现了 128 位除法的汇编变体，函数UDiv128Asm()。在此更改之后，它被用作 CLang/GCC 128 除法的主要实现，而不是常规的u128(a) / b. #if 0您可以通过替换为函数#if 1内部来返回常规的 u128 实现UDiv128()。

在线尝试一下！

#include <cstdint>
#include <bit>
#include <stdexcept>
#include <string>

#include <immintrin.h>

#if defined(_MSC_VER) && !defined(__clang__)
    #define IS_MSVC 1
#else
    #define IS_MSVC 0
#endif

#if IS_MSVC
    #include <intrin.h>
#endif

#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg: '" + std::string(msg) + "'."); }
#define ASSERT(cond) ASSERT_MSG(cond, "")
#ifdef _DEBUG
    #define DASSERT_MSG(cond, msg) ASSERT_MSG(cond, msg)
#else
    #define DASSERT_MSG(cond, msg)
#endif 
#define DASSERT(cond) DASSERT_MSG(cond, "")

using u16 = uint16_t;
using u32 = uint32_t;
using i64 = int64_t;
using u64 = uint64_t;
using UllPtr = unsigned long long *;

inline int GetExp(double x) {
    return int((std::bit_cast<uint64_t>(x) >> 52) & 0x7FF) - 1023;
}

inline size_t BitSizeWrong(uint64_t x) {
    return x == 0 ? 0 : (GetExp(x) + 1);
}

inline size_t BitSize(u64 x) {
    size_t r = 0;
    if (x >= (u64(1) << 32)) {
        x >>= 32;
        r += 32;
    }
    while (x >= 0x100) {
        x >>= 8;
        r += 8;
    }
    while (x) {
        x >>= 1;
        ++r;
    }
    return r;
}

#if !IS_MSVC
inline u64 UDiv128Asm(u64 h, u64 l, u64 d, u64 * r) {
    u64 q;
    asm (R"(
        .intel_syntax
        mov rdx, %V[h]
        mov rax, %V[l]
        div %V[d]
        mov %V[r], rdx
        mov %V[q], rax
    )"
        : [q] "=r" (q), [r] "=r" (*r)
        : [h] "r" (h), [l] "r" (l), [d] "r" (d)
        : "rax", "rdx"
    );
    return q;
}
#endif

inline std::pair<u64, u64> UDiv128(u64 hi, u64 lo, u64 d) {
    #if IS_MSVC
        u64 r, q = _udiv128(hi, lo, d, &r);
        return std::make_pair(q, r);
    #else
        #if 0
            using u128 = unsigned __int128;
            auto const dnd = (u128(hi) << 64) | lo;
            return std::make_pair(u64(dnd / d), u64(dnd % d));
        #else
            u64 r, q = UDiv128Asm(hi, lo, d, &r);
            return std::make_pair(q, r);
        #endif
    #endif
}

inline std::pair<u64, u64> UMul128(u64 a, u64 b) {
    #if IS_MSVC
        u64 hi, lo = _umul128(a, b, &hi);
        return std::make_pair(hi, lo);
    #else
        using u128 = unsigned __int128;
        auto const x = u128(a) * b;
        return std::make_pair(u64(x >> 64), u64(x));
    #endif
}

inline std::pair<u64, u64> USub128(u64 a_hi, u64 a_lo, u64 b_hi, u64 b_lo) {
    u64 r_hi, r_lo;
    _subborrow_u64(_subborrow_u64(0, a_lo, b_lo, (UllPtr)&r_lo), a_hi, b_hi, (UllPtr)&r_hi);
    return std::make_pair(r_hi, r_lo);
}

inline std::pair<u64, u64> UAdd128(u64 a_hi, u64 a_lo, u64 b_hi, u64 b_lo) {
    u64 r_hi, r_lo;
    _addcarry_u64(_addcarry_u64(0, a_lo, b_lo, (UllPtr)&r_lo), a_hi, b_hi, (UllPtr)&r_hi);
    return std::make_pair(r_hi, r_lo);
}

inline int UCmp128(u64 a_hi, u64 a_lo, u64 b_hi, u64 b_lo) {
    if (a_hi != b_hi)
        return a_hi < b_hi ? -1 : 1;
    return a_lo == b_lo ? 0 : a_lo < b_lo ? -1 : 1;
}

std::pair<u64, size_t> BarrettRS64(u64 n) {
    // https://www.nayuki.io/page/barrett-reduction-algorithm
    ASSERT_MSG(n >= 3 && (n & (n - 1)) != 0, "n " + std::to_string(n))
    size_t const nbits = BitSize(n);
    // 2^s = q * n + r;   2^s = (2^64 + q0) * n + r;  2^s - n * 2^64 = q0 * n + r
    u64 const dnd_hi = (nbits >= 64 ? 0ULL : (u64(1) << nbits)) - n;
    auto const q0 = UDiv128(dnd_hi, 0, n).first;
    return std::make_pair(q0, nbits);
}

template <bool Use128 = true, bool Adjust = true>
std::pair<u64, u64> BarrettDivMod64(u64 x_hi, u64 x_lo, u64 n, u64 r, size_t s) {
    // ((x_hi * 2^64 + x_lo) * (2^64 + r)) >> (64 + s)
    
    DASSERT(x_hi < n);
    
    #if !IS_MSVC
    if constexpr(Use128) {
        using u128 = unsigned __int128;
        u128 const xf = (u128(x_hi) << 64) | x_lo;
        u64 q = u64((u128(x_hi) * r + xf + u64((u128(x_lo) * r) >> 64)) >> s);
        if (s < 64) {
            u64 t = x_lo - q * n;
            if constexpr(Adjust) {
                u64 const mask = ~u64(i64(t - n) >> 63);
                q += mask & 1;
                t -= mask & n;
            }
            return std::make_pair(q, t);
        } else {
            u128 t = xf - u128(q) * n;
            return t < n ? std::make_pair(q, u64(t)) : std::make_pair(q + 1, u64(t) - n);
        }
    } else
    #endif
    {
        auto const w1a = UMul128(x_lo, r).first;
        auto const [w2b, w1b] = UMul128(x_hi, r);
        auto const w2c = x_hi, w1c = x_lo;
        u64 w1, w2 = _addcarry_u64(0, w1a, w1b, (UllPtr)&w1);
        w2 += _addcarry_u64(0, w1, w1c, (UllPtr)&w1);
        w2 += w2b + w2c;
        
        if (s < 64) {
            u64 q = (w2 << (64 - s)) | (w1 >> s);
            u64 t = x_lo - q * n;
            if constexpr(Adjust) {
                u64 const mask = ~u64(i64(t - n) >> 63);
                q += mask & 1;
                t -= mask & n;
            }
            return std::make_pair(q, t);
        } else {
            u64 const q = w2;
            auto const [b_hi, b_lo] = UMul128(q, n);
            auto const [t_hi, t_lo] = USub128(x_hi, x_lo, b_hi, b_lo);
            return t_hi != 0 || t_lo >= n ? std::make_pair(q + 1, t_lo - n) : std::make_pair(q, t_lo);
        }
    }
}

#include <random>
#include <iomanip>
#include <iostream>
#include <chrono>

void TestBarrett() {
    std::mt19937_64 rng{123}; //{std::random_device{}()};
    for (size_t i = 0; i < (1 << 11); ++i) {
        size_t const nbits = rng() % 63 + 2;
        u64 n = 0;
        do {
            n = (u64(1) << (nbits - 1)) + rng() % (u64(1) << (nbits - 1));
        } while (!(n >= 3 && (n & (n - 1)) != 0));
        auto const [br, bs] = BarrettRS64(n);
        for (size_t j = 0; j < (1 << 6); ++j) {
            u64 const hi = rng() % n, lo = rng();
            auto const [ref_q, ref_r] = UDiv128(hi, lo, n);
            u64 bar_q = 0, bar_r = 0;
            for (size_t k = 0; k < 2; ++k) {
                bar_q = 0; bar_r = 0;
                if (k == 0)
                    std::tie(bar_q, bar_r) = BarrettDivMod64<true>(hi, lo, n, br, bs);
                else
                    std::tie(bar_q, bar_r) = BarrettDivMod64<false>(hi, lo, n, br, bs);
                ASSERT_MSG(bar_q == ref_q && bar_r == ref_r, "i " + std::to_string(i) + ", j " + std::to_string(j) + ", k " + std::to_string(k) +
                    ", nbits " + std::to_string(nbits) + ", n " + std::to_string(n) + ", bar_q " + std::to_string(bar_q) +
                    ", ref_q " + std::to_string(ref_q) + ", bar_r " + std::to_string(bar_r) + ", ref_r " + std::to_string(ref_r));
            }
        }
    }
}

class bi_uint
{
public:
    using u64_t = std::uint64_t;
    constexpr static std::size_t u64_bits = 8 * sizeof(u64_t);

    //little-endian
    std::vector<u64_t> data;

    static auto constexpr DefPrep = [](auto n){
        return std::make_pair(false, false);
    };
    static auto constexpr DefDivMod = [](auto dnd_hi, auto dnd_lo, auto dsr, auto const & prep){
        return UDiv128(dnd_hi, dnd_lo, dsr);
    };
    
    //User should guarantee data.size()>0 and val>0
    template <typename PrepT = decltype(DefPrep), typename DivModT = decltype(DefDivMod)>
    void self_div(const u64_t val, PrepT const & Prep = DefPrep, DivModT const & DivMod = DefDivMod)
    {
        auto it = data.rbegin();

        if(data.size() == 1) {
            *it /= val;
            return;
        }    
        
        u64_t rem_hi = 0, rem_lo = 0;
        if(*it < val) {
            rem_lo = *it++;
            //data.pop_back();
        }
        
        auto const prep = Prep(val);
        
        u64_t r = rem_lo % val;
        u64_t q = 0;
        while(it != data.rend()) {
            rem_hi = r;
            rem_lo = *it;
            std::tie(q, r) = DivMod(rem_hi, rem_lo, val, prep);
            //*it++ = static_cast<u64_t>(q);
            it++;
            auto volatile out = static_cast<u64_t>(q);
        }
    }
};

void TestSpeed() {
    auto Time = []{
        static auto const gtb = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::duration<double>>(
            std::chrono::high_resolution_clock::now() - gtb).count();
    };
    std::mt19937_64 rng{123};
    std::vector<u64> limbs, divisors;
    for (size_t i = 0; i < (1 << 17); ++i)
        limbs.push_back(rng());
    for (size_t i = 0; i < (1 << 8); ++i) {
        size_t const nbits = rng() % 63 + 2;
        u64 n = 0;
        do {
            n = (u64(1) << (nbits - 1)) + rng() % (u64(1) << (nbits - 1));
        } while (!(n >= 3 && (n & (n - 1)) != 0));
        divisors.push_back(n);
    }
    std::cout << std::fixed << std::setprecision(3);
    double div_time = 0;
    {
        bi_uint x;
        x.data = limbs;
        auto const tim = Time();
        for (auto dsr: divisors)
            x.self_div(dsr);
        div_time = Time() - tim;
        std::cout << "Divide time " << div_time << " sec" << std::endl;
    }
    {
        bi_uint x;
        x.data = limbs;
        for (size_t i = 0; i < 2; ++i) {
            if (IS_MSVC && i == 0)
                continue;
            auto const tim = Time();
            for (auto dsr: divisors)
                x.self_div(dsr, [](auto n){ return BarrettRS64(n); },
                    [i](auto dnd_hi, auto dnd_lo, auto dsr, auto const & prep){
                        return i == 0 ? BarrettDivMod64<true>(dnd_hi, dnd_lo, dsr, prep.first, prep.second) :
                            BarrettDivMod64<false>(dnd_hi, dnd_lo, dsr, prep.first, prep.second);
                    });
            double const bar_time = Time() - tim;
            std::cout << "Barrett" << (i == 0 ? "128" : "64 ") << " time " << bar_time << " sec, boost " << div_time / bar_time << std::endl;
        }
    }
}

int main() {
    try {
        TestBarrett();
        TestSpeed();
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    }
}

Run Code Online (Sandbox Code Playgroud)

输出：

Divide time 3.171 sec
Barrett128 time 0.675 sec, boost 4.695
Barrett64  time 0.642 sec, boost 4.937

Run Code Online (Sandbox Code Playgroud)

第2部分

由于你有一个非常有趣的问题，在我第一次发表这篇文章几天后，我决定从头开始实现所有大整数数学。

下面的代码实现了+, -, *, /, <<, >>自然大数（正整数）和+, -, *, /浮点大数的数学运算。两种类型的数字都是任意大小（甚至数百万位）。除了您要求的之外，我还完全实现了Newton-Raphson（平方和立方变体）和Goldschmidt快速除法算法。

以下是仅适用于 Newton-Raphson/Golschmidt 函数的代码片段，由于剩余代码非常大，因此在外部服务器上链接如下：

Divide time 3.171 sec
Barrett128 time 0.675 sec, boost 4.695
Barrett64  time 0.642 sec, boost 4.937

Run Code Online (Sandbox Code Playgroud)

请参见Output:下文，它将显示 Newton-Raphson 和 Goldschmidt 方法实际上比常规学校成绩（在输出中调用）算法慢10Reference倍。这 3 种高级算法之间的速度大致相同。如果使用快速傅里叶变换进行乘法， Raphson/Goldschmidt 可能会更快，因为两个大数的乘法占用了这些算法 95% 的时间。在下面的代码中，Raphson/Goldschmidt 算法的所有结果不仅是时间测量的，而且还检查了与学校成绩（参考）算法相比结果的正确性（请参见diff 2^...控制台输出，这显示了结果与学校成绩相比有多大差异）。

完整源代码在这里。完整的代码是如此之大，以至于由于每篇文章 30 000 个字符的限制，它不适合这个 StackOverflow，尽管我专门为这篇文章从头开始编写了这段代码。这就是为什么提供外部下载链接（PasteBin 服务器），也单击Try it online!下面的链接，它是在 GodBolt 的 Linux 服务器上实时运行的相同代码副本：

在线尝试一下！

输出：

==========     1 K bits   ==========
Reference    0.000029 sec
Raphson2     0.000066 sec, boost 0.440x, diff 2^-8192
Raphson3     0.000092 sec, boost 0.317x, diff 2^-8192
Goldschmidt  0.000080 sec, boost 0.365x, diff 2^-1022

==========     2 K bits   ==========
Reference    0.000071 sec
Raphson2     0.000177 sec, boost 0.400x, diff 2^-16384
Raphson3     0.000283 sec, boost 0.250x, diff 2^-16384
Goldschmidt  0.000388 sec, boost 0.182x, diff 2^-2046

==========     4 K bits   ==========
Reference    0.000319 sec
Raphson2     0.000875 sec, boost 0.365x, diff 2^-4094
Raphson3     0.001122 sec, boost 0.285x, diff 2^-32768
Goldschmidt  0.000881 sec, boost 0.362x, diff 2^-32768

==========     8 K bits   ==========
Reference    0.000484 sec
Raphson2     0.002281 sec, boost 0.212x, diff 2^-65536
Raphson3     0.002341 sec, boost 0.207x, diff 2^-65536
Goldschmidt  0.002432 sec, boost 0.199x, diff 2^-8189

==========    16 K bits   ==========
Reference    0.001199 sec
Raphson2     0.009042 sec, boost 0.133x, diff 2^-16382
Raphson3     0.009519 sec, boost 0.126x, diff 2^-131072
Goldschmidt  0.009047 sec, boost 0.133x, diff 2^-16380

==========    32 K bits   ==========
Reference    0.004311 sec
Raphson2     0.039151 sec, boost 0.110x, diff 2^-32766
Raphson3     0.041058 sec, boost 0.105x, diff 2^-262144
Goldschmidt  0.045517 sec, boost 0.095x, diff 2^-32764

==========    64 K bits   ==========
Reference    0.016273 sec
Raphson2     0.165656 sec, boost 0.098x, diff 2^-524288
Raphson3     0.210301 sec, boost 0.077x, diff 2^-65535
Goldschmidt  0.208081 sec, boost 0.078x, diff 2^-65534

==========   128 K bits   ==========
Reference    0.059469 sec
Raphson2     0.725865 sec, boost 0.082x, diff 2^-1048576
Raphson3     0.735530 sec, boost 0.081x, diff 2^-1048576
Goldschmidt  0.703991 sec, boost 0.084x, diff 2^-131069

==========   256 K bits   ==========
Reference    0.326368 sec
Raph

归档时间：	3 年，10 月前
查看次数：	549 次
最近记录：	3 年，9 月前