更快的16位乘法算法,适用于8位MCU

Question

更快的16位乘法算法,适用于8位MCU

Sir*_*ack 21 c c++ algorithm microcontroller avr

我正在寻找一种算法来乘以两个比下面更好的整数.你有一个好主意吗？(MCU - AT Tiny 84/85或类似 - 此代码运行的地方没有mul/div运算符)

uint16_t umul16_(uint16_t a, uint16_t b)
{
    uint16_t res=0;

    while (b) {
        if ( (b & 1) )
            res+=a;
        b>>=1;
        a+=a;
    }

    return res;
}

Run Code Online (Sandbox Code Playgroud)

当使用avr-gcc编译器为AT Tiny 85/84编译时,该算法几乎与avr-gcc生成的算法__mulhi3相同.

avr-gcc算法:

00000106 <__mulhi3>:
 106:   00 24           eor r0, r0
 108:   55 27           eor r21, r21
 10a:   04 c0           rjmp    .+8         ; 0x114 <__mulhi3+0xe>
 10c:   08 0e           add r0, r24
 10e:   59 1f           adc r21, r25
 110:   88 0f           add r24, r24
 112:   99 1f           adc r25, r25
 114:   00 97           sbiw    r24, 0x00   ; 0
 116:   29 f0           breq    .+10        ; 0x122 <__mulhi3+0x1c>
 118:   76 95           lsr r23
 11a:   67 95           ror r22
 11c:   b8 f3           brcs    .-18        ; 0x10c <__mulhi3+0x6>
 11e:   71 05           cpc r23, r1
 120:   b9 f7           brne    .-18        ; 0x110 <__mulhi3+0xa>
 122:   80 2d           mov r24, r0
 124:   95 2f           mov r25, r21
 126:   08 95           ret

Run Code Online (Sandbox Code Playgroud)

umul16_算法:

00000044 <umul16_>:
  44:   20 e0           ldi r18, 0x00   ; 0
  46:   30 e0           ldi r19, 0x00   ; 0
  48:   61 15           cp  r22, r1
  4a:   71 05           cpc r23, r1
  4c:   49 f0           breq    .+18        ; 0x60 <umul16_+0x1c>
  4e:   60 ff           sbrs    r22, 0
  50:   02 c0           rjmp    .+4         ; 0x56 <umul16_+0x12>
  52:   28 0f           add r18, r24
  54:   39 1f           adc r19, r25
  56:   76 95           lsr r23
  58:   67 95           ror r22
  5a:   88 0f           add r24, r24
  5c:   99 1f           adc r25, r25
  5e:   f4 cf           rjmp    .-24        ; 0x48 <umul16_+0x4>
  60:   c9 01           movw    r24, r18
  62:   08 95           ret

Run Code Online (Sandbox Code Playgroud)

编辑:指令集在此处可用.

Answer 1

Ant*_*nio 17

摘要

考虑交换a和b(原始提案)
试图避免条件跳转(未成功优化)
重塑输入公式(估计增加35%)
删除重复的班次
展开循环:"最佳"装配
说服编译器提供最佳装配

1.考虑交换a和b

一个改进是首先比较a和b,并在以下情况下交换它们a<b:你应该使用b两者中较小的一个,这样你就有了最小的周期数.请注意,您可以通过复制代码来避免交换(if (a<b)然后跳转到镜像代码部分),但我怀疑它是否值得.

2.试图避免条件跳转(未成功优化)

尝试:

uint16_t umul16_(uint16_t a, uint16_t b)
{
    ///Here swap if necessary
    uint16_t accum=0;

    while (b) {
        accum += ((b&1) * uint16_t(0xffff)) & a; //Hopefully this multiplication is optimized away
        b>>=1;
        a+=a;
    }

    return accum;
}

Run Code Online (Sandbox Code Playgroud)

从塞尔吉奥的反馈来看,这并未带来改善.

3.重塑输入公式

考虑到目标体系结构基本上只有8位指令,如果将输入变量的上部和下部8位分开,则可以编写:

a = a1 * 0xff + a0;
b = b1 * 0xff + b0;

a * b = a1 * b1 * 0xffff + a0 * b1 * 0xff + a1 * b0 * 0xff + a0 * b0

Run Code Online (Sandbox Code Playgroud)

现在,很酷的是我们可以抛弃这个术语a1 * b1 * 0xffff,因为它0xffff 会将它发送出你的注册表.

(16bit) a * b = a0 * b1 * 0xff + a1 * b0 * 0xff + a0 * b0

Run Code Online (Sandbox Code Playgroud)

此外,a0*b1和a1*b0项可以被视为8位乘法,因为0xff:任何超过256的部分都将被发送出寄存器.

至今令人兴奋!...但是,现实引人注目:a0 * b0必须将其视为16位乘法,因为您必须保留所有结果位.a0必须保持在16位以允许换挡.这个乘法有一半的迭代a * b,它是部分 8位(因为b0),但你仍然必须考虑前面提到的2 8位乘法,以及最终的结果组合.我们需要进一步重塑!

所以现在我收集b0.

(16bit) a * b = a0 * b1 * 0xff + b0 * (a0 + a1 * 0xff)

Run Code Online (Sandbox Code Playgroud)

但

(a0 + a1 * 0xff) = a

Run Code Online (Sandbox Code Playgroud)

所以我们得到:

(16bit) a * b = a0 * b1 * 0xff + b0 * a

Run Code Online (Sandbox Code Playgroud)

如果N是原始的周期,那么a * b现在第一项是8位乘法,具有N/2个周期,第二项是16位*8位乘法,具有N/2个周期.考虑M原始中每次迭代的指令数a*b,8位*8位迭代具有一半指令,而16位*8位约为M的80%(与b相比,b0的一个移位指令更少).综合起来,我们有了N/2*M/2+N/2*M*0.8 = N*M*0.65复杂性,因此相对于原始设备可以节省约35%N*M.听起来很有希望.

这是代码:

uint16_t umul16_(uint16_t a, uint16_t b)
{
    uint8_t res1 = 0;

    uint8_t a0 = a & 0xff; //This effectively needs to copy the data
    uint8_t b0 = b & 0xff; //This should be optimized away
    uint8_t b1 = b >>8; //This should be optimized away

    //Here a0 and b1 could be swapped (to have b1 < a0)
    while (b1) {///Maximum 8 cycles
        if ( (b1 & 1) )
            res1+=a0;
        b1>>=1;
        a0+=a0;
    }

    uint16_t res = (uint16_t) res1 * 256; //Should be optimized away, it's not even a copy!

    //Here swapping wouldn't make much sense
    while (b0) {///Maximum 8 cycles
        if ( (b0 & 1) )
            res+=a;
        b0>>=1;
        a+=a;
    }

    return res;
}

Run Code Online (Sandbox Code Playgroud)

此外,理论上,在2个周期内的分裂应该是跳过一些周期的机会:N/2可能略微过高估计.

一个微小的进一步改进包括避免a变量的最后一次不必要的转变.小旁注:如果b0或b1为零,则会产生2条额外指令.但它也保存了b0和b1的第一次检查,这是最昂贵的,因为它无法检查zero flagfor循环的条件跳转的移位操作的状态.

uint16_t umul16_(uint16_t a, uint16_t b)
{
    uint8_t res1 = 0;

    uint8_t a0 = a & 0xff; //This effectively needs to copy the data
    uint8_t b0 = b & 0xff; //This should be optimized away
    uint8_t b1 = b >>8; //This should be optimized away

    //Here a0 and b1 could be swapped (to have b1 < a0)
    if ( (b1 & 1) )
        res1+=a0;
    b1>>=1;
    while (b1) {///Maximum 7 cycles
        a0+=a0;
        if ( (b1 & 1) )
            res1+=a0;
        b1>>=1;
    }

    uint16_t res = (uint16_t) res1 * 256; //Should be optimized away, it's not even a copy!

    //Here swapping wouldn't make much sense
    if ( (b0 & 1) )
        res+=a;
    b0>>=1;
    while (b0) {///Maximum 7 cycles
        a+=a;
        if ( (b0 & 1) )
            res+=a;
        b0>>=1;
    }

    return res;
}

Run Code Online (Sandbox Code Playgroud)

4.删除重复的班次

还有改进的空间吗？是的,因为字节数a0被移动了两次.所以组合两个循环应该有好处.说服编译器完全按照我们想要的方式执行操作可能有点棘手,特别是对于结果寄存器.

因此,我们处理在同一周期b0和b1.首先要处理的是循环退出条件？到目前为止,使用b0/ b1清除状态一直很方便,因为它避免使用计数器.此外,在右移之后,如果操作结果为零,则可能已经设置了标志,并且该标志可能允许条件跳转而无需进一步评估.

现在循环退出条件可能是失败的(b0 || b1).然而,这可能需要昂贵的计算.一种解决方案是比较b0和b1并跳转到2个不同的代码段:如果b1 > b0我测试条件b1,否则我测试条件b0.我更喜欢另一个解决方案,有2个循环,第一个退出时b0为零,第二个退出时b1为零.在某些情况下,我将进行零迭代b1.关键是在第二个循环中我知道b0为零,所以我可以减少执行的操作次数.

现在,让我们忘记退出条件并尝试加入上一节的2个循环.

uint16_t umul16_(uint16_t a, uint16_t b)
{
    uint16_t res = 0;

    uint8_t b0 = b & 0xff; //This should be optimized away
    uint8_t b1 = b >>8; //This should be optimized away

    //Swapping probably doesn't make much sense anymore
    if ( (b1 & 1) )
        res+=(uint16_t)((uint8_t)(a && 0xff))*256;
    //Hopefully the compiler understands it has simply to add the low 8bit register of a to the high 8bit register of res

    if ( (b0 & 1) )
        res+=a;

    b1>>=1;
    b0>>=1;
    while (b0) {///N cycles, maximum 7
        a+=a;
        if ( (b1 & 1) )
            res+=(uint16_t)((uint8_t)(a & 0xff))*256;
        if ( (b0 & 1) )
            res+=a;
        b1>>=1;
        b0>>=1; //I try to put as last the one that will leave the carry flag in the desired state
    }

    uint8_t a0 = a & 0xff; //Again, not a real copy but a register selection

    while (b1) {///P cycles, maximum 7 - N cycles
        a0+=a0;
        if ( (b1 & 1) )
            res+=(uint16_t) a0 * 256;
        b1>>=1;
    }
    return res;
}

Run Code Online (Sandbox Code Playgroud)

感谢Sergio提供生成的程序集(-Ofast).乍一看,考虑到mov代码中的大量内容,似乎编译器没有解释,因为我想要他给他解释寄存器的提示.

输入为:r22,r23和r24,25.
AVR指令集:快速参考,详细文档

sbrs //Tests a single bit in a register and skips the next instruction if the bit is set. Skip takes 2 clocks. 
ldi // Load immediate, 1 clock
sbiw // Subtracts immediate to *word*, 2 clocks

    00000010 <umul16_Antonio5>:
      10:    70 ff           sbrs    r23, 0
      12:    39 c0           rjmp    .+114        ; 0x86 <__SREG__+0x47>
      14:    41 e0           ldi    r20, 0x01    ; 1
      16:    00 97           sbiw    r24, 0x00    ; 0
      18:    c9 f1           breq    .+114        ; 0x8c <__SREG__+0x4d>
      1a:    34 2f           mov    r19, r20
      1c:    20 e0           ldi    r18, 0x00    ; 0
      1e:    60 ff           sbrs    r22, 0
      20:    07 c0           rjmp    .+14         ; 0x30 <umul16_Antonio5+0x20>
      22:    28 0f           add    r18, r24
      24:    39 1f           adc    r19, r25
      26:    04 c0           rjmp    .+8          ; 0x30 <umul16_Antonio5+0x20>
      28:    e4 2f           mov    r30, r20
      2a:    45 2f           mov    r20, r21
      2c:    2e 2f           mov    r18, r30
      2e:    34 2f           mov    r19, r20
      30:    76 95           lsr    r23
      32:    66 95           lsr    r22
      34:    b9 f0           breq    .+46         ; 0x64 <__SREG__+0x25>
      36:    88 0f           add    r24, r24
      38:    99 1f           adc    r25, r25
      3a:    58 2f           mov    r21, r24
      3c:    44 27           eor    r20, r20
      3e:    42 0f           add    r20, r18
      40:    53 1f           adc    r21, r19
      42:    70 ff           sbrs    r23, 0
      44:    02 c0           rjmp    .+4          ; 0x4a <__SREG__+0xb>
      46:    24 2f           mov    r18, r20
      48:    35 2f           mov    r19, r21
      4a:    42 2f           mov    r20, r18
      4c:    53 2f           mov    r21, r19
      4e:    48 0f           add    r20, r24
      50:    59 1f           adc    r21, r25
      52:    60 fd           sbrc    r22, 0
      54:    e9 cf           rjmp    .-46         ; 0x28 <umul16_Antonio5+0x18>
      56:    e2 2f           mov    r30, r18
      58:    43 2f           mov    r20, r19
      5a:    e8 cf           rjmp    .-48         ; 0x2c <umul16_Antonio5+0x1c>
      5c:    95 2f           mov    r25, r21
      5e:    24 2f           mov    r18, r20
      60:    39 2f           mov    r19, r25
      62:    76 95           lsr    r23
      64:    77 23           and    r23, r23
      66:    61 f0           breq    .+24         ; 0x80 <__SREG__+0x41>
      68:    88 0f           add    r24, r24
      6a:    48 2f           mov    r20, r24
      6c:    50 e0           ldi    r21, 0x00    ; 0
      6e:    54 2f           mov    r21, r20
      70:    44 27           eor    r20, r20
      72:    42 0f           add    r20, r18
      74:    53 1f           adc    r21, r19
      76:    70 fd           sbrc    r23, 0
      78:    f1 cf           rjmp    .-30         ; 0x5c <__SREG__+0x1d>
      7a:    42 2f           mov    r20, r18
      7c:    93 2f           mov    r25, r19
      7e:    ef cf           rjmp    .-34         ; 0x5e <__SREG__+0x1f>
      80:    82 2f           mov    r24, r18
      82:    93 2f           mov    r25, r19
      84:    08 95           ret
      86:    20 e0           ldi    r18, 0x00    ; 0
      88:    30 e0           ldi    r19, 0x00    ; 0
      8a:    c9 cf           rjmp    .-110        ; 0x1e <umul16_Antonio5+0xe>
      8c:    40 e0           ldi    r20, 0x00    ; 0
      8e:    c5 cf           rjmp    .-118        ; 0x1a <umul16_Antonio5+0xa>

Run Code Online (Sandbox Code Playgroud)

5.展开循环:"最佳"装配

有了所有这些信息,让我们试着理解在给定架构限制的情况下什么是"最优"解决方案.引用"最佳"是因为什么是"最优"取决于输入数据和我们想要优化的内容.假设我们想要在最坏情况下优化周期数.如果我们选择最坏的情况,循环展开是一个合理的选择:我们知道我们有8个循环,我们删除所有测试以了解我们是否完成(如果b0和b1为零).到目前为止,我们使用了"我们转移,我们检查零标志"的技巧来检查我们是否必须退出循环.删除了这个要求,我们可以使用一个不同的技巧:我们移位,并检查进位位(我们在移位时从寄存器发出的位)以了解我是否应该更新结果.给定指令集,在汇编"叙述"代码中,指令变为以下内容.

//Input: a = a1 * 256 + a0, b = b1 * 256 + b0
//Output: r = r1 * 256 + r0

Preliminary:
P0 r0 = 0 (CLR)
P1 r1 = 0 (CLR)

Main block:
0 Shift right b0 (LSR)
1 If carry is not set skip 2 instructions = jump to 4 (BRCC)
2 r0 = r0 + a0 (ADD)
3 r1 = r1 + a1 + carry from prev. (ADC)
4 Shift right b1 (LSR)
5 If carry is not set skip 1 instruction = jump to 7 (BRCC)
6 r1 = r1 + a0 (ADD)
7 a0 = a0 + a0 (ADD)  
8 a1 = a1 + a1 + carry from prev. (ADC)

[Repeat same instructions for another 7 times]

Run Code Online (Sandbox Code Playgroud)

如果没有引起跳转,则分支需要1条指令,否则为2条.所有其他指令都是1个周期.因此b1状态对循环次数没有影响,而如果b0 = 1则有9个循环,如果b0 = 0则有8个循环.计算初始化,8次迭代并跳过a0和a1的最后更新,在更坏的情况下(b0 = 11111111b),我们共有8 * 9 + 2 - 2 = 72个周期.我不知道哪个C++实现会说服编译器生成它.也许:

 void iterate(uint8_t& b0,uint8_t& b1,uint16_t& a, uint16_t& r) {
     const uint8_t temp0 = b0;
     b0 >>=1;
     if (temp0 & 0x01) {//Will this convince him to use the carry flag?
         r += a;
     }
     const uint8_t temp1 = b1;
     b1 >>=1;
     if (temp1 & 0x01) {
         r+=(uint16_t)((uint8_t)(a & 0xff))*256;
     }
     a += a;
 }

 uint16_t umul16_(uint16_t a, uint16_t b) {
     uint16_t r = 0;
     uint8_t b0 = b & 0xff;
     uint8_t b1 = b >>8;

     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r);
     iterate(b0,b1,a,r); //Hopefully he understands he doesn't need the last update for variable a
     return r;
 }

Run Code Online (Sandbox Code Playgroud)

但是,鉴于之前的结果,要真正获得所需的代码,应该真正切换到汇编!

最后,我们还可以考虑对循环展开的更极端的解释:sbrc/sbrs指令允许测试寄存器的特定位.因此,我们可以避免移位b0和b1,并且在每个周期检查不同的位.唯一的问题是这些指令只允许跳过下一条指令,而不是自定义跳转.所以,在"叙事代码"中,它看起来像这样:

Main block:
0 Test Nth bit of b0 (SBRS). If set jump to 2 (+ 1cycle) otherwise continue with 1
1 Jump to 4 (RJMP)
2 r0 = r0 + a0 (ADD)
3 r1 = r1 + a1 + carry from prev. (ADC)
4 Test Nth bit of (SBRC). If cleared jump to 6 (+ 1cycle) otherwise continue with 5
5 r1 = r1 + a0 (ADD)
6 a0 = a0 + a0 (ADD)  
7 a1 = a1 + a1 + carry from prev. (ADC)

Run Code Online (Sandbox Code Playgroud)

虽然第二次替换允许节省1个周期,但在第二次替换中没有明显的优势.但是,我相信C++代码可能更容易为编译器解释.考虑到8个周期,初始化和跳过a0和a1的最后更新,我们现在有64个周期.

C++代码:

 template<uint8_t mask>
 void iterateWithMask(const uint8_t& b0,const uint8_t& b1, uint16_t& a, uint16_t& r) {
     if (b0 & mask)
         r += a;
     if (b1 & mask)
         r+=(uint16_t)((uint8_t)(a & 0xff))*256;
     a += a;
 }

 uint16_t umul16_(uint16_t a, const uint16_t b) {
     uint16_t r = 0;
     const uint8_t b0 = b & 0xff;
     const uint8_t b1 = b >>8;

     iterateWithMask<0x01>(b0,b1,a,r);
     iterateWithMask<0x02>(b0,b1,a,r);
     iterateWithMask<0x04>(b0,b1,a,r);
     iterateWithMask<0x08>(b0,b1,a,r);
     iterateWithMask<0x10>(b0,b1,a,r);
     iterateWithMask<0x20>(b0,b1,a,r);
     iterateWithMask<0x40>(b0,b1,a,r);
     iterateWithMask<0x80>(b0,b1,a,r);

     //Hopefully he understands he doesn't need the last update for a
     return r;
 }

Run Code Online (Sandbox Code Playgroud)

请注意,在此实现中,0x01,0x02不是实际值,而只是提示编译器知道要测试哪个位.因此,无法通过向右移动获得掩码:与目前为止看到的所有其他函数不同,这实际上没有等效的循环版本.

一个大问题是

r+=(uint16_t)((uint8_t)(a & 0xff))*256;

Run Code Online (Sandbox Code Playgroud)

它应该只是r低位寄存器的高位寄存器的总和a.不按我的意愿解释.其他选择:

r+=(uint16_t) 256 *((uint8_t)(a & 0xff));

Run Code Online (Sandbox Code Playgroud)

6.说服编译器提供最佳装配

我们也可以保持a不变,转而改变结果r.在这种情况下,我们b从最重要的位开始处理.复杂性是等价的,但编译器可能更容易消化.此外,这次我们必须小心地明确地写出最后一个循环,它不能再进一步向右移动r.

 template<uint8_t mask>
 void inverseIterateWithMask(const uint8_t& b0,const uint8_t& b1,const uint16_t& a, const uint8_t& a0, uint16_t& r) {
     if (b0 & mask)
         r += a;
     if (b1 & mask)
         r+=(uint16_t)256*a0; //Hopefully easier to understand for the compiler?
     r += r;
 }

 uint16_t umul16_(const uint16_t a, const uint16_t b) {
     uint16_t r = 0;
     const uint8_t b0 = b & 0xff;
     const uint8_t b1 = b >>8;
     const uint8_t a0 = a & 0xff;

     inverseIterateWithMask<0x80>(b0,b1,a,r);
     inverseIterateWithMask<0x40>(b0,b1,a,r);
     inverseIterateWithMask<0x20>(b0,b1,a,r);
     inverseIterateWithMask<0x10>(b0,b1,a,r);
     inverseIterateWithMask<0x08>(b0,b1,a,r);
     inverseIterateWithMask<0x04>(b0,b1,a,r);
     inverseIterateWithMask<0x02>(b0,b1,a,r);

     //Last iteration:
     if (b0 & 0x01)
         r += a;
     if (b1 & 0x01)
         r+=(uint16_t)256*a0;

     return r;
 }

Run Code Online (Sandbox Code Playgroud)

归档时间：	10 年，10 月前
查看次数：	1742 次
最近记录：	10 年，8 月前