Sir*_*ack 21 c c++ algorithm microcontroller avr
我正在寻找一种算法来乘以两个比下面更好的整数.你有一个好主意吗?(MCU - AT Tiny 84/85或类似 - 此代码运行的地方没有mul/div运算符)
uint16_t umul16_(uint16_t a, uint16_t b)
{
uint16_t res=0;
while (b) {
if ( (b & 1) )
res+=a;
b>>=1;
a+=a;
}
return res;
}
Run Code Online (Sandbox Code Playgroud)
当使用avr-gcc编译器为AT Tiny 85/84编译时,该算法几乎与avr-gcc生成的算法__mulhi3相同.
avr-gcc算法:
00000106 <__mulhi3>:
106: 00 24 eor r0, r0
108: 55 27 eor r21, r21
10a: 04 c0 rjmp .+8 ; 0x114 <__mulhi3+0xe>
10c: 08 0e add r0, r24
10e: 59 1f adc r21, r25
110: 88 0f add r24, r24
112: 99 1f adc r25, r25
114: 00 97 sbiw r24, 0x00 ; 0
116: 29 f0 breq .+10 ; 0x122 <__mulhi3+0x1c>
118: 76 95 lsr r23
11a: 67 95 ror r22
11c: b8 f3 brcs .-18 ; 0x10c <__mulhi3+0x6>
11e: 71 05 cpc r23, r1
120: b9 f7 brne .-18 ; 0x110 <__mulhi3+0xa>
122: 80 2d mov r24, r0
124: 95 2f mov r25, r21
126: 08 95 ret
Run Code Online (Sandbox Code Playgroud)
umul16_算法:
00000044 <umul16_>:
44: 20 e0 ldi r18, 0x00 ; 0
46: 30 e0 ldi r19, 0x00 ; 0
48: 61 15 cp r22, r1
4a: 71 05 cpc r23, r1
4c: 49 f0 breq .+18 ; 0x60 <umul16_+0x1c>
4e: 60 ff sbrs r22, 0
50: 02 c0 rjmp .+4 ; 0x56 <umul16_+0x12>
52: 28 0f add r18, r24
54: 39 1f adc r19, r25
56: 76 95 lsr r23
58: 67 95 ror r22
5a: 88 0f add r24, r24
5c: 99 1f adc r25, r25
5e: f4 cf rjmp .-24 ; 0x48 <umul16_+0x4>
60: c9 01 movw r24, r18
62: 08 95 ret
Run Code Online (Sandbox Code Playgroud)
Ant*_*nio 17
摘要
a和b(原始提案)a和b
一个改进是首先比较a和b,并在以下情况下交换它们a<b:你应该使用b两者中较小的一个,这样你就有了最小的周期数.请注意,您可以通过复制代码来避免交换(if (a<b)然后跳转到镜像代码部分),但我怀疑它是否值得.
尝试:
uint16_t umul16_(uint16_t a, uint16_t b)
{
///Here swap if necessary
uint16_t accum=0;
while (b) {
accum += ((b&1) * uint16_t(0xffff)) & a; //Hopefully this multiplication is optimized away
b>>=1;
a+=a;
}
return accum;
}
Run Code Online (Sandbox Code Playgroud)
从塞尔吉奥的反馈来看,这并未带来改善.
考虑到目标体系结构基本上只有8位指令,如果将输入变量的上部和下部8位分开,则可以编写:
a = a1 * 0xff + a0;
b = b1 * 0xff + b0;
a * b = a1 * b1 * 0xffff + a0 * b1 * 0xff + a1 * b0 * 0xff + a0 * b0
Run Code Online (Sandbox Code Playgroud)
现在,很酷的是我们可以抛弃这个术语a1 * b1 * 0xffff,因为它0xffff 会将它发送出你的注册表.
(16bit) a * b = a0 * b1 * 0xff + a1 * b0 * 0xff + a0 * b0
Run Code Online (Sandbox Code Playgroud)
此外,a0*b1和a1*b0项可以被视为8位乘法,因为0xff:任何超过256的部分都将被发送出寄存器.
至今令人兴奋!...但是,现实引人注目:a0 * b0必须将其视为16位乘法,因为您必须保留所有结果位.a0必须保持在16位以允许换挡.这个乘法有一半的迭代a * b,它是部分 8位(因为b0),但你仍然必须考虑前面提到的2 8位乘法,以及最终的结果组合.我们需要进一步重塑!
所以现在我收集b0.
(16bit) a * b = a0 * b1 * 0xff + b0 * (a0 + a1 * 0xff)
Run Code Online (Sandbox Code Playgroud)
但
(a0 + a1 * 0xff) = a
Run Code Online (Sandbox Code Playgroud)
所以我们得到:
(16bit) a * b = a0 * b1 * 0xff + b0 * a
Run Code Online (Sandbox Code Playgroud)
如果N是原始的周期,那么a * b现在第一项是8位乘法,具有N/2个周期,第二项是16位*8位乘法,具有N/2个周期.考虑M原始中每次迭代的指令数a*b,8位*8位迭代具有一半指令,而16位*8位约为M的80%(与b相比,b0的一个移位指令更少).综合起来,我们有了N/2*M/2+N/2*M*0.8 = N*M*0.65复杂性,因此相对于原始设备可以节省约35%N*M.听起来很有希望.
这是代码:
uint16_t umul16_(uint16_t a, uint16_t b)
{
uint8_t res1 = 0;
uint8_t a0 = a & 0xff; //This effectively needs to copy the data
uint8_t b0 = b & 0xff; //This should be optimized away
uint8_t b1 = b >>8; //This should be optimized away
//Here a0 and b1 could be swapped (to have b1 < a0)
while (b1) {///Maximum 8 cycles
if ( (b1 & 1) )
res1+=a0;
b1>>=1;
a0+=a0;
}
uint16_t res = (uint16_t) res1 * 256; //Should be optimized away, it's not even a copy!
//Here swapping wouldn't make much sense
while (b0) {///Maximum 8 cycles
if ( (b0 & 1) )
res+=a;
b0>>=1;
a+=a;
}
return res;
}
Run Code Online (Sandbox Code Playgroud)
此外,理论上,在2个周期内的分裂应该是跳过一些周期的机会:N/2可能略微过高估计.
一个微小的进一步改进包括避免a变量的最后一次不必要的转变.小旁注:如果b0或b1为零,则会产生2条额外指令.但它也保存了b0和b1的第一次检查,这是最昂贵的,因为它无法检查zero flagfor循环的条件跳转的移位操作的状态.
uint16_t umul16_(uint16_t a, uint16_t b)
{
uint8_t res1 = 0;
uint8_t a0 = a & 0xff; //This effectively needs to copy the data
uint8_t b0 = b & 0xff; //This should be optimized away
uint8_t b1 = b >>8; //This should be optimized away
//Here a0 and b1 could be swapped (to have b1 < a0)
if ( (b1 & 1) )
res1+=a0;
b1>>=1;
while (b1) {///Maximum 7 cycles
a0+=a0;
if ( (b1 & 1) )
res1+=a0;
b1>>=1;
}
uint16_t res = (uint16_t) res1 * 256; //Should be optimized away, it's not even a copy!
//Here swapping wouldn't make much sense
if ( (b0 & 1) )
res+=a;
b0>>=1;
while (b0) {///Maximum 7 cycles
a+=a;
if ( (b0 & 1) )
res+=a;
b0>>=1;
}
return res;
}
Run Code Online (Sandbox Code Playgroud)
还有改进的空间吗?是的,因为字节数a0被移动了两次.所以组合两个循环应该有好处.说服编译器完全按照我们想要的方式执行操作可能有点棘手,特别是对于结果寄存器.
因此,我们处理在同一周期b0和b1.首先要处理的是循环退出条件?到目前为止,使用b0/ b1清除状态一直很方便,因为它避免使用计数器.此外,在右移之后,如果操作结果为零,则可能已经设置了标志,并且该标志可能允许条件跳转而无需进一步评估.
现在循环退出条件可能是失败的(b0 || b1).然而,这可能需要昂贵的计算.一种解决方案是比较b0和b1并跳转到2个不同的代码段:如果b1 > b0我测试条件b1,否则我测试条件b0.我更喜欢另一个解决方案,有2个循环,第一个退出时b0为零,第二个退出时b1为零.在某些情况下,我将进行零迭代b1.关键是在第二个循环中我知道b0为零,所以我可以减少执行的操作次数.
现在,让我们忘记退出条件并尝试加入上一节的2个循环.
uint16_t umul16_(uint16_t a, uint16_t b)
{
uint16_t res = 0;
uint8_t b0 = b & 0xff; //This should be optimized away
uint8_t b1 = b >>8; //This should be optimized away
//Swapping probably doesn't make much sense anymore
if ( (b1 & 1) )
res+=(uint16_t)((uint8_t)(a && 0xff))*256;
//Hopefully the compiler understands it has simply to add the low 8bit register of a to the high 8bit register of res
if ( (b0 & 1) )
res+=a;
b1>>=1;
b0>>=1;
while (b0) {///N cycles, maximum 7
a+=a;
if ( (b1 & 1) )
res+=(uint16_t)((uint8_t)(a & 0xff))*256;
if ( (b0 & 1) )
res+=a;
b1>>=1;
b0>>=1; //I try to put as last the one that will leave the carry flag in the desired state
}
uint8_t a0 = a & 0xff; //Again, not a real copy but a register selection
while (b1) {///P cycles, maximum 7 - N cycles
a0+=a0;
if ( (b1 & 1) )
res+=(uint16_t) a0 * 256;
b1>>=1;
}
return res;
}
Run Code Online (Sandbox Code Playgroud)
感谢Sergio提供生成的程序集(-Ofast).乍一看,考虑到mov代码中的大量内容,似乎编译器没有解释,因为我想要他给他解释寄存器的提示.
输入为:r22,r23和r24,25.
AVR指令集:快速参考,详细文档
sbrs //Tests a single bit in a register and skips the next instruction if the bit is set. Skip takes 2 clocks.
ldi // Load immediate, 1 clock
sbiw // Subtracts immediate to *word*, 2 clocks
00000010 <umul16_Antonio5>:
10: 70 ff sbrs r23, 0
12: 39 c0 rjmp .+114 ; 0x86 <__SREG__+0x47>
14: 41 e0 ldi r20, 0x01 ; 1
16: 00 97 sbiw r24, 0x00 ; 0
18: c9 f1 breq .+114 ; 0x8c <__SREG__+0x4d>
1a: 34 2f mov r19, r20
1c: 20 e0 ldi r18, 0x00 ; 0
1e: 60 ff sbrs r22, 0
20: 07 c0 rjmp .+14 ; 0x30 <umul16_Antonio5+0x20>
22: 28 0f add r18, r24
24: 39 1f adc r19, r25
26: 04 c0 rjmp .+8 ; 0x30 <umul16_Antonio5+0x20>
28: e4 2f mov r30, r20
2a: 45 2f mov r20, r21
2c: 2e 2f mov r18, r30
2e: 34 2f mov r19, r20
30: 76 95 lsr r23
32: 66 95 lsr r22
34: b9 f0 breq .+46 ; 0x64 <__SREG__+0x25>
36: 88 0f add r24, r24
38: 99 1f adc r25, r25
3a: 58 2f mov r21, r24
3c: 44 27 eor r20, r20
3e: 42 0f add r20, r18
40: 53 1f adc r21, r19
42: 70 ff sbrs r23, 0
44: 02 c0 rjmp .+4 ; 0x4a <__SREG__+0xb>
46: 24 2f mov r18, r20
48: 35 2f mov r19, r21
4a: 42 2f mov r20, r18
4c: 53 2f mov r21, r19
4e: 48 0f add r20, r24
50: 59 1f adc r21, r25
52: 60 fd sbrc r22, 0
54: e9 cf rjmp .-46 ; 0x28 <umul16_Antonio5+0x18>
56: e2 2f mov r30, r18
58: 43 2f mov r20, r19
5a: e8 cf rjmp .-48 ; 0x2c <umul16_Antonio5+0x1c>
5c: 95 2f mov r25, r21
5e: 24 2f mov r18, r20
60: 39 2f mov r19, r25
62: 76 95 lsr r23
64: 77 23 and r23, r23
66: 61 f0 breq .+24 ; 0x80 <__SREG__+0x41>
68: 88 0f add r24, r24
6a: 48 2f mov r20, r24
6c: 50 e0 ldi r21, 0x00 ; 0
6e: 54 2f mov r21, r20
70: 44 27 eor r20, r20
72: 42 0f add r20, r18
74: 53 1f adc r21, r19
76: 70 fd sbrc r23, 0
78: f1 cf rjmp .-30 ; 0x5c <__SREG__+0x1d>
7a: 42 2f mov r20, r18
7c: 93 2f mov r25, r19
7e: ef cf rjmp .-34 ; 0x5e <__SREG__+0x1f>
80: 82 2f mov r24, r18
82: 93 2f mov r25, r19
84: 08 95 ret
86: 20 e0 ldi r18, 0x00 ; 0
88: 30 e0 ldi r19, 0x00 ; 0
8a: c9 cf rjmp .-110 ; 0x1e <umul16_Antonio5+0xe>
8c: 40 e0 ldi r20, 0x00 ; 0
8e: c5 cf rjmp .-118 ; 0x1a <umul16_Antonio5+0xa>
Run Code Online (Sandbox Code Playgroud)
有了所有这些信息,让我们试着理解在给定架构限制的情况下什么是"最优"解决方案.引用"最佳"是因为什么是"最优"取决于输入数据和我们想要优化的内容.假设我们想要在最坏情况下优化周期数.如果我们选择最坏的情况,循环展开是一个合理的选择:我们知道我们有8个循环,我们删除所有测试以了解我们是否完成(如果b0和b1为零).到目前为止,我们使用了"我们转移,我们检查零标志"的技巧来检查我们是否必须退出循环.删除了这个要求,我们可以使用一个不同的技巧:我们移位,并检查进位位(我们在移位时从寄存器发出的位)以了解我是否应该更新结果.给定指令集,在汇编"叙述"代码中,指令变为以下内容.
//Input: a = a1 * 256 + a0, b = b1 * 256 + b0
//Output: r = r1 * 256 + r0
Preliminary:
P0 r0 = 0 (CLR)
P1 r1 = 0 (CLR)
Main block:
0 Shift right b0 (LSR)
1 If carry is not set skip 2 instructions = jump to 4 (BRCC)
2 r0 = r0 + a0 (ADD)
3 r1 = r1 + a1 + carry from prev. (ADC)
4 Shift right b1 (LSR)
5 If carry is not set skip 1 instruction = jump to 7 (BRCC)
6 r1 = r1 + a0 (ADD)
7 a0 = a0 + a0 (ADD)
8 a1 = a1 + a1 + carry from prev. (ADC)
[Repeat same instructions for another 7 times]
Run Code Online (Sandbox Code Playgroud)
如果没有引起跳转,则分支需要1条指令,否则为2条.所有其他指令都是1个周期.因此b1状态对循环次数没有影响,而如果b0 = 1则有9个循环,如果b0 = 0则有8个循环.计算初始化,8次迭代并跳过a0和a1的最后更新,在更坏的情况下(b0 = 11111111b),我们共有8 * 9 + 2 - 2 = 72个周期.我不知道哪个C++实现会说服编译器生成它.也许:
void iterate(uint8_t& b0,uint8_t& b1,uint16_t& a, uint16_t& r) {
const uint8_t temp0 = b0;
b0 >>=1;
if (temp0 & 0x01) {//Will this convince him to use the carry flag?
r += a;
}
const uint8_t temp1 = b1;
b1 >>=1;
if (temp1 & 0x01) {
r+=(uint16_t)((uint8_t)(a & 0xff))*256;
}
a += a;
}
uint16_t umul16_(uint16_t a, uint16_t b) {
uint16_t r = 0;
uint8_t b0 = b & 0xff;
uint8_t b1 = b >>8;
iterate(b0,b1,a,r);
iterate(b0,b1,a,r);
iterate(b0,b1,a,r);
iterate(b0,b1,a,r);
iterate(b0,b1,a,r);
iterate(b0,b1,a,r);
iterate(b0,b1,a,r);
iterate(b0,b1,a,r); //Hopefully he understands he doesn't need the last update for variable a
return r;
}
Run Code Online (Sandbox Code Playgroud)
但是,鉴于之前的结果,要真正获得所需的代码,应该真正切换到汇编!
最后,我们还可以考虑对循环展开的更极端的解释:sbrc/sbrs指令允许测试寄存器的特定位.因此,我们可以避免移位b0和b1,并且在每个周期检查不同的位.唯一的问题是这些指令只允许跳过下一条指令,而不是自定义跳转.所以,在"叙事代码"中,它看起来像这样:
Main block:
0 Test Nth bit of b0 (SBRS). If set jump to 2 (+ 1cycle) otherwise continue with 1
1 Jump to 4 (RJMP)
2 r0 = r0 + a0 (ADD)
3 r1 = r1 + a1 + carry from prev. (ADC)
4 Test Nth bit of (SBRC). If cleared jump to 6 (+ 1cycle) otherwise continue with 5
5 r1 = r1 + a0 (ADD)
6 a0 = a0 + a0 (ADD)
7 a1 = a1 + a1 + carry from prev. (ADC)
Run Code Online (Sandbox Code Playgroud)
虽然第二次替换允许节省1个周期,但在第二次替换中没有明显的优势.但是,我相信C++代码可能更容易为编译器解释.考虑到8个周期,初始化和跳过a0和a1的最后更新,我们现在有64个周期.
C++代码:
template<uint8_t mask>
void iterateWithMask(const uint8_t& b0,const uint8_t& b1, uint16_t& a, uint16_t& r) {
if (b0 & mask)
r += a;
if (b1 & mask)
r+=(uint16_t)((uint8_t)(a & 0xff))*256;
a += a;
}
uint16_t umul16_(uint16_t a, const uint16_t b) {
uint16_t r = 0;
const uint8_t b0 = b & 0xff;
const uint8_t b1 = b >>8;
iterateWithMask<0x01>(b0,b1,a,r);
iterateWithMask<0x02>(b0,b1,a,r);
iterateWithMask<0x04>(b0,b1,a,r);
iterateWithMask<0x08>(b0,b1,a,r);
iterateWithMask<0x10>(b0,b1,a,r);
iterateWithMask<0x20>(b0,b1,a,r);
iterateWithMask<0x40>(b0,b1,a,r);
iterateWithMask<0x80>(b0,b1,a,r);
//Hopefully he understands he doesn't need the last update for a
return r;
}
Run Code Online (Sandbox Code Playgroud)
请注意,在此实现中,0x01,0x02不是实际值,而只是提示编译器知道要测试哪个位.因此,无法通过向右移动获得掩码:与目前为止看到的所有其他函数不同,这实际上没有等效的循环版本.
一个大问题是
r+=(uint16_t)((uint8_t)(a & 0xff))*256;
Run Code Online (Sandbox Code Playgroud)
它应该只是r低位寄存器的高位寄存器的总和a.不按我的意愿解释.其他选择:
r+=(uint16_t) 256 *((uint8_t)(a & 0xff));
Run Code Online (Sandbox Code Playgroud)
我们也可以保持a不变,转而改变结果r.在这种情况下,我们b从最重要的位开始处理.复杂性是等价的,但编译器可能更容易消化.此外,这次我们必须小心地明确地写出最后一个循环,它不能再进一步向右移动r.
template<uint8_t mask>
void inverseIterateWithMask(const uint8_t& b0,const uint8_t& b1,const uint16_t& a, const uint8_t& a0, uint16_t& r) {
if (b0 & mask)
r += a;
if (b1 & mask)
r+=(uint16_t)256*a0; //Hopefully easier to understand for the compiler?
r += r;
}
uint16_t umul16_(const uint16_t a, const uint16_t b) {
uint16_t r = 0;
const uint8_t b0 = b & 0xff;
const uint8_t b1 = b >>8;
const uint8_t a0 = a & 0xff;
inverseIterateWithMask<0x80>(b0,b1,a,r);
inverseIterateWithMask<0x40>(b0,b1,a,r);
inverseIterateWithMask<0x20>(b0,b1,a,r);
inverseIterateWithMask<0x10>(b0,b1,a,r);
inverseIterateWithMask<0x08>(b0,b1,a,r);
inverseIterateWithMask<0x04>(b0,b1,a,r);
inverseIterateWithMask<0x02>(b0,b1,a,r);
//Last iteration:
if (b0 & 0x01)
r += a;
if (b1 & 0x01)
r+=(uint16_t)256*a0;
return r;
}
Run Code Online (Sandbox Code Playgroud)