Ber*_*Ber 42 embedded arm square-root
我正在寻找一个快速,仅整数算法来找到无符号整数的平方根(整数部分).代码必须在ARM Thumb 2处理器上具有出色的性能.它可以是汇编语言或C代码.
任何提示欢迎.
Cra*_*een 32
Jack W. Crenshaw的整数平方根可以作为另一个参考.
该Ç片段存档也有一个整数的平方根的实现.这个超出了整数结果,并计算答案的额外分数(定点)位.(更新:不幸的是,C片段存档现已不存在.该链接指向该页面的Web存档.)以下是C片段存档中的代码:
#define BITSPERLONG 32
#define TOP2BITS(x) ((x & (3L << (BITSPERLONG-2))) >> (BITSPERLONG-2))
struct int_sqrt {
unsigned sqrt, frac;
};
/* usqrt:
ENTRY x: unsigned long
EXIT returns floor(sqrt(x) * pow(2, BITSPERLONG/2))
Since the square root never uses more than half the bits
of the input, we use the other half of the bits to contain
extra bits of precision after the binary point.
EXAMPLE
suppose BITSPERLONG = 32
then usqrt(144) = 786432 = 12 * 65536
usqrt(32) = 370727 = 5.66 * 65536
NOTES
(1) change BITSPERLONG to BITSPERLONG/2 if you do not want
the answer scaled. Indeed, if you want n bits of
precision after the binary point, use BITSPERLONG/2+n.
The code assumes that BITSPERLONG is even.
(2) This is really better off being written in assembly.
The line marked below is really a "arithmetic shift left"
on the double-long value with r in the upper half
and x in the lower half. This operation is typically
expressible in only one or two assembly instructions.
(3) Unrolling this loop is probably not a bad idea.
ALGORITHM
The calculations are the base-two analogue of the square
root algorithm we all learned in grammar school. Since we're
in base 2, there is only one nontrivial trial multiplier.
Notice that absolutely no multiplications or divisions are performed.
This means it'll be fast on a wide range of processors.
*/
void usqrt(unsigned long x, struct int_sqrt *q)
{
unsigned long a = 0L; /* accumulator */
unsigned long r = 0L; /* remainder */
unsigned long e = 0L; /* trial product */
int i;
for (i = 0; i < BITSPERLONG; i++) /* NOTE 1 */
{
r = (r << 2) + TOP2BITS(x); x <<= 2; /* NOTE 2 */
a <<= 1;
e = (a << 1) + 1;
if (r >= e)
{
r -= e;
a++;
}
}
memcpy(q, &a, sizeof(long));
}
Run Code Online (Sandbox Code Playgroud)
我决定使用以下代码.它主要来自维基百科关于平方根计算方法的文章.但它已被改为使用stdint.h类型uint32_t等.严格来说,返回类型可以更改为uint16_t.
/**
* \brief Fast Square root algorithm
*
* Fractional parts of the answer are discarded. That is:
* - SquareRoot(3) --> 1
* - SquareRoot(4) --> 2
* - SquareRoot(5) --> 2
* - SquareRoot(8) --> 2
* - SquareRoot(9) --> 3
*
* \param[in] a_nInput - unsigned integer for which to find the square root
*
* \return Integer square root of the input value.
*/
uint32_t SquareRoot(uint32_t a_nInput)
{
uint32_t op = a_nInput;
uint32_t res = 0;
uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type
// "one" starts at the highest power of four <= than the argument.
while (one > op)
{
one >>= 2;
}
while (one != 0)
{
if (op >= res + one)
{
op = op - (res + one);
res = res + 2 * one;
}
res >>= 1;
one >>= 2;
}
return res;
}
Run Code Online (Sandbox Code Playgroud)
我发现,好处是,一个相当简单的修改可以返回"圆润"的答案.我发现这在某些应用中非常有用,可以提高准确性.请注意,在这种情况下,返回类型必须是uint32_t因为2 32 - 1 的圆角平方根是2 16.
/**
* \brief Fast Square root algorithm, with rounding
*
* This does arithmetic rounding of the result. That is, if the real answer
* would have a fractional part of 0.5 or greater, the result is rounded up to
* the next integer.
* - SquareRootRounded(2) --> 1
* - SquareRootRounded(3) --> 2
* - SquareRootRounded(4) --> 2
* - SquareRootRounded(6) --> 2
* - SquareRootRounded(7) --> 3
* - SquareRootRounded(8) --> 3
* - SquareRootRounded(9) --> 3
*
* \param[in] a_nInput - unsigned integer for which to find the square root
*
* \return Integer square root of the input value.
*/
uint32_t SquareRootRounded(uint32_t a_nInput)
{
uint32_t op = a_nInput;
uint32_t res = 0;
uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type
// "one" starts at the highest power of four <= than the argument.
while (one > op)
{
one >>= 2;
}
while (one != 0)
{
if (op >= res + one)
{
op = op - (res + one);
res = res + 2 * one;
}
res >>= 1;
one >>= 2;
}
/* Do arithmetic rounding to nearest integer */
if (op > res)
{
res++;
}
return res;
}
Run Code Online (Sandbox Code Playgroud)
Dav*_*ble 15
如果不需要精确的准确度,我有一个快速的近似值,使用260字节的ram(你可以减半,但不要).
int ftbl[33]={0,1,1,2,2,4,5,8,11,16,22,32,45,64,90,128,181,256,362,512,724,1024,1448,2048,2896,4096,5792,8192,11585,16384,23170,32768,46340};
int ftbl2[32]={ 32768,33276,33776,34269,34755,35235,35708,36174,36635,37090,37540,37984,38423,38858,39287,39712,40132,40548,40960,41367,41771,42170,42566,42959,43347,43733,44115,44493,44869,45241,45611,45977};
int fisqrt(int val)
{
int cnt=0;
int t=val;
while (t) {cnt++;t>>=1;}
if (6>=cnt) t=(val<<(6-cnt));
else t=(val>>(cnt-6));
return (ftbl[cnt]*ftbl2[t&31])>>15;
}
Run Code Online (Sandbox Code Playgroud)
这是生成表的代码:
ftbl[0]=0;
for (int i=0;i<32;i++) ftbl[i+1]=sqrt(pow(2.0,i));
printf("int ftbl[33]={0");
for (int i=0;i<32;i++) printf(",%d",ftbl[i+1]);
printf("};\n");
for (int i=0;i<32;i++) ftbl2[i]=sqrt(1.0+i/32.0)*32768;
printf("int ftbl2[32]={");
for (int i=0;i<32;i++) printf("%c%d",(i)?',':' ',ftbl2[i]);
printf("};\n");
Run Code Online (Sandbox Code Playgroud)
在1-> 2 ^ 20的范围内,最大误差为11,在1-> 2 ^ 30的范围内,它大约为256.您可以使用更大的表并将其最小化.值得一提的是,错误总是负面的 - 即当错误时,该值将低于正确的值.
您可以通过精炼阶段来完成这项工作.
这个想法很简单:(ab)^ 0.5 = a ^ 0.b*b ^ 0.5.
所以,我们取输入X = A*B,其中A = 2 ^ N且1 <= B <2然后我们有sqrt(2 ^ N)的查找表,以及sqrt的查找表(1 <= B <2) .我们将sqrt(2 ^ N)的查找表存储为整数,这可能是一个错误(测试显示没有不良影响),并且我们将sqrt(1 <= B <2)的查找表存储在15bit的定点.
我们知道1 <= sqrt(2 ^ N)<65536,因此它是16位,我们知道我们实际上只能在ARM上乘以16bitx15bit而不用担心报复,所以这就是我们所做的.
在实现方面,while(t){cnt ++; t >> = 1;}实际上是一个计数前导位指令(CLB),所以如果你的芯片组版本有这个,你就赢了!另外,如果有一个双向移位器,移位指令很容易实现?这里有一个用于计算最高设置位的Lg [N]算法.
就幻数而言,对于更改表格大小,ftbl2的幻数为32,但请注意6(Lg [32] +1)用于移位.
一种常见的方法是二分法.
hi = number
lo = 0
mid = ( hi + lo ) / 2
mid2 = mid*mid
while( lo < hi-1 and mid2 != number ) {
if( mid2 < number ) {
lo = mid
else
hi = mid
mid = ( hi + lo ) / 2
mid2 = mid*mid
Run Code Online (Sandbox Code Playgroud)
这样的事情应该合理地运作.它使log2(数字)测试,做log2(数字)乘法和除法.由于除数除以2,您可以将其替换为>>.
终止条件可能不是点,所以一定要测试各种整数,以确保2除法不会在两个偶数值之间错误地振荡; 他们的差异将超过1.
小智 7
我发现大多数算法都基于简单的想法,但是以比必要更复杂的方式实现.我从这里采纳了这个想法:http://ww1.microchip.com/downloads/en/AppNotes/91040a.pdf(由Ross M. Fosler撰写)并使其成为一个非常短的C函数:
uint16_t int_sqrt32(uint32_t x)
{
uint16_t res=0;
uint16_t add= 0x8000;
int i;
for(i=0;i<16;i++)
{
uint16_t temp=res | add;
uint32_t g2=temp*temp;
if (x>=g2)
{
res=temp;
}
add>>=1;
}
return res;
}
Run Code Online (Sandbox Code Playgroud)
这在我的blackfin上编译为5个周期/位.我相信如果你使用for循环而不是while循环,你的编译代码通常会更快,并且你获得了确定性时间的额外好处(尽管这在某种程度上取决于你的编译器如何优化if语句.)
这取决于sqrt函数的用法.我经常使用一些近似来制作快速版本.例如,当我需要计算向量模块时:
Module = SQRT( x^2 + y^2)
Run Code Online (Sandbox Code Playgroud)
我用 :
Module = MAX( x,y) + Min(x,y)/2
Run Code Online (Sandbox Code Playgroud)
可以在3或4条指令中编码为:
If (x > y )
Module = x + y >> 1;
Else
Module = y + x >> 1;
Run Code Online (Sandbox Code Playgroud)
小智 6
它并不快,但它小而简单:
int isqrt(int n)
{
int b = 0;
while(n >= 0)
{
n = n - b;
b = b + 1;
n = n - b;
}
return b - 1;
}
Run Code Online (Sandbox Code Playgroud)