寻找ARM Thumb2的有效整数平方根算法

Ber*_*Ber 42 embedded arm square-root

我正在寻找一个快速,仅整数算法来找到无符号整数的平方根(整数部分).代码必须在ARM Thumb 2处理器上具有出色的性能.它可以是汇编语言或C代码.

任何提示欢迎.

Cra*_*een 32

Jack W. Crenshaw的整数平方根可以作为另一个参考.

Ç片段存档也有一个整数的平方根的实现.这个超出了整数结果,并计算答案的额外分数(定点)位.(更新:不幸的是,C片段存档现已不存在.该链接指向该页面的Web存档.)以下是C片段存档中的代码:

#define BITSPERLONG 32
#define TOP2BITS(x) ((x & (3L << (BITSPERLONG-2))) >> (BITSPERLONG-2))

struct int_sqrt {
    unsigned sqrt, frac;
};

/* usqrt:
    ENTRY x: unsigned long
    EXIT  returns floor(sqrt(x) * pow(2, BITSPERLONG/2))

    Since the square root never uses more than half the bits
    of the input, we use the other half of the bits to contain
    extra bits of precision after the binary point.

    EXAMPLE
        suppose BITSPERLONG = 32
        then    usqrt(144) = 786432 = 12 * 65536
                usqrt(32) = 370727 = 5.66 * 65536

    NOTES
        (1) change BITSPERLONG to BITSPERLONG/2 if you do not want
            the answer scaled.  Indeed, if you want n bits of
            precision after the binary point, use BITSPERLONG/2+n.
            The code assumes that BITSPERLONG is even.
        (2) This is really better off being written in assembly.
            The line marked below is really a "arithmetic shift left"
            on the double-long value with r in the upper half
            and x in the lower half.  This operation is typically
            expressible in only one or two assembly instructions.
        (3) Unrolling this loop is probably not a bad idea.

    ALGORITHM
        The calculations are the base-two analogue of the square
        root algorithm we all learned in grammar school.  Since we're
        in base 2, there is only one nontrivial trial multiplier.

        Notice that absolutely no multiplications or divisions are performed.
        This means it'll be fast on a wide range of processors.
*/

void usqrt(unsigned long x, struct int_sqrt *q)
{
      unsigned long a = 0L;                   /* accumulator      */
      unsigned long r = 0L;                   /* remainder        */
      unsigned long e = 0L;                   /* trial product    */

      int i;

      for (i = 0; i < BITSPERLONG; i++)   /* NOTE 1 */
      {
            r = (r << 2) + TOP2BITS(x); x <<= 2; /* NOTE 2 */
            a <<= 1;
            e = (a << 1) + 1;
            if (r >= e)
            {
                  r -= e;
                  a++;
            }
      }
      memcpy(q, &a, sizeof(long));
}
Run Code Online (Sandbox Code Playgroud)

我决定使用以下代码.它主要来自维基百科关于平方根计算方法的文章.但它已被改为使用stdint.h类型uint32_t等.严格来说,返回类型可以更改为uint16_t.

/**
 * \brief    Fast Square root algorithm
 *
 * Fractional parts of the answer are discarded. That is:
 *      - SquareRoot(3) --> 1
 *      - SquareRoot(4) --> 2
 *      - SquareRoot(5) --> 2
 *      - SquareRoot(8) --> 2
 *      - SquareRoot(9) --> 3
 *
 * \param[in] a_nInput - unsigned integer for which to find the square root
 *
 * \return Integer square root of the input value.
 */
uint32_t SquareRoot(uint32_t a_nInput)
{
    uint32_t op  = a_nInput;
    uint32_t res = 0;
    uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type


    // "one" starts at the highest power of four <= than the argument.
    while (one > op)
    {
        one >>= 2;
    }

    while (one != 0)
    {
        if (op >= res + one)
        {
            op = op - (res + one);
            res = res +  2 * one;
        }
        res >>= 1;
        one >>= 2;
    }
    return res;
}
Run Code Online (Sandbox Code Playgroud)

我发现,好处是,一个相当简单的修改可以返回"圆润"的答案.我发现这在某些应用中非常有用,可以提高准确性.请注意,在这种情况下,返回类型必须是uint32_t因为2 32  - 1 的圆角平方根是2 16.

/**
 * \brief    Fast Square root algorithm, with rounding
 *
 * This does arithmetic rounding of the result. That is, if the real answer
 * would have a fractional part of 0.5 or greater, the result is rounded up to
 * the next integer.
 *      - SquareRootRounded(2) --> 1
 *      - SquareRootRounded(3) --> 2
 *      - SquareRootRounded(4) --> 2
 *      - SquareRootRounded(6) --> 2
 *      - SquareRootRounded(7) --> 3
 *      - SquareRootRounded(8) --> 3
 *      - SquareRootRounded(9) --> 3
 *
 * \param[in] a_nInput - unsigned integer for which to find the square root
 *
 * \return Integer square root of the input value.
 */
uint32_t SquareRootRounded(uint32_t a_nInput)
{
    uint32_t op  = a_nInput;
    uint32_t res = 0;
    uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type


    // "one" starts at the highest power of four <= than the argument.
    while (one > op)
    {
        one >>= 2;
    }

    while (one != 0)
    {
        if (op >= res + one)
        {
            op = op - (res + one);
            res = res +  2 * one;
        }
        res >>= 1;
        one >>= 2;
    }

    /* Do arithmetic rounding to nearest integer */
    if (op > res)
    {
        res++;
    }

    return res;
}
Run Code Online (Sandbox Code Playgroud)

  • @RobertBasler:你测量它很好.这种事情是非常特定于硬件的; 在您的情况下(在具有浮点硬件的处理器上),当然值得进行比较.我希望这些整数平方根算法对于没有浮点硬件的嵌入式系统更有用. (6认同)
  • 出于好奇,我基于64位转换对静态广播C库sqrt函数进行基准测试以获得整数结果,我发现这个速度要慢8.2倍.因人而异.有关更多数据,请访问http://onemanmmo.com/?sqrt (4认同)
  • 对于Cortex M3和兄弟,第一个循环可以用前导零计数和掩码操作代替:one >> = clz(op)&~0x3; 停掉好~30个周期. (3认同)
  • IEEE双精度浮点可以精确地表示高达~53位(尾数的大小)的整数,但除此之外,结果是不精确的.整数sqrt的一个优点是它总能给出确切的答案. (2认同)

Dav*_*ble 15

如果不需要精确的准确度,我有一个快速的近似值,使用260字节的ram(你可以减半,但不要).

int ftbl[33]={0,1,1,2,2,4,5,8,11,16,22,32,45,64,90,128,181,256,362,512,724,1024,1448,2048,2896,4096,5792,8192,11585,16384,23170,32768,46340};
int ftbl2[32]={ 32768,33276,33776,34269,34755,35235,35708,36174,36635,37090,37540,37984,38423,38858,39287,39712,40132,40548,40960,41367,41771,42170,42566,42959,43347,43733,44115,44493,44869,45241,45611,45977};

int fisqrt(int val)
{
    int cnt=0;
    int t=val;
    while (t) {cnt++;t>>=1;}
    if (6>=cnt)    t=(val<<(6-cnt));
    else           t=(val>>(cnt-6));

    return (ftbl[cnt]*ftbl2[t&31])>>15;
}
Run Code Online (Sandbox Code Playgroud)

这是生成表的代码:

ftbl[0]=0;
for (int i=0;i<32;i++) ftbl[i+1]=sqrt(pow(2.0,i));
printf("int ftbl[33]={0");
for (int i=0;i<32;i++) printf(",%d",ftbl[i+1]);
printf("};\n");

for (int i=0;i<32;i++) ftbl2[i]=sqrt(1.0+i/32.0)*32768;
printf("int ftbl2[32]={");
for (int i=0;i<32;i++) printf("%c%d",(i)?',':' ',ftbl2[i]);
printf("};\n");
Run Code Online (Sandbox Code Playgroud)

在1-> 2 ^ 20的范围内,最大误差为11,在1-> 2 ^ 30的范围内,它大约为256.您可以使用更大的表并将其最小化.值得一提的是,错误总是负面的 - 即当错误时,该值将低于正确的值.

您可以通过精炼阶段来完成这项工作.

这个想法很简单:(ab)^ 0.5 = a ^ 0.b*b ^ 0.5.

所以,我们取输入X = A*B,其中A = 2 ^ N且1 <= B <2然后我们有sqrt(2 ^ N)的查找表,以及sqrt的查找表(1 <= B <2) .我们将sqrt(2 ^ N)的查找表存储为整数,这可能是一个错误(测试显示没有不良影响),并且我们将sqrt(1 <= B <2)的查找表存储在15bit的定点.

我们知道1 <= sqrt(2 ^ N)<65536,因此它是16位,我们知道我们实际上只能在ARM上乘以16bitx15bit而不用担心报复,所以这就是我们所做的.

在实现方面,while(t){cnt ++; t >> = 1;}实际上是一个计数前导位指令(CLB),所以如果你的芯片组版本有这个,你就赢了!另外,如果有一个双向移位器,移位指令很容易实现?这里有一个用于计算最高设置位的Lg [N]算法.

就幻数而言,对于更改表格大小,ftbl2的幻数为32,但请注意6(Lg [32] +1)用于移位.


S.L*_*ott 8

一种常见的方法是二分法.

hi = number
lo = 0
mid = ( hi + lo ) / 2
mid2 = mid*mid
while( lo < hi-1 and mid2 != number ) {
    if( mid2 < number ) {
        lo = mid
    else
        hi = mid
    mid = ( hi + lo ) / 2
    mid2 = mid*mid
Run Code Online (Sandbox Code Playgroud)

这样的事情应该合理地运作.它使log2(数字)测试,做log2(数字)乘法和除法.由于除数除以2,您可以将其替换为>>.

终止条件可能不是点,所以一定要测试各种整数,以确保2除法不会在两个偶数值之间错误地振荡; 他们的差异将超过1.


小智 7

我发现大多数算法都基于简单的想法,但是以比必要更复杂的方式实现.我从这里采纳了这个想法:http://ww1.microchip.com/downloads/en/AppNotes/91040a.pdf(由Ross M. Fosler撰写)并使其成为一个非常短的C函数:

uint16_t int_sqrt32(uint32_t x)
{
    uint16_t res=0;
    uint16_t add= 0x8000;   
    int i;
    for(i=0;i<16;i++)
    {
        uint16_t temp=res | add;
        uint32_t g2=temp*temp;      
        if (x>=g2)
        {
            res=temp;           
        }
        add>>=1;
    }
    return res;
}
Run Code Online (Sandbox Code Playgroud)

这在我的blackfin上编译为5个周期/位.我相信如果你使用for循环而不是while循环,你的编译代码通常会更快,并且你获得了确定性时间的额外好处(尽管这在某种程度上取决于你的编译器如何优化if语句.)

  • 这里有一个小错误.在表达式"temp*temp"中,您需要将任一操作数强制转换为uint32_t,以确保乘法是在32位算术中完成的,而不是16位.由于这个原因,代码as-is在AVR上不起作用(但似乎在int为32位的平台上,由于defaut升级,但它仍然可能导致整数溢出). (5认同)
  • 另一件事:“ uint16_t add = 0x8000;” 应该更改为“ uint16_t add = UINT16_C(0x8000);”。 (2认同)

Yaz*_*zou 7

这取决于sqrt函数的用法.我经常使用一些近似来制作快速版本.例如,当我需要计算向量模块时:

Module = SQRT( x^2 + y^2)
Run Code Online (Sandbox Code Playgroud)

我用 :

Module = MAX( x,y) + Min(x,y)/2
Run Code Online (Sandbox Code Playgroud)

可以在3或4条指令中编码为:

If (x > y )
  Module  = x + y >> 1;
Else
   Module  = y + x >> 1;
Run Code Online (Sandbox Code Playgroud)


小智 6

它并不快,但它小而简单:

int isqrt(int n)
{
  int b = 0;

  while(n >= 0)
  {
    n = n - b;
    b = b + 1;
    n = n - b;
  }

  return b - 1;
}
Run Code Online (Sandbox Code Playgroud)