Posts by Hyr*_*axK

Maximally optimizing element-wise multiplication with ARM NEON assembly

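I am optimizing an element-wise multiplication of two one-dimensional arrays for a dual Cortex-A9 processor. The board runs Linux, and I am using the GCC 4.5.2 compiler.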

The following is my C++ function, written with inline assembly. src1, src2 and dst are 16-byte aligned.

Update: testable code:

void Multiply(
    const float* __restrict__ src1,
    const float* __restrict__ src2,
    float* __restrict__ dst,
    const unsigned int width,
    const unsigned int height)
{
    int loopBound = (width * height) / 4;
    asm volatile(
        ".loop:                             \n\t"
        "vld1.32  {q1}, [%[src1]:128]!      \n\t"
        "vld1.32  {q2}, [%[src2]:128]!      \n\t"
        "vmul.f32 q0, q1, q2                \n\t"
        "vst1.32  {q0}, [%[dst]:128]!       \n\t"
        "subs     %[lBound], %[lBound], $1  \n\t"
        "bge      .loop                     \n\t"
        :
        :[dst] "r" (dst), [src1] "r" (src1), [src2] "r" (src2),
        [lBound] "r" (loopBound)
        :"memory", "d0", "d1", "d2", …

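To check the output of an assembly loop like the one above, it can help to have a plain baseline to compare against. The sketch below is only an illustration under stated assumptions: `AllocAligned16`, `MultiplyReference` and `MultiplyIntrinsics` are hypothetical names that do not appear in the post, `posix_memalign` is assumed to be available (it is on Linux), and the NEON path is only compiled when the compiler defines `__ARM_NEON` or `__ARM_NEON__` (with 32-bit GCC this typically requires `-mfpu=neon`). On other targets the intrinsics variant falls back to the scalar loop.

```cpp
#include <cstddef>
#include <cstdlib>   // posix_memalign, free (POSIX, assumed available)
#include <stdint.h>  // uintptr_t, for alignment checks by the caller

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: allocate `count` floats on a 16-byte boundary.
// Returns NULL on failure; release the buffer with free().
float* AllocAligned16(std::size_t count)
{
    void* p = NULL;
    if (posix_memalign(&p, 16, count * sizeof(float)) != 0)
        return NULL;
    return static_cast<float*>(p);
}

// Hypothetical scalar reference: dst[i] = src1[i] * src2[i]
// for exactly width * height elements.
void MultiplyReference(const float* src1, const float* src2, float* dst,
                       unsigned int width, unsigned int height)
{
    const std::size_t n = static_cast<std::size_t>(width) * height;
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src1[i] * src2[i];
}

// Hypothetical intrinsics variant: four floats per iteration when NEON
// is available, then a scalar tail for any remaining elements.
void MultiplyIntrinsics(const float* src1, const float* src2, float* dst,
                        unsigned int width, unsigned int height)
{
    const std::size_t n = static_cast<std::size_t>(width) * height;
    std::size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        const float32x4_t a = vld1q_f32(src1 + i);
        const float32x4_t b = vld1q_f32(src2 + i);
        vst1q_f32(dst + i, vmulq_f32(a, b));
    }
#endif
    for (; i < n; ++i)  // scalar tail (or the whole array without NEON)
        dst[i] = src1[i] * src2[i];
}
```

Comparing against such a reference also makes loop-count mistakes visible. In the loop above, for instance, `subs` followed by `bge` would seem to execute the body loopBound + 1 times when loopBound starts at (width * height) / 4, which would read and write one vector past the end of each array.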

c++ optimization assembly arm neon

Hyr*_*axK

2013-01-27

8
votes

1
answer

1497
views