我正在为双Cortex-A9处理器优化两个单维阵列的元素乘法.Linux正在运行,我正在使用GCC 4.5.2编译器.
所以以下是我的C++内联汇编程序函数.src1,src2和dst是16字节对齐的.
更新:可测试代码:
void Multiply(
const float* __restrict__ src1,
const float* __restrict__ src2,
float* __restrict__ dst,
const unsigned int width,
const unsigned int height)
{
int loopBound = (width * height) / 4;
asm volatile(
".loop: \n\t"
"vld1.32 {q1}, [%[src1]:128]! \n\t"
"vld1.32 {q2}, [%[src2]:128]! \n\t"
"vmul.f32 q0, q1, q2 \n\t"
"vst1.32 {q0}, [%[dst]:128]! \n\t"
"subs %[lBound], %[lBound], $1 \n\t"
"bge .loop \n\t"
:
:[dst] "r" (dst), [src1] "r" (src1), [src2] "r" (src2),
[lBound] "r" (loopBound)
:"memory", "d0", "d1", "d2", …Run Code Online (Sandbox Code Playgroud)