cmpxchg for WORD faster than for BYTE

Question

sig*_*sen 7 c++ assembly multithreading inline-assembly

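Yesterday I posted a question on how to write a fast spinlock. Thanks to Cory Nelson, I seem to have found a method that outperforms the others discussed in that question. I use the CMPXCHG instruction to check whether the lock is 0 and therefore free. CMPXCHG operates on BYTE, WORD and DWORD. I would have assumed that the instruction runs fastest on BYTE, but I wrote a lock for each of the data types: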

inline void spin_lock_8(char* lck)
{
    __asm
    {
        mov ebx, lck                        ;move lck pointer into ebx
        xor cl, cl                          ;set CL to 0
        inc cl                              ;increment CL to 1
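        pause                               ;executes once, before the loop (not on each retry)
        spin_loop:
        xor al, al                          ;set AL to 0, the value expected for a free lock
        lock cmpxchg byte ptr [ebx], cl     ;compare AL with [ebx]. If equal, ZF is set and CL is stored to [ebx]; otherwise [ebx] is loaded into AL
        jnz spin_loop                       ;retry if ZF is clear (the lock was not free)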
    }
}
inline void spin_lock_16(short* lck)
{
    __asm
    {
        mov ebx, lck
        xor cx, cx
        inc cx
        pause
        spin_loop:
        xor ax, ax
        lock cmpxchg word ptr [ebx], cx
        jnz spin_loop
    }
}
inline void spin_lock_32(int* lck)
{
    __asm
    {
        mov ebx, lck
        xor ecx, ecx
        inc ecx
        pause
        spin_loop:
        xor eax, eax
        lock cmpxchg dword ptr [ebx], ecx
        jnz spin_loop
    }
}
inline void spin_unlock(<anyType>* lck)
{
    __asm
    {
        mov ebx, lck
        mov <byte/word/dword> ptr [ebx], 0
    }
}
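
For readers without MSVC inline assembly, the same compare-and-swap lock can be sketched with `std::atomic`. This is only an illustration, not the code that was benchmarked: the templated names `spin_lock` and `spin_unlock` are hypothetical, and the instruction the compiler actually emits for each width is an assumption that should be checked in the generated assembly.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical portable counterpart of spin_lock_8/16/32.
// T selects the lock width (std::uint8_t, std::uint16_t or std::uint32_t).
template <typename T>
void spin_lock(std::atomic<T>& lck) {
    T expected = 0;  // the lock is free when it holds 0
    // compare_exchange_strong stores 1 only if the lock currently holds 0.
    // On failure it overwrites `expected` with the observed value,
    // so `expected` must be reset before the next attempt.
    while (!lck.compare_exchange_strong(expected, T{1},
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed)) {
        expected = 0;
    }
}

template <typename T>
void spin_unlock(std::atomic<T>& lck) {
    // Releasing a lock that is not held would be a caller bug.
    assert(lck.load(std::memory_order_relaxed) != 0);
    lck.store(T{0}, std::memory_order_release);  // plain store, like the mov above
}
```

A call such as `std::atomic<std::uint16_t> lck{0}; spin_lock(lck); spin_unlock(lck);` corresponds to the 16-bit variant.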

I then tested the locks using the following pseudocode (note that the lck pointer always points to an address divisible by 4):

<int/short/char>* lck;
threadFunc()
{
    loop 10,000,000 times
    {
        spin_lock_8/16/32 (lck);
        spin_unlock(lck);
    }
}
main()
{
    lck = (char/short/int*)_aligned_malloc(4, 4);//Ensures memory alignment
    start 1 thread running threadFunc and measure time;
    start 2 threads running threadFunc and measure time;
    start 4 threads running threadFunc and measure time;
    _aligned_free(lck);
}
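
The pseudocode above could be turned into a runnable harness roughly as follows. This is a sketch under assumptions: `time_lock` is a hypothetical helper, it uses `std::atomic` rather than the inline-assembly locks, and it increments a shared counter inside the critical section (the original test leaves it empty) so that mutual exclusion can be checked. Timings from it are therefore not directly comparable to the numbers in the question.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: starts `threads` workers, each performing `iters`
// lock/unlock pairs on one shared lock of width T, and returns the elapsed
// time in milliseconds. `counter` is a plain variable guarded by the lock.
template <typename T>
long long time_lock(int threads, long iters, long long& counter) {
    alignas(4) std::atomic<T> lck{0};  // 4-byte aligned, as in the original test
    assert(reinterpret_cast<std::uintptr_t>(&lck) % 4 == 0);

    auto worker = [&] {
        for (long i = 0; i < iters; ++i) {
            T expected = 0;
            // Spin until the lock changes from 0 (free) to 1 (held by us).
            while (!lck.compare_exchange_strong(expected, T{1},
                                                std::memory_order_acquire,
                                                std::memory_order_relaxed)) {
                expected = 0;  // a failed exchange overwrote `expected`
            }
            ++counter;  // critical section
            lck.store(T{0}, std::memory_order_release);  // unlock
        }
    };

    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

Calling `time_lock<std::uint8_t>`, `time_lock<std::uint16_t>` and `time_lock<std::uint32_t>` with 1, 2 and 4 threads would mirror the three rows and three columns of the measurement; afterwards `counter` should equal `threads * iters` if the lock works.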

I measured the following results, in milliseconds, on a processor with 2 physical cores that can run 4 hardware threads (Ivy Bridge).

           1 thread    2 threads     4 threads
8-bit      200         700           3200
16-bit     200         500           1400
32-bit     200         900           3400

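The data suggest that with a single thread all three functions take the same time to execute. But when multiple threads have to check whether lck == 0, the 16-bit version is significantly faster. Why is that? I don't suppose it has anything to do with the alignment of lck?

Thanks in advance.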

Answer 1

Ale*_*lke 2

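As far as I recall, the lock works on a word (2 bytes). It was written that way when it was first introduced with the 486.

If you take a lock on a different size, it actually generates the equivalent of 2 locks (lock word A and word B for a double word). For a byte, it probably has to prevent the second byte from being locked, which is somewhat similar to 2 locks...

So your results are in line with the CPU optimizations.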

Asked:	13 years, 4 months ago
Viewed:	1025 times
Last active:	12 years, 11 months ago