ana*_*lyg 11 c optimization assembly bit-manipulation
我有一些性能很重的代码执行位操作.它可以简化为以下明确定义的问题:
给定一个13位位图,构造一个26位位图,其中包含在偶数位置间隔的原始位.
为了显示:
0000000000000000000abcdefghijklm (input, 32 bits)
0000000a0b0c0d0e0f0g0h0i0j0k0l0m (output, 32 bits)
Run Code Online (Sandbox Code Playgroud)
我目前在C中以下列方式实现它:
if (input & (1 << 12))
output |= 1 << 24;
if (input & (1 << 11))
output |= 1 << 22;
if (input & (1 << 10))
output |= 1 << 20;
...
Run Code Online (Sandbox Code Playgroud)
我的编译器(MS Visual Studio)将其转换为以下内容:
test eax,1000h
jne 0064F5EC
or edx,1000000h
... (repeated 13 times with minor differences in constants)
Run Code Online (Sandbox Code Playgroud)
我想知道我是否可以更快地完成任务.我想用C语言编写代码,但是可以切换到汇编语言.
有一种聪明的方法可以做到这一点,这在这里可能会有所帮助.它实际上解决了一个稍微更普遍的位改组问题.您的问题有以下输入:
+---------------+---------------+---------------+---------------+
|0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|0 0 0 a b c d e|f g h i j k l m|
+---------------+---------------+---------------+---------------+
Run Code Online (Sandbox Code Playgroud)
....但是让我们考虑所有的事情:
+---------------+---------------+---------------+---------------+
|A B C D E F G H|I J K L M N O P|Q R S a b c d e|f g h i j k l m|
+---------------+---------------+---------------+---------------+
Run Code Online (Sandbox Code Playgroud)
并试图将它们全部交错:
+---------------+---------------+---------------+---------------+
|A Q B R C S D a|E b F c G d H e|I f J g K h L i|M j N k O l P m|
+---------------+---------------+---------------+---------------+
Run Code Online (Sandbox Code Playgroud)
第一步,考虑输入的中间部分:
bit 31 24 16 8 0
v v v v v
+---------------+---------------+---------------+---------------+
| |I J K L M N O P|Q R S a b c d e| |
+---------------+---------------+---------------+---------------+
Run Code Online (Sandbox Code Playgroud)
构造8位值:{ I^Q,J^R,K^S,L^a,M^b,N^c,O^d,P^e}.
如果我们将这个8位值与位[15:8]异或,并且还与位[23:16]异或相同的8位值,我们将交换中间两个字节:例如,位23 (原来I)将成为第I ^ (I^Q) = Q15位(原本Q)将成为Q ^ (I^Q) = I.
要做到这一点tmp = (input ^ (input >> 8)) & 0x0000ff00;::
+---------------+---------------+---------------+---------------+
|A B C D E F G H|I J K L M N O P|Q R S a b c d e|f g h i j k l m| input
+---------------+---------------+---------------+---------------+
exclusive-OR with:
+---------------+---------------+---------------+---------------+
|0 0 0 0 0 0 0 0|A B C D E F G H|I J K L M N O P|Q R S a b c d e| input >> 8
+---------------+---------------+---------------+---------------+
-->|want these bits|<--
mask (bitwise AND) with 0x0000ff00:
+---------------+---------------+---------------+---------------+
|0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|1 1 1 1 1 1 1 1|0 0 0 0 0 0 0 0| 0x0000ff00
+---------------+---------------+---------------+---------------+
Run Code Online (Sandbox Code Playgroud)
现在我们需要的8位值是位[15:8],其他所有位都是0.现在我们可以用
input ^= (tmp ^ (tmp << 8));
Run Code Online (Sandbox Code Playgroud)
导致:
+---------------+---------------+---------------+---------------+
|A B C D E F G H|Q R S a b c d e|I J K L M N O P|f g h i j k l m| input
+---------------+---------------+---------------+---------------+
Run Code Online (Sandbox Code Playgroud)
对于下一步,分而治之......执行左半部分中间位的类似交换:
+---------------+---------------+---------------+---------------+
|A B C D E F G H|Q R S a b c d e| | |
+---------------+---------------+---------------+---------------+
becomes
+---------------+---------------+---------------+---------------+
|A B C D Q R S a|E F G H b c d e| | |
+---------------+---------------+---------------+---------------+
Run Code Online (Sandbox Code Playgroud)
......和右半边:
+---------------+---------------+---------------+---------------+
| | |I J K L M N O P|f g h i j k l m|
+---------------+---------------+---------------+---------------+
becomes
+---------------+---------------+---------------+---------------+
| | |I J K L f g h i|M N O P j k l m|
+---------------+---------------+---------------+---------------+
Run Code Online (Sandbox Code Playgroud)
我们可以使用与第一步完全相同的技巧,因为我们希望在32位字的两个16位半部分执行完全相同的操作,我们可以并行执行:
tmp = (input ^ (input >> 4)) & 0x00f000f0;
Run Code Online (Sandbox Code Playgroud)
构造我们将用于交换的两对4位,然后
input ^= (tmp ^ (tmp << 4));
Run Code Online (Sandbox Code Playgroud)
实际上是交换.
在交换完成之前,我们可以继续应用相同的原则.在每个点参与交换的位标记为#:
+---------------+---------------+---------------+---------------+
|A B C D E F G H|I J K L M N O P|Q R S a b c d e|f g h i j k l m|
+---------------+---------------+---------------+---------------+
###############/###############
+---------------+---------------+---------------+---------------+
|A B C D E F G H|Q R S a b c d e|I J K L M N O P|f g h i j k l m|
+---------------+---------------+---------------+---------------+
#######/####### #######/#######
+---------------+---------------+---------------+---------------+
|A B C D Q R S a|E F G H b c d e|I J K L f g h i|M N O P j k l m|
+---------------+---------------+---------------+---------------+
###/### ###/### ###/### ###/###
+---------------+---------------+---------------+---------------+
|A B Q R C D S a|E F b c G H d e|I J f g K L h i|M N j k O P l m|
+---------------+---------------+---------------+---------------+
#/# #/# #/# #/# #/# #/# #/# #/#
+---------------+---------------+---------------+---------------+
|A Q B R C S D a|E b F c G d G e|I f J g K h L i|M j N k O l P m|
+---------------+---------------+---------------+---------------+
Run Code Online (Sandbox Code Playgroud)
码:
tmp = (input ^ (input >> 8)) & 0x0000ff00;
input ^= (tmp ^ (tmp << 8));
tmp = (input ^ (input >> 4)) & 0x00f000f0;
input ^= (tmp ^ (tmp << 4));
tmp = (input ^ (input >> 2)) & 0x0c0c0c0c;
input ^= (tmp ^ (tmp << 2));
tmp = (input ^ (input >> 1)) & 0x22222222;
input ^= (tmp ^ (tmp << 1)); /* = output */
Run Code Online (Sandbox Code Playgroud)
可以通过向后运行4个步骤来执行反向操作:
tmp = (input ^ (input >> 1)) & 0x22222222;
input ^= (tmp ^ (tmp << 1)); /* = output */
tmp = (input ^ (input >> 2)) & 0x0c0c0c0c;
input ^= (tmp ^ (tmp << 2));
tmp = (input ^ (input >> 4)) & 0x00f000f0;
input ^= (tmp ^ (tmp << 4));
tmp = (input ^ (input >> 8)) & 0x0000ff00;
input ^= (tmp ^ (tmp << 8));
Run Code Online (Sandbox Code Playgroud)
虽然您可以针对您的特定应用对此进行改进,但如果已知其他每个位为零:请参阅此处对其他问题的回答.
最后要注意的是,如果没有在您的应用程序中对它们进行基准测试,请不要相信任何人对此处建议的任何方法的相对性能的评论.(特别是,由于从缓存中驱逐大量其他数据,大型查找表在简单的微基准测试中看起来比实际在给定的实际应用中实际更好,这可能对外部循环产生负面影响(S).)
小智 5
使用查找表执行此操作.2 ^ 13听起来像很多条目,但它们很容易适应CPU缓存.
哦,如果其他19位中有垃圾,你需要先将它们屏蔽掉.