我正在考虑如何实现将整数(4byte,unsigned)转换为带有SSE指令的字符串.通常的例程是将数字除以并将其存储在局部变量中,然后反转字符串(在此示例中缺少反转例程):
char *convert(unsigned int num, int base) {
static char buff[33];
char *ptr;
ptr = &buff[sizeof(buff) - 1];
*ptr = '\0';
do {
*--ptr="0123456789abcdef"[num%base];
num /= base;
} while(num != 0);
return ptr;
}
Run Code Online (Sandbox Code Playgroud)
但倒置需要额外的时间.是否有任何其他算法可以优先使用SSE指令来并行化该功能?
ice*_*ord 25
Terje Mathisen发明了一种非常快速的itoa(),它不需要查找表.如果您对其工作原理的解释不感兴趣,请跳至性能或实施.
超过15年前,Terje Mathisen为基数10想出了一个并行化的itoa().这个想法是取32位值并将其分成两个5位数的块.(谷歌快速搜索"Terje Mathisen itoa"发表了这篇文章:http://computer-programming-forum.com/46-asm/7aa4b50bce8dd985.htm)
我们这样开始:
void itoa(char *buf, uint32_t val)
{
lo = val % 100000;
hi = val / 100000;
itoa_half(&buf[0], hi);
itoa_half(&buf[5], lo);
}
Run Code Online (Sandbox Code Playgroud)
现在我们可以只需要一个可以将域[0,99999]中的任何整数转换为字符串的算法.一种天真的方式可能是:
// 0 <= val <= 99999
void itoa_half(char *buf, uint32_t val)
{
// Move all but the first digit to the right of the decimal point.
float tmp = val / 10000.0;
for(size_t i = 0; i < 5; i++)
{
// Extract the next digit.
int digit = (int) tmp;
// Convert to a character.
buf[i] = '0' + (char) digit;
// Remove the lead digit and shift left 1 decimal place.
tmp = (tmp - digit) * 10.0;
}
}
Run Code Online (Sandbox Code Playgroud)
我们将使用4.28定点数学,而不是使用浮点数,因为它在我们的情况下明显更快.也就是说,我们将二进制点固定在第28位位置,使得1.0表示为2 ^ 28.要转换为定点,我们只需乘以2 ^ 28.我们可以通过使用0xf0000000屏蔽来轻松舍入到最接近的整数,并且我们可以通过使用0x0fffffff屏蔽来提取小数部分.
(注意:Terje的算法在定点格式的选择上略有不同.)
所以现在我们有:
typedef uint32_t fix4_28;
// 0 <= val <= 99999
void itoa_half(char *buf, uint32_t val)
{
// Convert `val` to fixed-point and divide by 10000 in a single step.
// N.B. we would overflow a uint32_t if not for the parentheses.
fix4_28 tmp = val * ((1 << 28) / 10000);
for(size_t i = 0; i < 5; i++)
{
int digit = (int)(tmp >> 28);
buf[i] = '0' + (char) digit;
tmp = (tmp & 0x0fffffff) * 10;
}
}
Run Code Online (Sandbox Code Playgroud)
此代码的唯一问题是2 ^ 28/10000 = 26843.5456,它被截断为26843.这会导致某些值不准确.例如,itoa_half(buf,83492)生成字符串"83490".如果我们在转换为4.28定点时应用小校正,则该算法适用于域[0,99999]中的所有数字:
// 0 <= val <= 99999
void itoa_half(char *buf, uint32_t val)
{
fix4_28 const f1_10000 = (1 << 28) / 10000;
// 2^28 / 10000 is 26843.5456, but 26843.75 is sufficiently close.
fix4_28 tmp = val * ((f1_10000 + 1) - (val / 4);
for(size_t i = 0; i < 5; i++)
{
int digit = (int)(tmp >> 28);
buf[i] = '0' + (char) digit;
tmp = (tmp & 0x0fffffff) * 10;
}
}
Run Code Online (Sandbox Code Playgroud)
Terje将itoa_half部分交错为低和高的一半:
void itoa(char *buf, uint32_t val)
{
fix4_28 const f1_10000 = (1 << 28) / 10000;
fix4_28 tmplo, tmphi;
lo = val % 100000;
hi = val / 100000;
tmplo = lo * (f1_10000 + 1) - (lo / 4);
tmphi = hi * (f1_10000 + 1) - (hi / 4);
for(size_t i = 0; i < 5; i++)
{
buf[i + 0] = '0' + (char)(tmphi >> 28);
buf[i + 5] = '0' + (char)(tmplo >> 28);
tmphi = (tmphi & 0x0fffffff) * 10;
tmplo = (tmplo & 0x0fffffff) * 10;
}
}
Run Code Online (Sandbox Code Playgroud)
如果循环完全展开,还有一个额外的技巧可以使代码稍快一些.乘以10实现为LEA + SHL或LEA + ADD序列.我们可以通过乘以5来保存1条指令,这只需要一个LEA.这与通过循环将tmphi和tmplo右移1个位置具有相同的效果,但是我们可以通过调整我们的移位计数和掩码进行补偿:
uint32_t mask = 0x0fffffff;
uint32_t shift = 28;
for(size_t i = 0; i < 5; i++)
{
buf[i + 0] = '0' + (char)(tmphi >> shift);
buf[i + 5] = '0' + (char)(tmplo >> shift);
tmphi = (tmphi & mask) * 5;
tmplo = (tmplo & mask) * 5;
mask >>= 1;
shift--;
}
Run Code Online (Sandbox Code Playgroud)
这仅在循环完全展开时才有用,因为您可以预先计算每次迭代的shift和mask的值.
最后,该例程产生零填充结果.如果val == 0,您可以通过返回指向非0的第一个字符或最后一个字符的指针来删除填充:
char *itoa_unpadded(char *buf, uint32_t val)
{
char *p;
itoa(buf, val);
p = buf;
// Note: will break on GCC, but you can work around it by using memcpy() to dereference p.
if (*((uint64_t *) p) == 0x3030303030303030)
p += 8;
if (*((uint32_t *) p) == 0x30303030)
p += 4;
if (*((uint16_t *) p) == 0x3030)
p += 2;
if (*((uint8_t *) p) == 0x30)
p += 1;
return min(p, &buf[15]);
}
Run Code Online (Sandbox Code Playgroud)
还有一个适用于64位(即AMD64)代码的附加技巧.额外的,更宽的寄存器可以有效地在寄存器中累积每个5位组; 在计算完最后一位数后,您可以将它们与SHRD一起粉碎,或者将它们与0x3030303030303030一起粉碎,并存储到存储器中.这使我的性能提高了大约12.3%.
我们可以在SSE单元上按原样执行上述算法,但性能几乎没有增加.但是,如果我们将值拆分为较小的块,我们可以利用SSE4.1 32位乘法指令.我尝试了三种不同的分裂:
最快的变体是4组3位数.请参阅下面的结果.
除了vitaut和Inge Henriksen建议的算法之外,我测试了许多Terje算法的变体.我通过对输入的详尽测试验证了每个算法的输出与itoa()匹配.
我的号码来自运行Windows 7 64位的Westmere E5640.我以实时优先级为基准并锁定到核心0.我执行每个算法4次以强制所有内容进入缓存.我使用RDTSCP对2 ^ 24个呼叫进行计时,以消除任何动态时钟速度变化的影响.
我计时了5种不同的输入模式:
数据:
ALG TINY MEDIUM LARGE RND256 RND64K NOTES NULL 7 clk 7 clk 7 clk 7 clk 7 clk Benchmark overhead baseline TERJE_C 63 clk 62 clk 63 clk 57 clk 56 clk Best C implementation of Terje's algorithm TERJE_ASM 48 clk 48 clk 50 clk 45 clk 44 clk Naive, hand-written AMD64 version of Terje's algorithm TERJE_SSE 41 clk 42 clk 41 clk 34 clk 35 clk SSE intrinsic version of Terje's algorithm with 1/3/3/3 digit grouping INGE_0 12 clk 31 clk 71 clk 72 clk 72 clk Inge's first algorithm INGE_1 20 clk 23 clk 45 clk 69 clk 96 clk Inge's second algorithm INGE_2 18 clk 19 clk 32 clk 29 clk 36 clk Improved version of Inge's second algorithm VITAUT_0 9 clk 16 clk 32 clk 35 clk 35 clk vitaut's algorithm VITAUT_1 11 clk 15 clk 33 clk 31 clk 30 clk Improved version of vitaut's algorithm LIBC 46 clk 128 clk 329 clk 339 clk 340 clk MSVCRT12 implementation
我的编译器(VS 2013 Update 4)产生了令人惊讶的糟糕代码; Terje算法的汇编版本只是一个简单的翻译,它的速度提高了整整21%.我也对SSE实现的性能感到惊讶,我预计它会更慢.令人惊讶的是INGE_2,VITAUT_0和VITAUT_1的速度有多快.Bravo to vitaut提出了一种便携式解决方案,即便在装配级别上也能做到最好.
注意:INGE_1是Inge Henriksen的第二个算法的修改版本,因为原始版本有错误.
INGE_2基于Inge Henriksen提供的第二种算法.它不是将指针存储在char*[]数组中的预先计算的字符串中,而是将字符串本身存储在char [] [5]数组中.另一个重大改进是如何将字符存储在输出缓冲区中.它存储的字符多于必要的字符数,并使用指针算法返回指向第一个非零字符的指针.结果大大加快 - 即使使用SSE优化版本的Terje算法也具有竞争力.应该注意的是,微基准测试有点偏爱这种算法,因为在实际应用中,600K数据集将不断地吹嘘高速缓存.
VITAUT_1基于vitaut的算法,有两个小的变化.第一个变化是它在主循环中复制字符对,减少了存储指令的数量.与INGE_2类似,VITAUT_1复制两个最终字符并使用指针算法返回指向字符串的指针.
在这里,我为3个最有趣的算法提供代码.
TERJE_ASM:
; char *itoa_terje_asm(char *buf<rcx>, uint32_t val<edx>)
;
; *** NOTE ***
; buf *must* be 8-byte aligned or this code will break!
itoa_terje_asm:
MOV EAX, 0xA7C5AC47
ADD RDX, 1
IMUL RAX, RDX
SHR RAX, 48 ; EAX = val / 100000
IMUL R11D, EAX, 100000
ADD EAX, 1
SUB EDX, R11D ; EDX = (val % 100000) + 1
IMUL RAX, 214748 ; RAX = (val / 100000) * 2^31 / 10000
IMUL RDX, 214748 ; RDX = (val % 100000) * 2^31 / 10000
; Extract buf[0] & buf[5]
MOV R8, RAX
MOV R9, RDX
LEA EAX, [RAX+RAX] ; RAX = (RAX * 2) & 0xFFFFFFFF
LEA EDX, [RDX+RDX] ; RDX = (RDX * 2) & 0xFFFFFFFF
LEA RAX, [RAX+RAX*4] ; RAX *= 5
LEA RDX, [RDX+RDX*4] ; RDX *= 5
SHR R8, 31 ; R8 = buf[0]
SHR R9, 31 ; R9 = buf[5]
; Extract buf[1] & buf[6]
MOV R10, RAX
MOV R11, RDX
LEA EAX, [RAX+RAX] ; RAX = (RAX * 2) & 0xFFFFFFFF
LEA EDX, [RDX+RDX] ; RDX = (RDX * 2) & 0xFFFFFFFF
LEA RAX, [RAX+RAX*4] ; RAX *= 5
LEA RDX, [RDX+RDX*4] ; RDX *= 5
SHR R10, 31 - 8
SHR R11, 31 - 8
AND R10D, 0x0000FF00 ; R10 = buf[1] << 8
AND R11D, 0x0000FF00 ; R11 = buf[6] << 8
OR R10D, R8D ; R10 = buf[0] | (buf[1] << 8)
OR R11D, R9D ; R11 = buf[5] | (buf[6] << 8)
; Extract buf[2] & buf[7]
MOV R8, RAX
MOV R9, RDX
LEA EAX, [RAX+RAX] ; RAX = (RAX * 2) & 0xFFFFFFFF
LEA EDX, [RDX+RDX] ; RDX = (RDX * 2) & 0xFFFFFFFF
LEA RAX, [RAX+RAX*4] ; RAX *= 5
LEA RDX, [RDX+RDX*4] ; RDX *= 5
SHR R8, 31 - 16
SHR R9, 31 - 16
AND R8D, 0x00FF0000 ; R8 = buf[2] << 16
AND R9D, 0x00FF0000 ; R9 = buf[7] << 16
OR R8D, R10D ; R8 = buf[0] | (buf[1] << 8) | (buf[2] << 16)
OR R9D, R11D ; R9 = buf[5] | (buf[6] << 8) | (buf[7] << 16)
; Extract buf[3], buf[4], buf[8], & buf[9]
MOV R10, RAX
MOV R11, RDX
LEA EAX, [RAX+RAX] ; RAX = (RAX * 2) & 0xFFFFFFFF
LEA EDX, [RDX+RDX] ; RDX = (RDX * 2) & 0xFFFFFFFF
LEA RAX, [RAX+RAX*4] ; RAX *= 5
LEA RDX, [RDX+RDX*4] ; RDX *= 5
SHR R10, 31 - 24
SHR R11, 31 - 24
AND R10D, 0xFF000000 ; R10 = buf[3] << 24
AND R11D, 0xFF000000 ; R11 = buf[7] << 24
AND RAX, 0x80000000 ; RAX = buf[4] << 31
AND RDX, 0x80000000 ; RDX = buf[9] << 31
OR R10D, R8D ; R10 = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24)
OR R11D, R9D ; R11 = buf[5] | (buf[6] << 8) | (buf[7] << 16) | (buf[8] << 24)
LEA RAX, [R10+RAX*2] ; RAX = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24) | (buf[4] << 32)
LEA RDX, [R11+RDX*2] ; RDX = buf[5] | (buf[6] << 8) | (buf[7] << 16) | (buf[8] << 24) | (buf[9] << 32)
; Compact the character strings
SHL RAX, 24 ; RAX = (buf[0] << 24) | (buf[1] << 32) | (buf[2] << 40) | (buf[3] << 48) | (buf[4] << 56)
MOV R8, 0x3030303030303030
SHRD RAX, RDX, 24 ; RAX = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24) | (buf[4] << 32) | (buf[5] << 40) | (buf[6] << 48) | (buf[7] << 56)
SHR RDX, 24 ; RDX = buf[8] | (buf[9] << 8)
; Store 12 characters. The last 2 will be null bytes.
OR R8, RAX
LEA R9, [RDX+0x3030]
MOV [RCX], R8
MOV [RCX+8], R9D
; Convert RCX into a bit pointer.
SHL RCX, 3
; Scan the first 8 bytes for a non-zero character.
OR EDX, 0x00000100
TEST RAX, RAX
LEA R10, [RCX+64]
CMOVZ RAX, RDX
CMOVZ RCX, R10
; Scan the next 4 bytes for a non-zero character.
TEST EAX, EAX
LEA R10, [RCX+32]
CMOVZ RCX, R10
SHR RAX, CL ; N.B. RAX >>= (RCX % 64); this works because buf is 8-byte aligned.
; Scan the next 2 bytes for a non-zero character.
TEST AX, AX
LEA R10, [RCX+16]
CMOVZ RCX, R10
SHR EAX, CL ; N.B. RAX >>= (RCX % 32)
; Convert back to byte pointer. N.B. this works because the AMD64 virtual address space is 48-bit.
SAR RCX, 3
; Scan the last byte for a non-zero character.
TEST AL, AL
MOV RAX, RCX
LEA R10, [RCX+1]
CMOVZ RAX, R10
RETN
Run Code Online (Sandbox Code Playgroud)
INGE_2:
uint8_t len100K[100000];
char str100K[100000][5];
void itoa_inge_2_init()
{
memset(str100K, '0', sizeof(str100K));
for(uint32_t i = 0; i < 100000; i++)
{
char buf[6];
itoa(i, buf, 10);
len100K[i] = strlen(buf);
memcpy(&str100K[i][5 - len100K[i]], buf, len100K[i]);
}
}
char *itoa_inge_2(char *buf, uint32_t val)
{
char *p = &buf[10];
uint32_t prevlen;
*p = '\0';
do
{
uint32_t const old = val;
uint32_t mod;
val /= 100000;
mod = old - (val * 100000);
prevlen = len100K[mod];
p -= 5;
memcpy(p, str100K[mod], 5);
}
while(val != 0);
return &p[5 - prevlen];
}
Run Code Online (Sandbox Code Playgroud)
VITAUT_1:
static uint16_t const str100p[100] = {
0x3030, 0x3130, 0x3230, 0x3330, 0x3430, 0x3530, 0x3630, 0x3730, 0x3830, 0x3930,
0x3031, 0x3131, 0x3231, 0x3331, 0x3431, 0x3531, 0x3631, 0x3731, 0x3831, 0x3931,
0x3032, 0x3132, 0x3232, 0x3332, 0x3432, 0x3532, 0x3632, 0x3732, 0x3832, 0x3932,
0x3033, 0x3133, 0x3233, 0x3333, 0x3433, 0x3533, 0x3633, 0x3733, 0x3833, 0x3933,
0x3034, 0x3134, 0x3234, 0x3334, 0x3434, 0x3534, 0x3634, 0x3734, 0x3834, 0x3934,
0x3035, 0x3135, 0x3235, 0x3335, 0x3435, 0x3535, 0x3635, 0x3735, 0x3835, 0x3935,
0x3036, 0x3136, 0x3236, 0x3336, 0x3436, 0x3536, 0x3636, 0x3736, 0x3836, 0x3936,
0x3037, 0x3137, 0x3237, 0x3337, 0x3437, 0x3537, 0x3637, 0x3737, 0x3837, 0x3937,
0x3038, 0x3138, 0x3238, 0x3338, 0x3438, 0x3538, 0x3638, 0x3738, 0x3838, 0x3938,
0x3039, 0x3139, 0x3239, 0x3339, 0x3439, 0x3539, 0x3639, 0x3739, 0x3839, 0x3939, };
char *itoa_vitaut_1(char *buf, uint32_t val)
{
char *p = &buf[10];
*p = '\0';
while(val >= 100)
{
uint32_t const old = val;
p -= 2;
val /= 100;
memcpy(p, &str100p[old - (val * 100)], sizeof(uint16_t));
}
p -= 2;
memcpy(p, &str100p[val], sizeof(uint16_t));
return &p[val < 10];
}
Run Code Online (Sandbox Code Playgroud)
优化代码的第一步是摆脱任意的基础支持.这是因为除以常数几乎肯定是乘法,但是除以除法base是因为'0'+n比"0123456789abcdef"[n](前者没有记忆)更快.
如果你需要超越它,你可以为你关心的基础中的每个字节创建查找表(例如10),然后向每个字节添加(例如十进制)结果.如:
00 02 00 80 (input)
0000000000 (place3[0x00])
+0000131072 (place2[0x02])
+0000000000 (place1[0x00])
+0000000128 (place0[0x80])
==========
0000131200 (result)
Run Code Online (Sandbox Code Playgroud)