nol*_*dda 7 c memory-alignment c-preprocessor
Suppose I have something like:
    void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned int len)
    {
        unsigned int i;
        for(i=0; i<len; i++)
        {
            dest[i] = src[i] & mask[i];
        }
    }
On machines that tolerate unaligned access (e.g. x86), I can make it run faster by writing something like:
    void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned int len)
    {
        unsigned int i;
        unsigned int wordlen = len >> 2;
        for(i=0; i<wordlen; i++)
        {
            // this raises SIGBUS on SPARC and other archs that require aligned access.
            ((uint32_t*)dest)[i] = ((uint32_t*)src)[i] & ((uint32_t*)mask)[i];
        }
        for(i=wordlen<<2; i<len; i++){
            dest[i] = src[i] & mask[i];
        }
    }
However, it needs to build on several architectures, so I would like to do something like:
    void mask_bytes(unsigned char* dest, unsigned char* src, unsigned char* mask, unsigned int len)
    {
        unsigned int i;
        unsigned int wordlen = len >> 2;
    #if defined(__ALIGNED2__) || defined(__ALIGNED4__) || defined(__ALIGNED8__)
        // go slow
        for(i=0; i<len; i++)
        {
            dest[i] = src[i] & mask[i];
        }
    #else
        // go fast
        for(i=0; i<wordlen; i++)
        {
            // the following line will raise SIGBUS on SPARC and other archs that require aligned access.
            ((uint32_t*)dest)[i] = ((uint32_t*)src)[i] & ((uint32_t*)mask)[i];
        }
        for(i=wordlen<<2; i<len; i++){
            dest[i] = src[i] & mask[i];
        }
    #endif
    }
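But I can't find any information on compiler-defined macros (like my hypothetical `__ALIGNED4__` above) that specify the alignment requirement, or any clever way to use the preprocessor to determine the target architecture's alignment. I could just test `defined (__SVR4) && defined (__sun)`, but I'd prefer something that will Just Work™ on other architectures that require aligned memory accesses.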
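While x86 silently fixes up unaligned accesses, relying on that is hardly optimal for performance. It is usually best to assume a certain alignment and perform the fixups yourself: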
    #include <stdint.h> /* uintptr_t */

    unsigned int const alignment = 8; /* or 16, or sizeof(long) */

    /* named my_memcpy to avoid clashing with the standard library's memcpy */
    void my_memcpy(char *dst, char const *src, unsigned int size) {
        if(((uintptr_t)dst % alignment) != ((uintptr_t)src % alignment)) {
            /* no common alignment, copy as bytes or shift around */
        } else {
            if((uintptr_t)dst % alignment) {
                /* copy bytes at the beginning */
            }
            /* copy words in the middle */
            if(((uintptr_t)dst + size) % alignment) {
                /* copy bytes at the end */
            }
        }
    }
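To make the head/middle/tail outline concrete, here is a minimal sketch applied to the masking loop from the question. The name `mask_bytes_aligned` and the choice of `uint32_t` as the word size are assumptions made for illustration. The word step goes through fixed-size `memcpy` calls rather than pointer casts, which sidesteps strict-aliasing concerns; optimizing compilers typically turn such a call into a single load or store.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: AND src with mask into dest, using word-sized
 * steps only on the part of the buffers that is word-aligned. */
void mask_bytes_aligned(unsigned char *dest, const unsigned char *src,
                        const unsigned char *mask, size_t len)
{
    const size_t align = sizeof(uint32_t);
    size_t i = 0;

    /* Word steps are only attempted when all three pointers share the
     * same misalignment, so one head fixup aligns all of them. */
    if ((uintptr_t)dest % align == (uintptr_t)src % align &&
        (uintptr_t)dest % align == (uintptr_t)mask % align) {

        /* Head: single bytes until dest (and thus src and mask) is aligned. */
        while (i < len && (uintptr_t)(dest + i) % align != 0) {
            dest[i] = src[i] & mask[i];
            i++;
        }

        /* Middle: one aligned word per iteration. */
        for (; i + align <= len; i += align) {
            uint32_t s, m, d;
            memcpy(&s, src + i, sizeof s);
            memcpy(&m, mask + i, sizeof m);
            d = s & m;
            memcpy(dest + i, &d, sizeof d);
        }
    }

    /* Tail, or the whole buffer when there is no common alignment. */
    for (; i < len; i++) {
        dest[i] = src[i] & mask[i];
    }
}
```

Because the byte loop at the end covers whatever the word loop left over, the function produces the same result on every architecture; the alignment check only decides how much of the work is done a word at a time.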
Also, have a look at SIMD instructions.
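As one possible illustration of the SIMD route, the sketch below uses SSE2 intrinsics to AND 16 bytes per iteration. It assumes a GCC- or Clang-style compiler that defines `__SSE2__` when SSE2 is available; on any other target the guarded block is skipped and the plain byte loop does all the work. The name `mask_bytes_simd` is hypothetical.

```c
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Hypothetical helper: AND src with mask into dest, 16 bytes at a time
 * where SSE2 is available, byte by byte otherwise. */
void mask_bytes_simd(unsigned char *dest, const unsigned char *src,
                     const unsigned char *mask, size_t len)
{
    size_t i = 0;

#if defined(__SSE2__)
    /* The loadu/storeu variants accept unaligned addresses, so no
     * head fixup is needed on this path. */
    for (; i + 16 <= len; i += 16) {
        __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i m = _mm_loadu_si128((const __m128i *)(mask + i));
        _mm_storeu_si128((__m128i *)(dest + i), _mm_and_si128(s, m));
    }
#endif

    /* Remaining bytes, or everything on targets without SSE2. */
    for (; i < len; i++) {
        dest[i] = src[i] & mask[i];
    }
}
```

Note that the preprocessor test here checks for an instruction-set feature rather than for an alignment requirement, so it fails safe: an unknown architecture simply gets the portable loop.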