Related solutions (0)

Why is this code 6.5x slower with optimizations enabled?

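For some reason I wanted to benchmark glibc's strlen function, and found that it apparently runs much slower when optimizations are enabled in GCC. I have no idea why.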

Here is my code:

#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>

int main() {
    char *s = calloc(1 << 20, 1);
    memset(s, 65, 1000000);
    clock_t start = clock();
    for (int i = 0; i < 128; ++i) {
        s[strlen(s)] = 'A';
    }
    clock_t end = clock();
    printf("%lld\n", (long long)(end - start));
    return 0;
}


On my machine, it outputs:

$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415 …


c performance gcc glibc

Tsa*_*arN

2019 10-24

64
Votes

2
Answers

3997
Views

Is it safe to read past the end of a buffer within the same page on x86 and x64?

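Many methods found in high-performance algorithms could be (and are) simplified if they were allowed to read a small amount past the end of their input buffers. Here, "small amount" generally means up to W - 1 bytes past the end, where W is the word size in bytes of the algorithm (e.g., up to 7 bytes for an algorithm processing the input in 64-bit chunks).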
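To make the W - 1 bound concrete, here is a minimal sketch, assuming the buffer is consumed in whole W-byte chunks starting at its first byte. The helper name overread_bytes is hypothetical and does not come from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes read past the end when a buffer of
   `size` bytes is consumed in whole `w`-byte chunks from its first byte. */
static size_t overread_bytes(size_t size, size_t w) {
    assert(w != 0);
    size_t tail = size % w;           /* bytes in the final, partial chunk */
    return tail == 0 ? 0 : w - tail;  /* extra bytes needed to fill it */
}
```

For W = 8, a 17-byte buffer leaves one byte in the last chunk, so the final load reads 7 bytes past the end, which is the worst case.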

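It is clear that writing past the end of an input buffer is never safe in general, since you may clobber data beyond the buffer¹. It is also clear that reading past the end of a buffer into another page may trigger a segmentation fault/access violation, since the next page may not be readable.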

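In the special case of reading aligned values, however, a page fault seems impossible, at least on x86. On that platform, pages (and hence memory protection flags) have 4K granularity (larger pages, e.g. 2MiB or 1GiB, are possible, but these are multiples of 4K), so an aligned read will only access bytes in the same page as the valid part of the buffer.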
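The address arithmetic behind this claim can be sketched as follows, assuming a 4096-byte base page and a power-of-two load width. The names PAGE_BYTES, load_stays_in_page and align_down are illustrative, not from the question, and the sketch only checks addresses; it says nothing about what the C standard or memory checkers permit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BYTES ((uintptr_t)4096)  /* assumed x86 base page size */

/* True if the w-byte load starting at addr touches only one 4 KiB page. */
static bool load_stays_in_page(uintptr_t addr, uintptr_t w) {
    assert(w != 0 && w <= PAGE_BYTES);
    return (addr / PAGE_BYTES) == ((addr + w - 1) / PAGE_BYTES);
}

/* Round addr down to a multiple of w; w must be a power of two. */
static uintptr_t align_down(uintptr_t addr, uintptr_t w) {
    assert(w != 0 && (w & (w - 1)) == 0);
    return addr & ~(w - 1);
}
```

Because 4096 is a multiple of 8, an 8-byte load at an 8-byte-aligned address starts and ends in the same page, whereas an unaligned load starting at 0x1FFD would spill into the page beginning at 0x2000.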

Here is a canonical example of a loop that aligns its input and reads up to 7 bytes past the end of the buffer:

int processBytes(uint8_t *input, size_t size) {

    uint64_t *input64 = (uint64_t *)input, *end64 = (uint64_t *)(input + size);
    int res;

    if (size < 8) {
        // special case for short inputs that we aren't concerned with here
        return shortMethod();
    }

    // check the first 8 bytes
    if ((res = match(*input64)) >= 0) {
        return input + res;
    }

    // align pointer to the next 8-byte …


c optimization performance x86 assembly

Bee*_*ope

2017 05-23

33
Votes

2
Answers

2027
Views