Related questions and solutions (0)

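Why does glibc's strlen need to be so complicated to run quickly?

I was looking through the strlen code here and wondered whether the optimizations used in the code are really needed. For example, why wouldn't something like the following work equally well or better?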

unsigned long strlen(char s[]) {
    unsigned long i;
    for (i = 0; s[i] != '\0'; i++)
        continue;
    return i;
}

Isn't simpler code better, or at least easier for the compiler to optimize?
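For context, the optimization in question is usually some form of word-at-a-time scanning. The following is only a minimal sketch of that idea, not the glibc code: the name strlen_wordwise is made up for illustration, and it assumes a platform where an aligned 8-byte load that runs past the terminator cannot fault. ISO C does not guarantee that such a read is defined, which is one reason a library can use this trick more safely than application code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero iff at least one byte of w is 0x00 (the classic "haszero" trick). */
static int has_zero_byte(uint64_t w) {
    return ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) != 0;
}

/* Hypothetical name; illustrative only, not the glibc implementation. */
size_t strlen_wordwise(const char *s) {
    const char *p = s;

    /* Step byte by byte until p is 8-byte aligned. */
    while (((uintptr_t)p & 7) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Scan one aligned 8-byte word at a time. */
    for (;;) {
        uint64_t w;
        memcpy(&w, p, sizeof w);  /* assumption: may read past the terminator */
        if (has_zero_byte(w))
            break;
        p += 8;
    }

    /* The terminator is somewhere in this word; find the exact byte. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

The inner loop tests eight bytes per iteration instead of one, which is where the speedup comes from. Whether a compiler can derive this from the simple loop on its own is exactly what the question asks.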

The strlen code on the linked page looks like this:

/* Copyright (C) 1991, 1993, 1997, 2000, 2003 Free Software Foundation, Inc.
   This file is part of the GNU C Library.
   Written by Torbjorn Granlund (tege@sics.se),
   with help from Dan Sahlin (dan@sics.se);
   commentary by Jim Blandy (jimb@ai.mit.edu).

   The GNU C Library is free software; you can redistribute it and/or
   modify it under …

c optimization portability glibc strlen

Author

2019 08-29

283
votes

7
answers

50k
views

Why is strcmp not SIMD optimized?

I tried to compile this program on an x64 computer:

#include <cstring>

int main(int argc, char* argv[])
{
  return ::std::strcmp(argv[0],
    "really really really really really really really really really"
    "really really really really really really really really really"
    "really really really really really really really really really"
    "really really really really really really really really really"
    "really really really really really really really really really"
    "really really really really really really really really really"
    "really really really really really really really really really"
    "really really really really really really …

c++ sse simd strcmp sse2

use*_*108

2014 10-27

36
votes

3
answers

7563
views

Is it safe to read past the end of a buffer within the same page on x86 and x64?

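Many methods found in high-performance algorithms could be (and are) simplified if they were allowed to read a small amount past the end of their input buffers. Here, "small amount" generally means up to W - 1 bytes past the end, where W is the word size in bytes of the algorithm (for example, up to 7 bytes for an algorithm that processes its input in 64-bit chunks).

Writing past the end of an input buffer is clearly unsafe in general, since you may clobber data beyond the buffer¹. It is equally clear that reading past the end of a buffer into another page may trigger a segmentation fault/access violation, since the next page may not be readable.

In the special case of reading aligned values, however, a page fault seems impossible, at least on x86. On that platform, pages (and hence memory protection flags) have 4K granularity (larger pages, e.g. 2MiB or 1GiB, are possible, but these are multiples of 4K), so an aligned read will only access bytes in the same page as the valid part of the buffer.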
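The page argument above is plain address arithmetic, and can be checked with a small sketch. The helper names and the SKETCH_PAGE_SIZE constant are made up for illustration, and the 4 KiB page size is the assumption stated in the question. This only shows that a naturally aligned 8-byte load cannot straddle a page boundary. It says nothing about whether the C language permits the read.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB base pages, as described for x86 above. */
#define SKETCH_PAGE_SIZE ((uintptr_t)4096)

/* True if the `width` bytes starting at `addr` all lie in one page. */
bool within_one_page(uintptr_t addr, uintptr_t width) {
    return (addr / SKETCH_PAGE_SIZE) == ((addr + width - 1) / SKETCH_PAGE_SIZE);
}

/* Round addr down to a multiple of width (width must be a power of two). */
uintptr_t align_down(uintptr_t addr, uintptr_t width) {
    return addr & ~(width - 1);
}
```

For example, an unaligned 8-byte read starting at address 4093 touches two pages, while the aligned word containing that address starts at 4088 and ends at 4095, the last byte of the first page.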

Here is a canonical example of a loop that aligns its input and reads up to 7 bytes past the end of the buffer:

int processBytes(uint8_t *input, size_t size) {

    // both are pointers: the original declared end64 as a plain uint64_t
    uint64_t *input64 = (uint64_t *)input, *end64 = (uint64_t *)(input + size);
    int res;

    if (size < 8) {
        // special case for short inputs that we aren't concerned with here
        return shortMethod();
    }

    // check the first 8 bytes
    if ((res = match(*input64)) >= 0) {
        return res; // offset of the match from the start of input
    }

    // align pointer to the next 8-byte …

c optimization performance x86 assembly

Bee*_*ope

2017 05-23

33
votes

2
answers

2027
views

How can the rep stosb instruction execute faster than the equivalent loop?

How can the instruction rep stosb execute faster than this code?

    Clear: mov byte [edi],AL       ; Write the value in AL to memory
           inc edi                 ; Bump EDI to next byte in the buffer
           dec ecx                 ; Decrement ECX by one position
           jnz Clear               ; And loop again until ECX is 0

Is that guaranteed to be true on all modern CPUs? Should I always prefer rep stosb over writing the loop by hand?
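To measure this rather than guess, rep stosb can be wrapped for use from C. This is a minimal sketch that assumes GCC or Clang extended inline assembly on x86 or x86-64. The helper name fill_bytes is made up, and other targets fall back to memset. It makes no claim about which variant is faster on a given CPU.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: fill n bytes at dst with val. */
void fill_bytes(void *dst, unsigned char val, size_t n) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* rep stosb stores AL to [(E/R)DI], (E/R)CX times, advancing the
       destination pointer. The ABI keeps the direction flag clear. */
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(n)
                     : "a"(val)
                     : "memory");
#else
    memset(dst, val, n);  /* portable fallback for other targets */
#endif
}
```

Timing this against the four-instruction loop over a range of buffer sizes is the most direct way to see how a particular CPU behaves.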

optimization performance x86 assembly micro-optimization

Pro*_*ala

2018 11-27

13
votes

2
answers

6251
views

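Performance of string operations using strlen vs. stopping at the zero terminator

While writing a string class in C++, I ran into strange behavior regarding execution speed. I'll use the following two implementations of an upper method as an example: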

class String {

    char* str;

    ...

    forceinline void upperStrlen();
    forceinline void upperPtr();
};

void String::upperStrlen()
{
    INDEX length = strlen(str);

    for (INDEX i = 0; i < length; i++) {
        str[i] = toupper(str[i]);
    }
}

void String::upperPtr()
{
    char* ptr_char = str;

    for (; *ptr_char != '\0'; ptr_char++) {
        *ptr_char = toupper(*ptr_char);
    }
}

INDEX is a simple typedef for uint_fast32_t.

Now I can test the speed of these methods in main.cpp:

#define TEST_RECURSIVE(_function)                    \
{                                                    \
    bool ok = true;                                  \
    clock_t before = clock();                        \
    for …

c++ string performance gcc x86-64

Eni*_*gma

2020 08-24

6
votes

1
answer

190
views

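Why is calling snprintf() so slow?

Our internal program is written in C and makes extensive use of snprintf() in many places. While debugging with perf record/report, I noticed that it spends a lot of time in the following: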

       │      _IO_vfprintf_internal():
       │        mov    -0x510(%rbp),%rdx
       │        mov    %r12,%rsi
       │        mov    %r15,%rdi
       │      → callq  *0x38(%rax)
       │        cmp    %rax,-0x510(%rbp)
       │        mov    -0x530(%rbp),%r9
       │      ↑ jne    91a
       │        mov    -0x4d0(%rbp),%esi
       │        mov    -0x540(%rbp),%ecx
       │        mov    $0x7fffffff,%eax
       │        sub    %esi,%eax
       │        add    %esi,%ecx
       │        cltq
       │        cmp    %rax,-0x510(%rbp)
       │      ↑ jbe    252b
       │      ↑ jmpq   28f0
       │4a70:   xor    %eax,%eax
 …

c performance x86 assembly gcc

use*_*031

2022 04-16

6
votes

1
answer

1260
views

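GCC optimization flag -O2 makes code much slower than -O0

Here is my code.

foo() is simple. It compares its argument against a series of strings, one by one.

main() is slightly more involved. It just calls foo() with different strings and times the calls, 3 times.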

#include <string.h>
#include <time.h>
#include <stdio.h>
int foo(const char *s)
{
    int r = 0;
    if (strcmp(s, "a11111111") == 0) {
        r = 1;
    } else if (strcmp(s, "b11111111") == 0) {
        r = 2;
    } else if (strcmp(s, "c11111111") == 0) {
        r = 3;
    } else if (strcmp(s, "d11111111") == 0) {
        r = 4;
    } else if (strcmp(s, "e11111111") == 0) {
        r = 5;
    } else if (strcmp(s, …

optimization gcc x86-64

Bin*_* Wu

2021 03-12

6
votes

0
answers

61
views

What prevents the compiler from optimizing a hand-written memcmp()?

Given:

#include <stdbool.h>
#include <string.h>

bool test_data(void *data)
{
    return memcmp(data, "abcd", 4) == 0;
}

The compiler can optimize it into:

test_data:
    cmpl    $1684234849, (%rdi)
    sete    %al
    ret

Which is nice.
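For readers wondering where the immediate comes from: 1684234849 is 0x64636261, the bytes 'a', 'b', 'c', 'd' read as one little-endian 32-bit integer. The sketch below spells out that arithmetic and the equivalent single-word comparison in C. The function names are made up for illustration, and ASCII is assumed.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Load 4 bytes as one 32-bit value without alignment or aliasing problems. */
static uint32_t load_u32(const void *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Same result as memcmp(data, "abcd", 4) == 0, written as one word compare. */
bool test_data_word(const void *data) {
    return load_u32(data) == load_u32("abcd");
}

/* "abcd" assembled by hand as a little-endian integer (assumes ASCII). */
uint32_t abcd_little_endian(void) {
    return (uint32_t)'a'
         | ((uint32_t)'b' << 8)
         | ((uint32_t)'c' << 16)
         | ((uint32_t)'d' << 24);
}
```

The compiler can fold the four byte comparisons into one because it knows the semantics of the library memcmp, including that all four bytes may be read.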

But if I use my own memcmp() (not the one from <string.h>), the compiler cannot optimize it into a single cmpl instruction. Instead, it produces this:

test_data:
    cmpb    $97, (%rdi)
    jne     .L5
    cmpb    $98, 1(%rdi)
    jne     .L5
    cmpb    $99, 2(%rdi)
    jne     .L5
    cmpb    $100, 3(%rdi)
    sete    %al
    ret
.L5:
    xorl    %eax, %eax
    ret

Link: https://godbolt.org/z/Kfhchr45a

What prevents the compiler from optimizing it further?
Am I doing something that inhibits the optimization?

c optimization assembly x86-64 memcmp

Amm*_*izi

2023 09-01

5
votes

2
answers

333
views

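Why is the Java Vector API so slow compared to scalar code?

I recently decided to try out Java's new incubating Vector API to see how fast it can get. I implemented two fairly simple methods, one for parsing an int and one for finding the index of a character in a string. In both cases, my vectorized methods were incredibly slow compared to their scalar equivalents.

Here is my code: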

public class SIMDParse {

private static IntVector mul = IntVector.fromArray(
        IntVector.SPECIES_512,
        new int[] {0, 0, 0, 0, 0, 0, 1000000000, 100000000, 10000000, 1000000, 100000, 10000, 1000, 100, 10, 1},
        0
);
private static byte zeroChar = (byte) '0';
private static int width = IntVector.SPECIES_512.length();
private static byte[] filler;

static {
    filler = new byte[16];
    for (int i = 0; i < 16; i++) {
        filler[i] = zeroChar;
    }
}

public static int parseInt(String str) …

java simd vectorization

Red*_*mpt

lucky-day

3
votes

1
answer

2016
views

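Coaxing GCC to emit REPE CMPSB

How can I coax the GCC compiler into emitting the REPE CMPSB instruction in plain C, without the "asm" and "_emit" keywords, calls to an included library, or compiler intrinsics?

I tried some C code like the one listed below, but without success: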

unsigned int repe_cmpsb(unsigned char *esi, unsigned char *edi, unsigned int ecx) {

    /* Test the count first so no byte is read once ecx reaches 0. */
    for (; (ecx != 0) && (*esi == *edi); esi++, edi++, ecx--);

    return ecx;
}

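See how GCC compiles it at this link: https://godbolt.org/g/obJbpq

P.S.
I realize there is no guarantee that the compiler will compile C code in any particular way, but I'd still like to coax it for fun, just to see how smart it is.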

c performance x86 assembly gcc

Geo*_*son

2018 03-20

2
votes

1
answer

415
views