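I was browsing the strlen code here and wondered whether the optimizations used in that code are really needed. For example, why wouldn't something like the following work equally well or better?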
unsigned long strlen(char s[]) {
    unsigned long i;
    for (i = 0; s[i] != '\0'; i++)
        continue;
    return i;
}
Isn't simpler code better, or at least easier, for the compiler to optimize?
The strlen code on the page behind the link looks like this:
/* Copyright (C) 1991, 1993, 1997, 2000, 2003 Free Software Foundation, Inc. This file is part of the GNU C Library. Written by Torbjorn Granlund (tege@sics.se), with help from Dan Sahlin (dan@sics.se); commentary by Jim Blandy (jimb@ai.mit.edu). The GNU C Library is free software; you can redistribute it and/or modify it under …
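For context, the optimization in question usually amounts to scanning the string one machine word at a time instead of one byte at a time. The following is a minimal sketch of that idea and not the glibc implementation. The names strlen_wordwise and has_zero_byte are made up for illustration. Note that the word-sized read may touch bytes after the terminator, which ISO C does not sanction; real libraries rely on platform-specific guarantees for this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the eight bytes in w is zero (classic bit trick). */
static int has_zero_byte(uint64_t w)
{
    return ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) != 0;
}

/* Hypothetical name: a sketch of word-at-a-time scanning, not glibc's code. */
size_t strlen_wordwise(const char *s)
{
    assert(s != NULL);
    const char *p = s;

    /* Step 1: go byte by byte until p is 8-byte aligned. */
    while ((uintptr_t)p % sizeof(uint64_t) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Step 2: examine one aligned 8-byte word per iteration.
       ASSUMPTION: reading the whole aligned word is harmless on this
       platform even if the terminator sits in the middle of it. */
    for (;;) {
        uint64_t w;
        memcpy(&w, p, sizeof w);
        if (has_zero_byte(w))
            break;
        p += sizeof w;
    }

    /* Step 3: locate the exact zero byte inside the final word. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

Whether this beats the simple loop depends on the compiler, the target, and the string lengths involved, so it is worth measuring rather than assuming.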
I tried to compile this program on an x64 computer:
#include <cstring>
int main(int argc, char* argv[])
{
    return ::std::strcmp(argv[0],
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really really really really"
        "really really really really really really …
Many methods found in high-performance algorithms could be (and are) simplified if they were allowed to read a small amount past the end of their input buffers. Here, "small amount" generally means up to W - 1 bytes past the end, where W is the word size in bytes of the algorithm (e.g., up to 7 bytes for an algorithm that processes the input in 64-bit chunks).
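It is clear that writing past the end of an input buffer is generally unsafe, since you may clobber data beyond the buffer. It is also clear that reading past the end of a buffer into another page may trigger a segmentation fault or access violation, since the next page may not be readable.
In the special case of reading aligned values, however, a page fault seems impossible, at least on x86. On that platform, pages (and hence memory protection flags) have 4K granularity (larger pages, e.g. 2MiB or 1GiB, are possible, but these are multiples of 4K), so an aligned read only accesses bytes in the same page as the valid part of the buffer.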
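The alignment argument can be checked with a little arithmetic. The sketch below assumes a 4 KiB base page and power-of-two access widths; the names ASSUMED_PAGE_SIZE, within_one_page and align_down are hypothetical helpers introduced only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ASSUMED_PAGE_SIZE 4096u /* ASSUMPTION: 4 KiB base page, as on x86 */

/* True if the byte range [addr, addr + width) lies inside a single page. */
bool within_one_page(uintptr_t addr, size_t width)
{
    assert(width > 0);
    return addr / ASSUMED_PAGE_SIZE == (addr + width - 1) / ASSUMED_PAGE_SIZE;
}

/* Round addr down to a multiple of width (width must be a power of two). */
uintptr_t align_down(uintptr_t addr, size_t width)
{
    assert(width != 0 && (width & (width - 1)) == 0);
    return addr & ~(uintptr_t)(width - 1);
}
```

An 8-byte read starting at address 4093 would span two pages, while the aligned word that contains byte 4093 starts at 4088 and ends at 4095, so it stays inside the first page. This only shows that no page boundary is crossed; it says nothing about whether the C language permits the read.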
Here is a canonical example of a loop that aligns its input and reads up to 7 bytes past the end of the buffer:
int processBytes(uint8_t *input, size_t size) {
    uint64_t *input64 = (uint64_t *)input, *end64 = (uint64_t *)(input + size);
    int res;
    if (size < 8) {
        // special case for short inputs that we aren't concerned with here
        return shortMethod();
    }
    // check the first 8 bytes
    if ((res = match(*input64)) >= 0) {
        return input + res;
    }
    // align pointer to the next 8-byte …
How can the rep stosb instruction execute faster than this code?
Clear: mov byte [edi],AL ; Write the value in AL to memory
inc edi ; Bump EDI to next byte in the buffer
dec ecx ; Decrement ECX by one position
jnz Clear ; And loop again until ECX is 0
Is that guaranteed on all modern CPUs? Should I always prefer rep stosb over writing the loop by hand?
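For readers who think in C rather than assembly, here is a rough rendering of what the four-instruction loop does, next to the portable alternative of calling memset. This is only a sketch: the function names are hypothetical, and which instructions a given compiler or C library actually emits for either function is not something this code guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* C rendering of the hand-written loop. Like the assembly, it stores
   before it tests the counter, so it must not be called with count == 0. */
void clear_bytewise(unsigned char *dst, unsigned char value, size_t count)
{
    assert(count > 0);
    do {
        *dst = value;        /* mov byte [edi], al */
        dst++;               /* inc edi            */
        count--;             /* dec ecx            */
    } while (count != 0);    /* jnz Clear          */
}

/* Portable alternative: leave the choice of strategy to the C library. */
void clear_with_memset(unsigned char *dst, unsigned char value, size_t count)
{
    memset(dst, value, count);
}
```

Both functions produce the same bytes in memory for any count greater than zero, so they can be swapped and timed against each other on the hardware of interest.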
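While writing a string class in C++, I noticed some strange behavior regarding execution speed. I'll take the following two implementations of the upper method as an example: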
class String {
char* str;
...
forceinline void upperStrlen();
forceinline void upperPtr();
};
void String::upperStrlen()
{
INDEX length = strlen(str);
for (INDEX i = 0; i < length; i++) {
str[i] = toupper(str[i]);
}
}
void String::upperPtr()
{
char* ptr_char = str;
for (; *ptr_char != '\0'; ptr_char++) {
*ptr_char = toupper(*ptr_char);
}
}
INDEX is a simple typedef of uint_fast32_t.
Now I can test the speed of these methods in my main.cpp:
#define TEST_RECURSIVE(_function) \
{ \
bool ok = true; \
clock_t before = clock(); \
for …
Our in-house program is written in C and makes extensive use of snprintf() in many places. While debugging with perf record/report, I noticed that it spends a lot of time in the following:
│ _IO_vfprintf_internal():
│   mov    -0x510(%rbp),%rdx
│   mov    %r12,%rsi
│   mov    %r15,%rdi
│ → callq  *0x38(%rax)
│   cmp    %rax,-0x510(%rbp)
│   mov    -0x530(%rbp),%r9
│ ↑ jne    91a
│   mov    -0x4d0(%rbp),%esi
│   mov    -0x540(%rbp),%ecx
│   mov    $0x7fffffff,%eax
│   sub    %esi,%eax
│   add    %esi,%ecx
│   cltq
│   cmp    %rax,-0x510(%rbp)
│ ↑ jbe    252b
│ ↑ jmpq   28f0
│4a70: xor    %eax,%eax
…
Here is my code.
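foo() is simple. It compares its argument against a handful of strings, one by one.
main() is slightly more complicated. It just calls foo() with different strings and times the calls, 3 times.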
#include <string.h>
#include <time.h>
#include <stdio.h>
int foo(const char *s)
{
int r = 0;
if (strcmp(s, "a11111111") == 0) {
r = 1;
} else if (strcmp(s, "b11111111") == 0) {
r = 2;
} else if (strcmp(s, "c11111111") == 0) {
r = 3;
} else if (strcmp(s, "d11111111") == 0) {
r = 4;
} else if (strcmp(s, "e11111111") == 0) {
r = 5;
} else if (strcmp(s, …
Given:
#include <stdbool.h>
#include <string.h>
bool test_data(void *data)
{
    return memcmp(data, "abcd", 4) == 0;
}
The compiler can optimize it into:
test_data:
cmpl $1684234849, (%rdi)
sete %al
ret
That's nice.
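The immediate 1684234849 in that cmpl is simply the four bytes of "abcd" read as one little-endian 32-bit integer, i.e. 0x64636261. A small sketch that reproduces the constant follows; pack_le32 is a hypothetical helper written for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine four bytes into a 32-bit value the way a little-endian load
   would, independent of the byte order of the machine running this. */
uint32_t pack_le32(const unsigned char *b)
{
    assert(b != NULL);
    return (uint32_t)b[0]
         | (uint32_t)b[1] << 8
         | (uint32_t)b[2] << 16
         | (uint32_t)b[3] << 24;
}
```

With 'a' = 0x61, 'b' = 0x62, 'c' = 0x63 and 'd' = 0x64, the packed value is 0x64636261, which is 1684234849 in decimal.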
But if I use my own memcmp() (rather than the one from <string.h>), the compiler cannot optimize it into a single cmpl instruction. Instead, it does this:
test_data:
cmpb $97, (%rdi)
jne .L5
cmpb $98, 1(%rdi)
jne .L5
cmpb $99, 2(%rdi)
jne .L5
cmpb $100, 3(%rdi)
sete %al
ret
.L5:
xorl %eax, %eax
ret
Link: https://godbolt.org/z/Kfhchr45a
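The hand-written memcmp() itself is not reproduced in this excerpt. As an assumption about what such a replacement typically looks like, here is a minimal byte-wise version; my_memcmp and test_data_custom are hypothetical names chosen to avoid clashing with <string.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hand-written replacement for memcmp: compares byte by
   byte and returns as soon as the first difference is found. */
int my_memcmp(const void *lhs, const void *rhs, size_t n)
{
    assert(n == 0 || (lhs != NULL && rhs != NULL));
    const unsigned char *a = lhs;
    const unsigned char *b = rhs;
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }
    return 0;
}

/* Same check as test_data, but routed through the custom function. */
bool test_data_custom(const void *data)
{
    return my_memcmp(data, "abcd", 4) == 0;
}
```

Written this way, the function never reads bytes after the first mismatch, which matches the shape of the byte-by-byte compare-and-branch sequence in the second listing.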
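I recently decided to play around with Java's new incubating Vector API to see how fast it can get. I implemented two fairly simple methods, one for parsing an int and one for finding the index of a character in a string. In both cases, my vectorized methods were incredibly slow compared to their scalar equivalents.
Here is my code: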
public class SIMDParse {
private static IntVector mul = IntVector.fromArray(
IntVector.SPECIES_512,
new int[] {0, 0, 0, 0, 0, 0, 1000000000, 100000000, 10000000, 1000000, 100000, 10000, 1000, 100, 10, 1},
0
);
private static byte zeroChar = (byte) '0';
private static int width = IntVector.SPECIES_512.length();
private static byte[] filler;
static {
filler = new byte[16];
for (int i = 0; i < 16; i++) {
filler[i] = zeroChar;
}
}
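public static int parseInt(String str) …
How can I coax the GCC compiler into emitting the REPE CMPSB instruction from plain C, without the "asm" and "_emit" keywords, calls to an included library, or compiler intrinsics?
I tried some C code like the one listed below, but without success: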
unsigned int repe_cmpsb(unsigned char *esi, unsigned char *edi, unsigned int ecx) {
    for (; ((*esi == *edi) && (ecx != 0)); esi++, edi++, ecx--);
    return ecx;
}
See how GCC compiles it at this link: https://godbolt.org/g/obJbpq
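P.S. I realize there is no guarantee that the compiler will compile C code in any particular way, but I'd still like to coax it for fun, just to see how smart it is.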
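One detail worth noting is that the posted loop tests *esi == *edi before ecx != 0, so it reads a byte from each buffer even when ecx is 0, and it does not decrement ecx on the mismatching byte. The instruction itself checks the counter first and decrements it on every comparison. The sketch below models that order in C; repe_cmpsb_model is a hypothetical name, and nothing here guarantees that GCC will turn it into REPE CMPSB.

```c
#include <assert.h>
#include <stddef.h>

/* Models the documented order of REPE CMPSB: stop if the counter is zero,
   otherwise compare one byte pair, advance both pointers, decrement the
   counter, and stop after the first unequal pair. */
unsigned int repe_cmpsb_model(const unsigned char *esi,
                              const unsigned char *edi,
                              unsigned int ecx)
{
    assert(ecx == 0 || (esi != NULL && edi != NULL));
    while (ecx != 0) {
        int equal = (*esi == *edi);
        esi++;
        edi++;
        ecx--;
        if (!equal)
            break;
    }
    return ecx;
}
```

As with the instruction, a return value of 0 alone does not distinguish "all bytes equal" from "mismatch on the last byte"; the real instruction reports that through the zero flag, which this model does not expose.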