Posts by Huy*_* Le

C++: is negative indexing from a pointer into the middle of an array defined behavior?

#include <iostream>
using namespace std;

const int BUFSIZE = 1 << 20;
char padded_buffer[64 + BUFSIZE + 64];
char* buffer = padded_buffer + 64;

int main()
{
    buffer[-1] = '?';
    // is that always equivalent to padded_buffer[63] = '?' ?
    cout << padded_buffer[63] << "\n";
    return 0;
}


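I have a piece of code like the one above. Basically, for some reason, I need to "isolate" both ends of the array.

But I wonder whether the syntax above is safe. I know negative indexing is generally undefined behavior, but what about this case?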
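A minimal sketch of the pointer arithmetic in question, assuming the pointer never leaves the underlying array object. `buffer[-1]` is by definition `*(buffer + (-1))`, so with `buffer` pointing at element 64 it names element 63 of the same array. The sizes and the helper name `negative_index_aliases_padding` are illustrative and not from the original post.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative sizes, much smaller than the 1 << 20 buffer in the question.
constexpr std::size_t kPad = 64;
constexpr std::size_t kBufSize = 256;

char padded_buffer[kPad + kBufSize + kPad];
char* buffer = padded_buffer + kPad;  // points at element 64 of the same array

// Returns true when buffer[-1] and padded_buffer[63] name the same object.
bool negative_index_aliases_padding() {
    buffer[-1] = '?';  // *(buffer + (-1)), still inside padded_buffer
    return &buffer[-1] == &padded_buffer[kPad - 1] &&
           padded_buffer[kPad - 1] == '?';
}
```

The index only becomes a problem once `buffer + i` would point before `padded_buffer[0]` or more than one past its last element.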

c++ arrays string indexing undefined-behavior

Huy*_* Le

2022 10-20

9
votes

1
answer

417
views

Why does this spinlock need memory_order_acquire_release instead of just acquire?

// spinlockAcquireRelease.cpp

#include <atomic>
#include <thread>

class Spinlock{
  std::atomic_flag flag;
public:
  Spinlock(): flag(ATOMIC_FLAG_INIT) {}

  void lock(){
    while(flag.test_and_set(std::memory_order_acquire) ); // line 12
  }

  void unlock(){
    flag.clear(std::memory_order_release);
  }
};

Spinlock spin;

void workOnResource(){
  spin.lock();
  // shared resource
  spin.unlock();
}


int main(){

  std::thread t(workOnResource);
  std::thread t2(workOnResource);

  t.join();
  t2.join();

}


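The accompanying explanation says:

If more than two threads use the spinlock, the acquire semantics of the lock method are not sufficient. Now the lock method is an acquire-release operation. So the memory model in line 12 [the call to flag.test_and_set(std::memory_order_acquire)] has to be changed to std::memory_order_acq_rel.

Why does this spinlock work for 2 threads but not for more than 2 threads? What is an example of code that makes this spinlock go wrong?

Source: https://www.modernnescpp.com/index.php/acquire-release-semantic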
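A small stress harness is one way to look for a counterexample. The sketch below keeps the acquire/release orderings from the question and lets the caller pick the number of threads; `run_stress` and its parameters are hypothetical names chosen for illustration. A passing run cannot prove that the ordering is sufficient, it can only fail to find a problem.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Same orderings as the Spinlock in the question.
class Spinlock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire)) {}  // spin
    }
    void unlock() { flag.clear(std::memory_order_release); }
};

// Hypothetical harness: n_threads threads each increment a plain
// (non-atomic) counter `iters` times while holding the lock.
long run_stress(int n_threads, int iters) {
    Spinlock spin;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t) {
        threads.emplace_back([&spin, &counter, iters] {
            for (int i = 0; i < iters; ++i) {
                spin.lock();
                ++counter;  // protected shared resource
                spin.unlock();
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}
```

If the lock provides mutual exclusion and visibility, `run_stress(4, 10000)` returns 40000. In the C++ memory model a read-modify-write such as `test_and_set` continues the release sequence started by an earlier `clear`, which suggests acquire on `lock` is enough regardless of the thread count.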

c++ multithreading mutex atomic spinlock

Huy*_* Le

2022 02-02

7
votes

1
answer

421
views

Assembly: why is the combination "lea eax, [eax + eax*const]; shl eax, eax, const;" faster than "imul eax, eax, const" according to gcc -O2?

I'm using godbolt to get the assembly of the following program:

#include <stdio.h>
volatile int a = 5;
volatile int res = 0;
int main() {
    res = a * 36;
    return 1;
}


If I use -Os optimization, the generated code is what you would expect:

mov     eax, DWORD PTR a[rip]
imul    eax, eax, 36
mov     DWORD PTR res[rip], eax


But if I use -O2, the generated code is this:

mov     eax, DWORD PTR a[rip]
lea     eax, [rax+rax*8]
sal     eax, 2
mov     DWORD PTR res[rip], eax


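So instead of multiplying 5*36, it does 5 -> 5+5*8=45 -> 45*4 = 180. I assume this is because 1 imul is slower than 1 lea + 1 shift left.

But the lea instruction needs to compute rax+rax*8, which contains 1 addition …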
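The decomposition used at -O2 relies on 36 = 9 * 4: the `lea` forms x + x*8 and the shift multiplies by 4. A minimal sketch that mirrors both instruction sequences in C++ and checks they agree; the function names are made up for illustration, and unsigned arithmetic is used so that wrap-around is well defined.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors `lea eax, [rax+rax*8]` followed by `sal eax, 2`.
constexpr std::uint32_t times36_lea_shift(std::uint32_t x) {
    const std::uint32_t times9 = x + x * 8;  // lea: base + index*scale
    return times9 << 2;                      // sal: multiply by 4
}

// Mirrors `imul eax, eax, 36`.
constexpr std::uint32_t times36_imul(std::uint32_t x) { return x * 36; }

static_assert(times36_lea_shift(5) == 180, "5 -> 45 -> 180");
static_assert(times36_lea_shift(5) == times36_imul(5), "same result");
```

Both versions produce identical results modulo 2^32, so the compiler is free to pick whichever sequence its cost model prefers for the target.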

c optimization assembly x86-64 cpu-architecture

Huy*_* Le

2021 12-13

6
votes

1
answer

711
views

Most insanely fast way to convert 9 char digits into an int or unsigned int

#include <stdio.h>
#include <iostream>
#include <string>
#include <chrono>
#include <memory>
#include <cstdlib>
#include <cstdint>
#include <cstring>
#include <immintrin.h>
using namespace std;

const int p[9] =   {1, 10, 100, 
                    1000, 10000, 100000, 
                    1000000, 10000000, 100000000};
                    
class MyTimer {
 private:
  std::chrono::time_point<std::chrono::steady_clock> starter;

 public:
  void startCounter() {
    starter = std::chrono::steady_clock::now();
  }

  int64_t getCounterNs() {    
    return std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - starter).count();
  }
};
                    
int convert1(const char *a) {
    int res = 0;
    for (int i=0; i<9; i++) res = res * 10 + a[i] - 48; …
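The listing above is cut off, so the following is only a sketch of one scalar variation on the digit-by-digit loop, not the approach from the original post. It weights the raw ASCII bytes by powers of ten and subtracts the combined '0' bias once at the end, which removes the serial `res * 10` dependency. The names `convert_loop` and `convert_bias` are hypothetical, and whether the variant is actually faster has to be measured.

```cpp
#include <cassert>
#include <cstdint>

// Baseline: digit by digit, the same idea as convert1 in the question.
inline std::uint32_t convert_loop(const char* a) {
    std::uint32_t res = 0;
    for (int i = 0; i < 9; ++i) {
        res = res * 10 + static_cast<std::uint32_t>(a[i] - '0');
    }
    return res;
}

// Variant: multiply raw ASCII codes by their place value and remove the
// accumulated '0' offset in one subtraction. All arithmetic is unsigned,
// so intermediate wrap-around modulo 2^32 is well defined and cancels out.
inline std::uint32_t convert_bias(const char* a) {
    constexpr std::uint32_t kPow10[9] = {100000000u, 10000000u, 1000000u,
                                         100000u,    10000u,    1000u,
                                         100u,       10u,       1u};
    constexpr std::uint32_t kBias = static_cast<std::uint32_t>('0') * 111111111u;
    std::uint32_t sum = 0;
    for (int i = 0; i < 9; ++i) {
        sum += static_cast<std::uint32_t>(static_cast<unsigned char>(a[i])) * kPow10[i];
    }
    return sum - kBias;
}
```

Both functions assume exactly nine ASCII digits and do no validation.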


c++ optimization assembly sse x86-64

Huy*_* Le

2023 03-09

6
votes

2
answers

1363
views

C++17: create directories automatically from a file path

#include <iostream>
#include <fstream>
using namespace std;
    
int main()
{
    ofstream fo("output/folder1/data/today/log.txt");
    fo << "Hello world\n";
    fo.close();
    
    return 0;
}


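I need to output some log data to files with variable names. However, ofstream does not create the directories along the way; if the file's path does not exist, ofstream writes nothing at all!

How can I automatically create the folders along a file path? The system is Ubuntu only.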
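One possible approach in C++17 is to call `std::filesystem::create_directories` on the parent of the target path before opening the stream. The sketch below wraps that in a helper; the name `write_with_parents` and its return convention are assumptions made for illustration.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: creates any missing parent directories,
// then writes `text` to `file`. Returns false on any failure.
bool write_with_parents(const fs::path& file, const std::string& text) {
    std::error_code ec;
    if (file.has_parent_path()) {
        // No error is reported if the directories already exist.
        fs::create_directories(file.parent_path(), ec);
        if (ec) return false;
    }
    std::ofstream out(file);
    if (!out) return false;
    out << text;
    return static_cast<bool>(out);
}
```

For the path in the question this would be called as `write_with_parents("output/folder1/data/today/log.txt", "Hello world\n")`. Older GCC releases (before 9) may additionally need `-lstdc++fs` at link time.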

c++ filesystems ubuntu file filepath

Huy*_* Le

2022 07-27

6
votes

1
answer

10k
views

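Does cudaMalloc() initialize the array to 0?

Or do I need to call cudaMemset() if I want to make sure the array contains all 0s? I couldn't find this in the documentation. Thanks.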

c c++ malloc cuda memset

Huy*_* Le

lucky-day

5
votes

1
answer

3007
views

Fast integer division and modulo with a const runtime divisor

int n_attrs = some_input_from_other_function(); // [2..5000]
vector<int> corr_indexes; // size = n_attrs * n_attrs
vector<char> selected; // size = n_attrs
vector<pair<int,int>> selectedPairs; // size = n_attrs / 2
// vector::reserve everything here
...
// optimize the code below
const int npairs = n_attrs * n_attrs;
selectedPairs.clear();
for (int i = 0; i < npairs; i++) {
    const int x = corr_indexes[i] / n_attrs;
    const int y = corr_indexes[i] % n_attrs;
    if (selected[x] || selected[y]) continue; // fit inside L1 cache …
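One common way to speed up repeated `/` and `%` by a divisor that is only known at run time is to precompute a 64-bit reciprocal once and replace each division with multiplications. The sketch below follows the "fastmod" technique described by Lemire, Kaser and Kurz for 32-bit operands; the `FastDivisor` type is a hypothetical wrapper, and the code relies on the `unsigned __int128` extension of GCC and Clang. It is not taken from the original post.

```cpp
#include <cassert>
#include <cstdint>

__extension__ typedef unsigned __int128 u128;  // GCC/Clang extension

// Precomputed state for one divisor d, intended for 2 <= d < 2^32.
struct FastDivisor {
    std::uint64_t magic;  // ceil(2^64 / d)
    std::uint32_t d;

    explicit FastDivisor(std::uint32_t divisor)
        : magic(UINT64_MAX / divisor + 1), d(divisor) {
        assert(divisor >= 2);  // magic would overflow to 0 for d == 1
    }

    // a / d: high 64 bits of magic * a.
    std::uint32_t div(std::uint32_t a) const {
        return static_cast<std::uint32_t>((static_cast<u128>(magic) * a) >> 64);
    }

    // a % d: take the fractional part of a / d and scale it back by d.
    std::uint32_t mod(std::uint32_t a) const {
        const std::uint64_t low = magic * a;  // wraps modulo 2^64 on purpose
        return static_cast<std::uint32_t>((static_cast<u128>(low) * d) >> 64);
    }
};
```

In the loop from the question, a `FastDivisor fd(n_attrs)` would be built once before the loop, and `fd.div(corr_indexes[i])` and `fd.mod(corr_indexes[i])` would stand in for the two division operations. Whether this beats the hardware divider on a given CPU needs to be benchmarked.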


c++ math optimization cuda integer-division

Huy*_* Le

2022 09-10

5
votes

1
answer

2232
views

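Fastest way to mask out bytes above a separator position using SIMD

uint8_t data[] = "mykeyxyz:1234\nky:123\n...";. My string lines have the format key:value, where every line is guaranteed to have len(key) <= 16. I want to load mykeyxyz into a __m128i, but with the higher positions filled with 0.

The simplest way is to use an array of 255-or-0 masks, but that requires another memory load. Is there a faster way to do this?

The accepted answer made the total program time about 2% faster. To compare, test 1brc_valid13.cpp against 1brc_valid14.cpp (which uses the accepted answer). Hardware: AMD 2950X, Ubuntu 18.04, g++ 11.4, compile command: g++ -o main 1brc_final_valid.cpp -O3 -std=c++17 -march=native -m64 -lpthread

Edit: preferably without AVX512.

Edit 2: I need the variable len so that I can start parsing the value part.

Edit 3: The function will be used in a loop (for example, parsing 1 million lines of text). But strcmp_mask will basically always be in L1 cache.

Edit 4: I benchmark the functions by parsing 1 billion (key,value) lines and processing them. You can download the code/data and reproduce the results in my repository: https://github.com/lehuyduc/1brc-simd. The discussion thread there also contains more information.

Edit 5: I tested maskafterc256 and found that it made my code 50x slower! If I replace _mm256_set_epi8 with …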
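A minimal SSE2 sketch of one way to get both the masked key and `len` without a mask table, offered as an illustration and not as the accepted answer. It finds the first ':' with a byte compare, derives `len` from the bitmask, and builds the keep-mask by comparing a constant index vector against `len`. The function name `load_key_masked` is hypothetical, `__builtin_ctz` is a GCC/Clang builtin, and the code assumes 16 bytes starting at `p` are readable. The index constant is itself a vector constant, which a compiler would typically hoist out of a loop, but that should be checked in the generated assembly.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2

// Loads 16 bytes at p, zeroes every byte at or after the first ':',
// and stores the key length (0..16) in len.
inline __m128i load_key_masked(const std::uint8_t* p, int& len) {
    const __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    const __m128i is_sep = _mm_cmpeq_epi8(chunk, _mm_set1_epi8(':'));
    // Bit 16 acts as a sentinel so that "no ':' in 16 bytes" yields len == 16.
    const unsigned bits =
        static_cast<unsigned>(_mm_movemask_epi8(is_sep)) | 0x10000u;
    len = __builtin_ctz(bits);
    const __m128i idx =
        _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
    // Byte i is kept when len > i.
    const __m128i keep = _mm_cmpgt_epi8(_mm_set1_epi8(static_cast<char>(len)), idx);
    return _mm_and_si128(chunk, keep);
}
```

For the sample data in the question this yields `len == 8` and a register holding `mykeyxyz` followed by eight zero bytes.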

c++ optimization assembly simd avx

Huy*_* Le

2024 01-30

5
votes

2
answers

383
views

C++ Linux: fastest way to measure time (faster than std::chrono)? Benchmark included

#include <iostream>
#include <chrono>
using namespace std;

class MyTimer {
 private:
  std::chrono::time_point<std::chrono::steady_clock> starter;
  std::chrono::time_point<std::chrono::steady_clock> ender;

 public:
  void startCounter() {
    starter = std::chrono::steady_clock::now();
  }

  double getCounter() {
    ender = std::chrono::steady_clock::now();
    return double(std::chrono::duration_cast<std::chrono::nanoseconds>(ender - starter).count()) /
           1000000;  // millisecond output
  }
  
  // timer need to have nanosecond precision
  int64_t getCounterNs() {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - starter).count();
  }
};

MyTimer timer1, timer2, timerMain;
volatile int64_t dummy = 0, res1 = 0, res2 = 0;

// time run without any time measure
void func0() { …


c++ linux optimization performance time

Huy*_* Le

2021 12-22

4
votes

1
answer

2718
views

C++ std::any: function to convert a std::any holding a C char array to a string

#include <iostream>
#include <any>
#include <string>
#include <vector>
#include <map>
using namespace std;

string AnyPrint(const std::any &value)
{   
    cout << size_t(&value) << ", " << value.type().name() << " ";
    if (auto x = std::any_cast<int>(&value)) {
        return "int(" + std::to_string(*x) + ")";
    }
    if (auto x = std::any_cast<float>(&value)) {
        return "float(" + std::to_string(*x) + ")";
    }
    if (auto x = std::any_cast<double>(&value)) {
        return "double(" + std::to_string(*x) + ")";
    }
    if (auto x = std::any_cast<string>(&value)) {
        return "string(\"" + (*x) + "\")"; …
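The listing above is truncated, so this is only a sketch of the string-like cases under one assumption: the `std::any` was constructed from a string literal or `char` array. `std::any` stores the decayed type, so `std::any("hi")` holds a `const char*` and not a `std::string` or an array, and `any_cast<std::string>` does not match it. The helper name `any_to_string` is hypothetical.

```cpp
#include <any>
#include <cassert>
#include <string>

// Hypothetical helper covering only string-like payloads.
std::string any_to_string(const std::any& value) {
    if (auto s = std::any_cast<std::string>(&value)) {
        return "string(\"" + *s + "\")";
    }
    // A string literal or const char[] decays to const char* when stored.
    if (auto p = std::any_cast<const char*>(&value)) {
        return "string(\"" + std::string(*p) + "\")";
    }
    // A non-const char[] decays to char*.
    if (auto p = std::any_cast<char*>(&value)) {
        return "string(\"" + std::string(*p) + "\")";
    }
    return "unknown";
}
```

Note that the stored pointer does not own the characters, so the pointed-to array has to outlive the `std::any`.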


c++ string c-strings rtti stdany

Huy*_* Le

2022 03-31

2
votes

1
answer

962
views

C: how to access struct members without repeating ->?

    struct Heap {
        int capacity;
        int heapSize;
        int *tree;     // the heap binary tree
        int *pos;       // pos[i] is the position of values[i] in items
        float *p;  // priority value of each heap element
    };

    void initHeap(struct Heap *heap, int capacity) {
        heap->capacity = capacity;
        heap->heapSize = 0;
        heap->tree = malloc(sizeof(int)*(capacity+1));
        heap->pos = malloc(sizeof(int)*(capacity+1));
        heap->p = malloc(sizeof(float)*(capacity+1));
    }

    void betterInit(struct Heap *heap, int capacity) {
        with (heap) { // doesn't exist
            capacity = capacity;
            heapSize = 0;
            tree = malloc(sizeof(int)*(capacity+1)); …


c struct pointers pass-by-reference

Huy*_* Le

2020 03-12

0
votes

1
answer

103
views