Related questions and solutions (0)

void histSubtractFromBits(uint64* cursor, uint16* hist){
    //traverse each bit of the 256-bit-long bitstring by splitting up into 4 bitsets
    std::bitset<64> a(*cursor);
    std::bitset<64> b(*(cursor+1));
    std::bitset<64> c(*(cursor+2));
    std::bitset<64> d(*(cursor+3));
    for(int bit = 0; bit < 64; bit++){
        hist[bit] -= a.test(bit);
    }
    for(int bit = 0; bit < 64; bit++){
        hist[bit+64] -= b.test(bit);
    }
    for(int bit = 0; bit < 64; bit++){
        hist[bit+128] -= c.test(bit);
    }
    for(int bit = 0; bit < 64; bit++){
        hist[bit+192] -= d.test(bit);
    }
}


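The actual gcc implementation of bitset range-checks the bit argument and then ANDs with a bitmask. I could do this without bitsets, using my own bit shifting and masking, but I am fairly certain that would not yield any significant speedup (tell me if I'm wrong, and why).

I'm not really familiar with x86-64 assembly, but I know there are …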
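The shift-and-mask alternative mentioned above can be sketched as follows. This is a minimal illustration, not a claim that it is faster than `std::bitset`; the function name `histSubtractFromBitsManual` and the use of `<cstdint>` fixed-width types are assumptions that do not appear in the question.

```cpp
#include <cassert>
#include <cstdint>

// Subtract each bit of a 256-bit string (four 64-bit words) from a
// 256-entry histogram, using plain shifts and masks instead of std::bitset.
// Assumes `words` points to 4 readable words and `hist` to 256 counters.
void histSubtractFromBitsManual(const std::uint64_t* words, std::uint16_t* hist) {
    assert(words != nullptr && hist != nullptr);
    for (int w = 0; w < 4; ++w) {
        const std::uint64_t x = words[w];
        std::uint16_t* h = hist + 64 * w;  // counters for this 64-bit word
        for (int bit = 0; bit < 64; ++bit) {
            // (x >> bit) & 1 is 0 or 1; no range check is needed here
            // because `bit` is always in [0, 64).
            h[bit] = static_cast<std::uint16_t>(h[bit] - ((x >> bit) & 1u));
        }
    }
}
```

Whether this differs measurably from the bitset version depends on what the compiler generates for each, so it is worth comparing the emitted assembly and timing both.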

c++ assembly gcc bitset

Gre*_*ida

lucky-day

4
votes

1
answer

190
views

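How can I speed up this histogram of LUT lookups?

First, I have an array int a[1000][1000]. All of these integers are between 0 and 32767, and they are known constants: they never change during a run of the program.

Second, I have an array b[32768] that contains integers between 0 and 31. I use this array to map each array in a to 32 bins: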

int bins[32]{};
for (auto e : a[i])//mapping a[i] to 32 bins.
    bins[b[e]]++;


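Each time, array b is reinitialized with a new mapping, and I need to hash all 1000 arrays in a (each containing 1000 integers) into 1000 arrays of 32 integers each, where each integer records how many elements fall into the corresponding bin.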

int new_array[32768] = {some new mapping};
copy(begin(new_array), end(new_array), begin(b));//reload array b;

int bins[1000][32]{};//output array to store results .
for (int i = 0; i < 1000;i++)
    for (auto e : a[i])//hashing a[i] …
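The per-row mapping described above can be sketched as a self-contained function. This is only a baseline under stated assumptions: the name `histogramRows`, the constant `kBins`, and the narrow element types (`std::uint16_t` for the values, `std::uint8_t` for the lookup table) are not from the question. The narrow table occupies 32 KiB instead of the 128 KiB a 4-byte `int` table would, which may help cache behavior, but that would need measuring.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBins = 32;  // assumption: every LUT value is in [0, kBins)

// For each row of `a`, count how many elements map to each bin through `lut`.
std::vector<std::array<int, kBins>> histogramRows(
        const std::vector<std::vector<std::uint16_t>>& a,
        const std::vector<std::uint8_t>& lut) {
    std::vector<std::array<int, kBins>> bins(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        bins[i].fill(0);
        for (std::uint16_t e : a[i]) {
            assert(e < lut.size() && lut[e] < kBins);
            ++bins[i][lut[e]];  // one table lookup, one counter increment
        }
    }
    return bins;
}
```

For example, with `lut = {5, 5, 7, 31}` and a row `{0, 1, 2, 2}`, the row's result has a count of 2 in bin 5 and 2 in bin 7.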


c++ optimization simd histogram

iou*_*vxz

2021 06-07

4
votes

1
answer

1441
views

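How can I optimize histogram statistics with NEON intrinsics?

I want to optimize histogram statistics code with NEON intrinsics, but I have not succeeded. Here is the C code: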

#define NUM (7*1024*1024)
uint8 src_data[NUM];
uint32 histogram_result[256] = {0};
for (int i = 0; i < NUM; i++)
{
    histogram_result[src_data[i]]++;
}


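Histogram statistics behave more like serial processing, which makes them hard to optimize with NEON intrinsics. Does anyone know how to optimize this? Thanks in advance.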
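One commonly used scalar approach to the serial dependency described above is to keep several partial histograms and sum them at the end, so that consecutive increments of the same value go to different tables. The sketch below uses no NEON intrinsics; the function name `histogram4` and the choice of four tables are assumptions, and any speedup on a particular ARM core would have to be measured.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte histogram using four partial tables. Each of four consecutive
// input bytes updates a different table; the tables are summed at the end.
std::array<std::uint32_t, 256> histogram4(const std::uint8_t* src, std::size_t n) {
    assert(src != nullptr || n == 0);
    std::array<std::array<std::uint32_t, 256>, 4> part{};  // zero-initialized
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        ++part[0][src[i]];
        ++part[1][src[i + 1]];
        ++part[2][src[i + 2]];
        ++part[3][src[i + 3]];
    }
    for (; i < n; ++i) ++part[0][src[i]];  // leftover bytes when n % 4 != 0

    std::array<std::uint32_t, 256> out{};
    for (std::size_t v = 0; v < 256; ++v)
        out[v] = part[0][v] + part[1][v] + part[2][v] + part[3][v];
    return out;
}
```

The final summation loop is a plain element-wise add of four arrays, which is the part of this scheme that maps naturally onto vector instructions.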

intrinsics neon

mao*_*ofu

lucky-day

3
votes

1
answer

1814
views

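Selectively XORing elements of a list with AVX2 instructions

I want to speed up the following operation with AVX2 instructions, but I have not been able to find a way to do it.

I am given a large array uint64_t data[100000] of uint64_t values and an array unsigned char indices[100000] of bytes. I want to output an array uint64_t Out[256] whose i-th value is the XOR of all data[j] such that indices[j] = i.

A straightforward implementation of what I want looks like this: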

uint64_t Out[256] = {0};     // initialize output array
for (i = 0; i < 100000 ; i++) {
    Out[Indices[i]] ^= data[i];
}


Can we implement this more efficiently with AVX2 instructions?

EDIT: this is what my code looks like now

uint64_t Out[256][4] = {0};   // initialize output array
for (i = 0; i < 100000 ; i+=4) {
    Out[Indices[i  ]][0] ^= data[i];
    Out[Indices[i+1]][1] ^= data[i+1];
    Out[Indices[i+2]][2] ^= data[i+2];
    Out[Indices[i+3]][3] ^= data[i+3];
}
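The four-lane version above leaves four partial results per index, which still have to be folded into one value each. A minimal sketch of the complete routine, including that fold and a tail loop for lengths that are not a multiple of 4, might look like this. The function name `xorByIndex` and the `std::array` return type are assumptions, and no AVX2 intrinsics are used here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// XOR-accumulate data[j] into slot indices[j], using four lanes per slot
// so that four consecutive elements never update the same memory location.
std::array<std::uint64_t, 256> xorByIndex(const std::uint64_t* data,
                                          const unsigned char* indices,
                                          std::size_t n) {
    assert((data != nullptr && indices != nullptr) || n == 0);
    std::array<std::array<std::uint64_t, 4>, 256> lanes{};  // zero-initialized
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        lanes[indices[i]][0]     ^= data[i];
        lanes[indices[i + 1]][1] ^= data[i + 1];
        lanes[indices[i + 2]][2] ^= data[i + 2];
        lanes[indices[i + 3]][3] ^= data[i + 3];
    }
    for (; i < n; ++i) lanes[indices[i]][0] ^= data[i];  // tail elements

    // Fold the four lanes of each slot into the final result.
    std::array<std::uint64_t, 256> out{};
    for (std::size_t k = 0; k < 256; ++k)
        out[k] = lanes[k][0] ^ lanes[k][1] ^ lanes[k][2] ^ lanes[k][3];
    return out;
}
```

Because XOR is associative and commutative, splitting the accumulation across lanes and folding afterwards gives the same result as the single-table loop.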


optimization x86 simd avx avx2

War*_*ens

2018 06-01

3
votes

1
answer

226
views

Race condition when using OpenMP atomic capture to build a 3D histogram of particles and create an index

There is a piece of code in my full program:

const unsigned int GL=8000000;
const int cuba=8;
const int cubn=cuba+cuba;
const int cub3=cubn*cubn*cubn;
int Length[cub3];
int Begin[cub3];
int Counter[cub3];
int MIndex[GL];
struct Particle{
  int ix,jy,kz;
  int ip;
};
Particle particles[GL];
int GetIndex(const Particle & p){return (p.ix+cuba+cubn*(p.jy+cuba+cubn*(p.kz+cuba)));}
...
#pragma omp parallel for
for(int i=0; i<cub3; ++i) Length[i]=Counter[i]=0;
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
  int ic=GetIndex(particles[i]);
  #pragma omp atomic update
  Length[ic]++;
}
Begin[0]=0;
#pragma omp single
for(int i=1; i<cub3; ++i) Begin[i]=Begin[i-1]+Length[i-1];
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
  if(particles[i].ip==3)
  {
    int ic=GetIndex(particles[i]);
    if(ic>cub3 || ic<0) printf("ic=%d out of range!\n",ic);
    int cnt=0;
  #pragma omp atomic capture
    cnt=Counter[ic]++;
    MIndex[Begin[ic]+cnt]=i;
  }
}
…
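The count, prefix-sum, and scatter pattern used in that code can be shown in a reduced form. The sketch below is not the asker's program and does not attempt to diagnose the reported race: the names `BucketIndex`, `buildBucketIndex`, and `cellOf` are hypothetical, and unlike the code above it counts and scatters the same set of items. It compiles with or without OpenMP enabled; without it the pragmas are ignored and the loops run serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// begin[c] is where bucket c starts in `order`;
// order[begin[c] .. begin[c+1]) holds the indices of the items in cell c.
struct BucketIndex {
    std::vector<int> begin;  // size cells + 1
    std::vector<int> order;  // size = number of items
};

BucketIndex buildBucketIndex(const std::vector<int>& cellOf, int cells) {
    const int n = static_cast<int>(cellOf.size());
    std::vector<int> length(cells, 0), counter(cells, 0);
    BucketIndex idx{std::vector<int>(cells + 1, 0), std::vector<int>(n, -1)};
    int* len = length.data();
    int* cnt = counter.data();

    // Pass 1: count items per cell; increments are atomic.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        assert(cellOf[i] >= 0 && cellOf[i] < cells);
        #pragma omp atomic update
        len[cellOf[i]]++;
    }

    // Serial exclusive prefix sum: start offset of each bucket.
    for (int c = 0; c < cells; ++c) idx.begin[c + 1] = idx.begin[c] + len[c];

    // Pass 2: each item atomically reserves a unique slot in its bucket.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        const int c = cellOf[i];
        int slot;
        #pragma omp atomic capture
        slot = cnt[c]++;
        idx.order[idx.begin[c] + slot] = i;
    }
    return idx;
}
```

With multiple threads, the order of items inside a bucket depends on scheduling, so results should be compared per bucket as sets rather than element by element.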


c++ multithreading atomic openmp histogram

And*_*And

2022 06-12

3
votes

1
answer

214
views

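Micro-optimization of a 4-bucket histogram of a large array or list

I have a special question. I will try to describe it as accurately as I can.

I am doing a very important "micro-optimization": a loop that runs for days at a time. If I can cut the time of this loop in half, 10 days would shrink to only 5 days, and so on.

The loop I have now is the function "testbenchmark1".

I have 4 indexes that I need to increment in a loop like this. But accessing an index from a list actually takes some extra time, as I have noticed. That is why I am wondering whether there is another solution.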

indexes[n]++; //increase correct index

The complete code for "testbenchmark1", which takes 122 ms:

void testbenchmark00()
{
    Random random = new Random();
    List<int> indexers = new List<int>();
    for (int i = 0; i < 9256408; i++)
    {
        indexers.Add(random.Next(0, 4));
    }
    int[] valueLIST = indexers.ToArray();


    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();

    int[] indexes = { 0, 0, 0, 0 };
    foreach (int n in valueLIST) //Takes 122 ms
    {
        indexes[n]++; //increase correct index
    }

    stopWatch.Stop();
    MessageBox.Show("stopWatch: " + stopWatch.ElapsedMilliseconds.ToString() …
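One way to avoid indexing a counter array inside the hot loop is to count three of the four values with comparisons into local variables and derive the fourth from the total. The sketch below is written in C++ rather than the question's C#, the function name `count4` is an assumption, and it makes no claim about how the timing compares with the 122 ms figure above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Counts occurrences of the values 0..3. Three comparison results are
// accumulated into locals; the count for value 3 is the remainder.
// Assumes every input value is in [0, 4).
std::array<std::size_t, 4> count4(const int* values, std::size_t n) {
    std::size_t c0 = 0, c1 = 0, c2 = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const int v = values[i];
        assert(v >= 0 && v < 4);
        c0 += (v == 0);  // a comparison yields 0 or 1
        c1 += (v == 1);
        c2 += (v == 2);
    }
    return {c0, c1, c2, n - c0 - c1 - c2};
}
```

Because the loop body has no data-dependent memory writes, a compiler is free to vectorize it, though whether it does depends on the compiler and flags.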


c# optimization simd histogram micro-optimization

And*_*eas

2020 04-10

1
vote

1
answer

385
views