相关疑难解决方法(0)

#include <cstdint>

constexpr bool noLambda = true;

void funnyEval(const uint8_t* columnData, uint64_t dataOffset, uint64_t dictOffset, int32_t iter, int32_t limit, int32_t* writer,const int32_t* dictPtr2){
   // Computation X1
   const int32_t* dictPtr = reinterpret_cast<const int32_t*>(columnData + dictOffset);
   // Computation X2
   const uint16_t* data = (const uint16_t*)(columnData + dataOffset);
   // 1. The less broken solution without lambda
   if (noLambda) {
        for (;iter != limit;++iter){
            int32_t t=dictPtr[data[iter]];
            *writer = t;
            writer++;
        }
   }
   // 2. The totally broken solution with lambda
   else …

Run Code Online (Sandbox Code Playgroud)

c++ optimization assembly gcc compiler-optimization

gex*_*ide

2018 06-03

34
推荐指数

2
解决办法

873
查看次数

Haswell/Skylake的部分寄存器究竟如何表现？写AL似乎对RAX有假依赖,而AH是不一致的

此循环在英特尔Conroe/Merom上每3个周期运行一次,imul按预期方式在吞吐量方面存在瓶颈.但是在Haswell/Skylake上,它每11个循环运行一次,显然是因为setnz al它依赖于最后一个循环imul.

; synthetic micro-benchmark to test partial-register renaming
    mov     ecx, 1000000000
.loop:                 ; do{
    imul    eax, eax     ; a dep chain with high latency but also high throughput
    imul    eax, eax
    imul    eax, eax

    dec     ecx          ; set ZF, independent of old ZF.  (Use sub ecx,1 on Silvermont/KNL or P4)
    setnz   al           ; ****** Does this depend on RAX as well as ZF?
    movzx   eax, al
    jnz  .loop         ; }while(ecx);

Run Code Online (Sandbox Code Playgroud)

如果setnz al …

x86 assembly intel cpu-architecture micro-optimization

Pet*_*des

2017 08-21

30
推荐指数

2
解决办法

1537
查看次数

x86的MOV真的可以"免费"吗？为什么我不能重现这个呢？

我一直看到人们声称MOV指令可以在x86中免费,因为寄存器重命名.

对于我的生活,我无法在一个测试用例中验证这一点.每个测试用例我尝试揭穿它.

例如,这是我用Visual C++编译的代码:

#include <limits.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    unsigned int k, l, j;
    clock_t tstart = clock();
    for (k = 0, j = 0, l = 0; j < UINT_MAX; ++j)
    {
        ++k;
        k = j;     // <-- comment out this line to remove the MOV instruction
        l += j;
    }
    fprintf(stderr, "%d ms\n", (int)((clock() - tstart) * 1000 / CLOCKS_PER_SEC));
    fflush(stderr);
    return (int)(k + j + l);
}

Run Code Online (Sandbox Code Playgroud)

这为循环生成以下汇编代码(随意生成这个你想要的;你显然不需要Visual C++):

LOOP:
    add edi,esi
    mov …

Run Code Online (Sandbox Code Playgroud)

c x86 assembly cpu-registers micro-optimization

Meh*_*dad

2017 05-26

23
推荐指数

2
解决办法

2113
查看次数

在SIMD中矢量化直方图的方法？

我正在尝试在霓虹灯中实现直方图.矢量化是否可能？

arm image-processing simd histogram neon

Rug*_*ger

lucky-day

14
推荐指数

2
解决办法

2910
查看次数

如果没有快速收集和分散AVX2指令,你会怎么做？

我正在编写一个程序来检测素数.其中一部分是筛选可能的候选人.我写了一个相当快的程序,但我想我会看到是否有人有更好的想法.我的程序可以使用一些快速收集和分散指令,但我只限于AVX2硬件用于x86架构(我知道AVX-512有这些但我不确定它们有多快).

#include <stdint.h>
#include <immintrin.h>

#define USE_AVX2

// Sieve the bits in array sieveX for later use
void sieveFactors(uint64_t *sieveX)
{
    const uint64_t totalX = 5000000;
#ifdef USE_AVX2
    uint64_t indx[4], bits[4];

    const __m256i sieveX2 = _mm256_set1_epi64x((uint64_t)(sieveX));
    const __m256i total = _mm256_set1_epi64x(totalX - 1);
    const __m256i mask = _mm256_set1_epi64x(0x3f);

    // Just filling with some typical values (not really constant)
    __m256i ans = _mm256_set_epi64x(58, 52, 154, 1);
    __m256i ans2 = _mm256_set_epi64x(142, 70, 136, 100);

    __m256i sum = _mm256_set_epi64x(201, 213, 219, 237);    // …

Run Code Online (Sandbox Code Playgroud)

algorithm optimization performance simd avx2

Chi*_*ipK

2018 07-03

7
推荐指数

2
解决办法

1460
查看次数