Posts by Chi*_*ipK

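What do you do without fast gather and scatter in AVX2 instructions?

I'm writing a program to detect prime numbers. One part of it is sieving out possible candidates with a bit sieve. I've written a fairly fast program, but I thought I'd see if anyone has better ideas. My program could use fast gather and scatter instructions, but I'm limited to AVX2 hardware on the x86 architecture (I know AVX-512 has these, though I'm not sure how fast they are).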

#include <stdint.h>
#include <immintrin.h>

#define USE_AVX2

// Sieve the bits in array sieveX for later use
void sieveFactors(uint64_t *sieveX)
{
    const uint64_t totalX = 5000000;
#ifdef USE_AVX2
    uint64_t indx[4], bits[4];

    const __m256i sieveX2 = _mm256_set1_epi64x((uint64_t)(sieveX));
    const __m256i total = _mm256_set1_epi64x(totalX - 1);
    const __m256i mask = _mm256_set1_epi64x(0x3f);

    // Just filling with some typical values (not really constant)
    __m256i ans = _mm256_set_epi64x(58, 52, 154, 1);
    __m256i ans2 = _mm256_set_epi64x(142, 70, 136, 100);

    __m256i sum = _mm256_set_epi64x(201, 213, 219, 237);    // …

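AVX2 provides gather loads but no scatter stores, so one common workaround for the bit-setting step is to compute the word indices and bit masks in vector registers, spill them to small arrays, and finish with scalar read-modify-write operations. The following is a minimal sketch of that idea, not the code from the question: the helper name `set_bits4` and its signature are hypothetical, and it falls back to plain C when compiled without AVX2 support.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper (not from the question): set the bits at four
   bit positions in a 64-bit-word bit array. */
void set_bits4(uint64_t *sieve, const uint64_t pos[4])
{
#if defined(__AVX2__)
    uint64_t word[4], bit[4];
    const __m256i p = _mm256_loadu_si256((const __m256i *)pos);

    /* Word index of each position: pos / 64. */
    const __m256i w = _mm256_srli_epi64(p, 6);

    /* Bit mask of each position: 1 << (pos % 64), per-lane variable shift. */
    const __m256i b = _mm256_sllv_epi64(
        _mm256_set1_epi64x(1),
        _mm256_and_si256(p, _mm256_set1_epi64x(0x3f)));

    /* No scatter in AVX2: spill both vectors and finish with scalar ORs. */
    _mm256_storeu_si256((__m256i *)word, w);
    _mm256_storeu_si256((__m256i *)bit, b);
    for (int i = 0; i < 4; i++)
        sieve[word[i]] |= bit[i];
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 4; i++)
        sieve[pos[i] >> 6] |= (uint64_t)1 << (pos[i] & 0x3f);
#endif
}
```

Because the final ORs are scalar and sequential, the sketch still gives the right result when two lanes land in the same 64-bit word, a case that a store-only scatter would need extra care to handle.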

algorithm optimization performance simd avx2

Chi*_*ipK

2018 07-03

7
votes

2
answers

1460
views

Fast SSE threshold algorithm

I'm trying to come up with a very fast threshold algorithm using SSE to replace this:

uint8_t *pSrc, *pDst;

// Assume pSrc and pDst point to valid data

// Handle left edge
*pDst++ = *pSrc++;

// Likeness filter
for (uint32_t k = 2; k < width; k++, pSrc++, pDst++) {
    if ((*pDst - *pSrc) * (*pDst - *pSrc) > 100 /*THRESHOLD_SQUARED*/) {
        *pDst = *pSrc;
    }
}

// Handle right edge
*pDst++ = *pSrc++;


This is what I have so far:

const uint8_t THRESHOLD = 10;

__attribute__((aligned (16))) static const uint8_t mask[16] = {
    THRESHOLD, THRESHOLD, THRESHOLD, THRESHOLD,
    THRESHOLD, THRESHOLD, THRESHOLD, THRESHOLD, …

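Because the pixels are unsigned bytes, the test `(dst - src)^2 > 100` is equivalent to `|dst - src| > 10`, and the absolute difference can be built from two saturating subtractions. A minimal SSE2 sketch under that observation follows. It is not the code from the question: the function name `likeness_filter` and its parameters are hypothetical, it covers only the interior loop (the edge copies are left out), and it uses unaligned loads for simplicity.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (not from the question): for each element, replace
   dst[i] with src[i] when |dst[i] - src[i]| > threshold. */
void likeness_filter(uint8_t *dst, const uint8_t *src, size_t n,
                     uint8_t threshold)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i thr = _mm_set1_epi8((char)threshold);
    const __m128i zero = _mm_setzero_si128();

    for (; i + 16 <= n; i += 16) {
        const __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        const __m128i d = _mm_loadu_si128((const __m128i *)(dst + i));

        /* |d - s|: one of the two saturating differences is always zero. */
        const __m128i diff =
            _mm_or_si128(_mm_subs_epu8(d, s), _mm_subs_epu8(s, d));

        /* keep = 0xFF where diff <= threshold (saturating diff - thr == 0). */
        const __m128i keep = _mm_cmpeq_epi8(_mm_subs_epu8(diff, thr), zero);

        /* Select dst where keep is set, src elsewhere. */
        const __m128i out = _mm_or_si128(_mm_and_si128(keep, d),
                                         _mm_andnot_si128(keep, s));
        _mm_storeu_si128((__m128i *)(dst + i), out);
    }
#endif
    /* Scalar tail (and full fallback when SSE2 is unavailable). */
    for (; i < n; i++) {
        const int diff = (int)dst[i] - (int)src[i];
        if (diff * diff > (int)threshold * (int)threshold)
            dst[i] = src[i];
    }
}
```

The unsigned saturating subtraction stands in for an unsigned byte compare, which SSE2 does not offer directly.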

algorithm optimization performance sse simd

Chi*_*ipK

2014 10-27

6
votes

1
answer

1324
views

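Faster lookup tables using AVX2

I'm trying to speed up an algorithm that performs a series of table lookups. I'd like to use SSE2 or AVX2. I tried the _mm256_i32gather_epi32 intrinsic, but it was 31% slower. Does anyone have suggestions for improvements or a different approach?

Timings: C code = 234, gathers = 340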

static const int32_t g_tables[2][64];  // values between 0 and 63

template <int8_t which, class T>
static void lookup_data(int16_t * dst, T * src)
{
    const int32_t * lut = g_tables[which];

    // Leave this code for Broadwell or Skylake since it's 31% slower than C code
    // (gather is 12 for Haswell, 7 for Broadwell and 5 for Skylake)

#if 0
    if (sizeof(T) == sizeof(int16_t)) {
        __m256i avx0, avx1, avx2, avx3, avx4, avx5, avx6, avx7; …

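One alternative to gather that is sometimes used for small tables is a byte shuffle: a 64-entry table whose values fit in a byte can be split into four 16-byte chunks and looked up with `_mm256_shuffle_epi8`, selecting the chunk from bits 4 and 5 of each index. The sketch below assumes the table has been converted to `uint8_t` and the indices have been packed to bytes in the range 0 to 63, neither of which is shown in the question. `lookup64_u8` is a hypothetical name, and whether this beats the scalar code would need to be measured.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper (not from the question): dst[i] = lut[src[i]] for a
   64-entry byte table. Assumes every src[i] is in 0..63. */
void lookup64_u8(uint8_t *dst, const uint8_t *src, size_t n,
                 const uint8_t lut[64])
{
    size_t i = 0;
#if defined(__AVX2__)
    /* Broadcast each 16-byte chunk of the table to both 128-bit lanes,
       because the 256-bit byte shuffle works within each lane. */
    __m256i chunk[4];
    for (int k = 0; k < 4; k++)
        chunk[k] = _mm256_broadcastsi128_si256(
            _mm_loadu_si128((const __m128i *)(lut + 16 * k)));

    const __m256i hi_mask = _mm256_set1_epi8(0x30);

    for (; i + 32 <= n; i += 32) {
        const __m256i idx = _mm256_loadu_si256((const __m256i *)(src + i));
        const __m256i hi = _mm256_and_si256(idx, hi_mask);
        __m256i out = _mm256_setzero_si256();

        for (int k = 0; k < 4; k++) {
            /* 0xFF in the bytes whose index falls in chunk k. */
            const __m256i sel =
                _mm256_cmpeq_epi8(hi, _mm256_set1_epi8((char)(k << 4)));
            /* The shuffle uses only the low 4 index bits when bit 7 is clear. */
            const __m256i val = _mm256_shuffle_epi8(chunk[k], idx);
            out = _mm256_or_si256(out, _mm256_and_si256(sel, val));
        }
        _mm256_storeu_si256((__m256i *)(dst + i), out);
    }
#endif
    /* Scalar tail (and full fallback when AVX2 is unavailable). */
    for (; i < n; i++)
        dst[i] = lut[src[i] & 63];
}
```

This handles 32 lookups per iteration with four shuffles instead of issuing gathers, at the cost of narrowing the data to bytes first.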

algorithm optimization performance sse simd

Chi*_*ipK

2016 03-04

5
votes

1
answer

2238
views