Related questions and solutions (0)

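How is POPCNT implemented in hardware?

According to http://www.agner.org/optimize/instruction_tables.pdf, the POPCNT instruction (which returns the number of set bits in a 32-bit or 64-bit register) has a throughput of 1 instruction per clock cycle on modern Intel and AMD processors. This is much faster than any software implementation, which needs multiple instructions (see "How to count the number of set bits in a 32-bit integer?").

How is POPCNT implemented so efficiently in hardware?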
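For comparison, here is a minimal sketch of the kind of multi-instruction software implementation the question refers to: the well-known parallel-sum (SWAR) bit count. The function name `popcount32_swar` is made up for illustration and does not come from the question.

```c
#include <stdint.h>

/* Software bit count: sums neighbouring bit fields in parallel.
 * Hypothetical name, shown only to illustrate the instruction count. */
static inline unsigned popcount32_swar(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                 /* 2-bit partial sums */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); /* 4-bit partial sums */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit partial sums */
    return (v * 0x01010101u) >> 24;                   /* add the four bytes */
}
```

Even in this compact form the count takes roughly a dozen shift, mask, add and multiply operations, which is the gap the single POPCNT instruction closes.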

hardware x86 assembly

Siq*_*Lin

2017-05-23

Votes: 6, Answers: 1, Views: 2138

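Counting leading zeros in a __m256i word

I'm tinkering with AVX-2 instructions and I'm looking for a fast way to count the number of leading zeros in a __m256i word (which has 256 bits).

So far, I have figured out the following way:

// Computes the number of leading zero bits.
// Here, avx_word is of type __m256i.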

if (!_mm256_testz_si256(avx_word, avx_word)) {
  uint64_t word = _mm256_extract_epi64(avx_word, 0);
  if (word > 0)
    return (__builtin_clzll(word));

  word = _mm256_extract_epi64(avx_word, 1);
  if (word > 0)
    return (__builtin_clzll(word) + 64);

  word = _mm256_extract_epi64(avx_word, 2);
  if (word > 0)
    return (__builtin_clzll(word) + 128);

  word = _mm256_extract_epi64(avx_word, 3);
  return (__builtin_clzll(word) + 192);
} else
  return 256; // word is entirely zero

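However, I find it rather clumsy to figure out the exact non-zero word in the 256-bit register.

Does anybody know whether there is a more elegant (or faster) way to do this?

As additional information: I actually want to compute the index of the first set bit for arbitrarily long vectors created by logical ANDs, and I am comparing the performance of standard 64-bit operations with SSE and AVX-2 code. Here is my entire test code: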

#include <stdio.h> …
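Since the test code is cut off here, the scalar 64-bit baseline the question mentions might look roughly like the sketch below. It is not the author's code: the function name, the array-based interface, and the convention that word 0 is scanned first with bits counted from the most significant end (mirroring the 256-bit snippet in the question) are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Index of the first set bit of (a AND b), scanning word 0 first and
 * counting from each word's most significant bit. Returns 64 * n when
 * no bit is set. Name and bit ordering are assumptions for illustration. */
static size_t first_set_bit_and(const uint64_t *a, const uint64_t *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        uint64_t w = a[i] & b[i];
        if (w != 0)  /* __builtin_clzll is undefined for 0, so test first */
            return 64 * i + (size_t)__builtin_clzll(w);
    }
    return 64 * n;
}
```

`__builtin_clzll` is a GCC/Clang builtin; a SIMD version would replace the per-word AND and zero test with vector operations and fall back to this scalar step only for the first non-zero word.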

c x86 simd intrinsics avx

Sve*_*ger

2018-03-11

Votes: 6, Answers: 1, Views: 871

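Most efficient popcount on __uint128_t?

I need to popcount an unsigned 128-bit variable in the most efficient (fastest) way.

OS: Linux / Debian 9
Compiler: GCC 8
CPU: Intel i7-5775C

If the solution is also more portable, even better.

First of all, there are two types in GCC, __uint128_t and unsigned __int128. I guess they end up being the same, and I see no reason to write the ugly unsigned __int128, so although it is supposed to be the newer type, I prefer the first one, which looks more like the standard uint64_t. Intel also has __uint128_t, which is another reason to use it (portability).

I have written the following code: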

#include <nmmintrin.h>
#include <stdint.h>

static inline   uint_fast8_t    popcnt_u128 (__uint128_t n)
{
    const uint64_t      n_hi    = n >> 64;
    const uint64_t      n_lo    = n;
    const uint_fast8_t  cnt_hi  = _mm_popcnt_u64(n_hi);
    const uint_fast8_t  cnt_lo  = _mm_popcnt_u64(n_lo);
    const uint_fast8_t  cnt     = cnt_hi + cnt_lo;

    return  cnt;
}

Is this the absolute fastest option?
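On the portability point, one possible variant (a sketch, not a measured answer to the speed question) splits the value the same way but uses the GCC/Clang builtin `__builtin_popcountll` instead of the `<nmmintrin.h>` intrinsic. The name `popcnt_u128_builtin` is made up here.

```c
#include <stdint.h>

/* Same split into two 64-bit halves as above, but via the compiler
 * builtin, so it compiles without SSE4.2 headers or -mpopcnt.
 * Hypothetical name; requires a compiler that provides __uint128_t. */
static inline unsigned popcnt_u128_builtin(__uint128_t n)
{
    const uint64_t hi = (uint64_t)(n >> 64);
    const uint64_t lo = (uint64_t)n;
    return (unsigned)__builtin_popcountll(hi)
         + (unsigned)__builtin_popcountll(lo);
}
```

When the target is known to support POPCNT (for example with `-mpopcnt` or `-march=native`), the compiler can emit the instruction for the builtin; otherwise it falls back to a software routine, so the code still builds on other targets.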

Edit:

I thought of another option, which may (or may not) be faster:

#include <nmmintrin.h>
#include <stdint.h>

union   Uint128 {
    __uint128_t …

c gcc x86-64 intel micro-optimization

Cac*_*ito

2019-03-06

Votes: 6, Answers: 1, Views: 319

Bit manipulation: keeping the common part to the left of the last different bit

Consider two numbers written in binary (MSB on the left):

X = x7 x6 x5 x4 x3 x2 x1 x0

and

Y = y7 y6 y5 y4 y3 y2 y1 y0

These numbers can have an arbitrary number of bits, but both are of the same type. Now suppose that x7 == y7, x6 == y6, x5 == y5, but x4 != y4.

How to compute:

Z = x7 x6 x5 0 0 0 0 0

Or, in other words, how can one efficiently compute a number that keeps the common part to the left of the last different bit?

template <typename T>
inline T f(const T x, const T y) 
{
    // Something here
}

For example, for:

x = 10100101
y = 10110010

it should return

z = 10100000

Note: this is for supercomputing purposes, and the operation will be executed billions of times, so scanning the bits one by one should be avoided...
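One branch-free way to do this, sketched below in C for a fixed 32-bit width (the question's template would need the shift sequence adapted to the width of T), is to XOR the inputs, smear the highest differing bit down to bit 0, and mask those positions out. The name `common_prefix` is hypothetical.

```c
#include <stdint.h>

/* Keep the bits of x that lie above the highest bit where x and y differ.
 * Fixed 32-bit sketch; hypothetical name. */
static inline uint32_t common_prefix(uint32_t x, uint32_t y)
{
    uint32_t d = x ^ y;   /* set bits mark positions where x and y differ */
    d |= d >> 1;          /* smear the highest differing bit downward ... */
    d |= d >> 2;
    d |= d >> 4;
    d |= d >> 8;
    d |= d >> 16;         /* ... so d now covers it and every lower bit */
    return x & ~d;        /* clear the differing bit and everything below */
}
```

For the example above, x ^ y is 00010111, smearing gives 00011111, and x & ~d yields 10100000. A count-leading-zeros instruction could replace the smear, at the cost of special-casing x == y.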

c c++ algorithm optimization bit-manipulation

Vin*_*ent

2014-02-03

Votes: 5, Answers: 1, Views: 276

Can I get rid of the sign extension between ctz and the addition to a pointer?

For code like this:

#include <stdint.h>

char* ptrAdd(char* ptr, uint32_t x)
{
    return ptr + (uint32_t)__builtin_ctz(x);
}

GCC generates a sign extension (godbolt link):

xor eax, eax
rep bsf eax, esi
cdqe ; sign-extend eax into rax
add rax, rdi
ret

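Of course, this is completely redundant: it blatantly sign-extends an unsigned integer. Can I convince GCC not to do this?

The problem has existed since GCC 4.9.0; before that it used to be an explicit zero extension, which is also redundant.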

x86 assembly gcc

har*_*old

2018-02-06

Votes: 5, Answers: 1, Views: 148

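NASM: counting how many bits in a 32-bit number are set to 1

I have a 32-bit number and want to know how many of its bits are 1.

I'm thinking of this pseudocode: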

mov eax, [number]
while(eax != 0)
{
  div eax, 2
  if(edx == 1)
  {
   ecx++;
  } 
  shr eax, 1
}

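Is there a more efficient way?

I'm using NASM on an x86 processor.

(I'm just beginning with assembler, so please don't tell me to use code from extern libraries, because I don't even know how to include them ;) )

(I just found "How to count the number of set bits in a 32-bit integer?", which also contains my solution. There are other solutions there, but unfortunately I can't figure out how I would write them in assembler.)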
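Note that the pseudocode as written halves eax twice per iteration (once with the division, once with the shift), so it would skip every other bit. As a sketch of a simpler loop, here is a C version that clears the lowest set bit on each pass, so it iterates once per set bit rather than once per bit position. The function name is made up, and the C form is only meant to show the logic before translating it to assembly.

```c
#include <stdint.h>

/* Count set bits by repeatedly clearing the lowest one.
 * Hypothetical name; loop runs once per set bit. */
static unsigned count_bits_loop(uint32_t n)
{
    unsigned count = 0;
    while (n != 0) {
        n &= n - 1;   /* n - 1 flips the lowest set bit and the zeros below it */
        ++count;
    }
    return count;
}
```

Each step needs only a subtract, an AND, and a test against zero, all of which map directly onto basic x86 instructions.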

x86 assembly bit-manipulation nasm

cit*_*nas

2017-05-23

Votes: 4, Answers: 3, Views: 9716

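VS: unexpected optimization behavior with the _BitScanReverse64 intrinsic

The following code works fine in debug mode, since _BitScanReverse64 is defined to return 0 if no bit is set. Citing MSDN: (the return value is) "Nonzero if Index was set, or 0 if no set bits were found."

If I compile this code in release mode it still works, but if I enable compiler optimizations such as /O1 or /O2, the index is not zero and the assert() fails.

#include <intrin.h>
#include <iostream>
#include <cassert>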

using namespace std;

int main()
{
  unsigned long index = 0;
  _BitScanReverse64(&index, 0x0ull);

  cout << index << endl;

  assert(index == 0);

  return 0;
}

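Is this the intended behaviour? I am using Visual Studio Community 2015, Version 14.0.25431.01 Update 3. (I left the cout in so that the variable index is not removed during optimization.) Also, is there an efficient workaround, or should I just not use this compiler intrinsic directly?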
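The quoted sentence describes the return value, not the contents of index, when no bit is set, so a cautious caller tests the return value before reading index. The sketch below imitates that interface in C using the GCC/Clang builtin `__builtin_clzll`; this is an assumption made so the example is self-contained, since the question itself targets MSVC. The helper name `bit_scan_reverse64` is hypothetical.

```c
#include <stdint.h>

/* Returns 1 and stores the index of the highest set bit in *index,
 * or returns 0 and leaves *index untouched when mask is 0.
 * Hypothetical stand-in for the intrinsic's interface. */
static inline unsigned char bit_scan_reverse64(unsigned long *index, uint64_t mask)
{
    if (mask == 0)
        return 0;   /* nothing found: the caller must not rely on *index */
    *index = 63u - (unsigned long)__builtin_clzll(mask);
    return 1;
}
```

With the real intrinsic the same calling pattern would apply: use index only inside `if (_BitScanReverse64(&index, value)) { ... }` and handle the zero case separately.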

c++ optimization x86-64 intrinsics visual-studio

Cam*_*eon

2019-11-24

Votes: 4, Answers: 1, Views: 584

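How can I get consistently high throughput in this loop?

While optimizing an inner loop, I ran into strange performance behaviour that I can neither understand nor correct.

A stripped-down version of the code follows. Roughly speaking, there is one huge array divided into 16-word blocks, and I simply add up the number of leading zeros of the words in each block. (In reality I'm using the popcnt code from Dan Luu, but here I picked a simpler instruction with similar performance characteristics for "brevity". Dan Luu's code is based on an answer to this SO question which, while it has tantalizingly similar odd results, does not seem to answer my question here.)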

// -*- compile-command: "gcc -O3 -march=native -Wall -Wextra -std=c99 -o clz-timing clz-timing.c" -*-
#include <stdint.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>

#define ARRAY_LEN 16

// Return the sum of the leading zeros of each element of the ARRAY_LEN
// words starting at u.
static inline uint64_t clz_array(const uint64_t u[ARRAY_LEN]) {
    uint64_t c0 = 0;
    for (int i = 0; i < ARRAY_LEN; ++i) {
        uint64_t t0;
        __asm__ ("lzcnt %1, …

c c++ performance assembly caching

Ham*_*ish

lucky-day

Votes: 4, Answers: 1, Views: 139

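How can __builtin_clz behave differently for the literal 0 and for 0 stored in a variable?

There is only one case where __builtin_clz gives the wrong answer. I'm curious what causes this behaviour.

When I use the literal value 0, I always get the expected value of 32. But 0 stored in a variable yields 31. Why does the way the value 0 is stored matter?

I've taken an architecture class but don't understand the diffed assembly. It looks like, when given the literal value 0, the assembly somehow always has the correct answer of 32 hard-coded, even without optimization. The method used to count the leading zeros also differs when using -march=native.

This post about emulating __builtin_clz with _BitScanReverse, and the line bsrl %eax, %eax, seem to imply that bit scan reverse does not work for 0.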

+-------------------+-------------+--------------+
|      Compile      | literal.cpp | variable.cpp |
+-------------------+-------------+--------------+
| g++               |          32 |           31 |
| g++ -O            |          32 |           32 |
| g++ -march=native |          32 |           32 |
+-------------------+-------------+--------------+

literal.cpp

#include <iostream>

int main(){
    int i = 0;
    std::cout << __builtin_clz(0) << std::endl;
}

variable.cpp

#include <iostream>

int main(){
    int i = 0;
    std::cout << __builtin_clz(i) << std::endl;
} …
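The GCC documentation states that the result of `__builtin_clz` is undefined when its argument is 0, so both 31 and 32 in the table are permitted outcomes. If 32 is the value wanted for zero, a small wrapper can make that explicit. This is a sketch in C with a made-up name (`clz32_defined`), not something taken from the question.

```c
#include <stdint.h>

/* Leading-zero count with a defined result for zero.
 * Hypothetical wrapper around the GCC/Clang builtin. */
static inline int clz32_defined(uint32_t x)
{
    /* The builtin is only called for non-zero x, where it is well defined. */
    return x ? __builtin_clz(x) : 32;
}
```

Calling `clz32_defined(i)` in place of `__builtin_clz(i)` removes the dependence on optimization level or `-march` for the zero case.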

c++ assembly gcc intrinsics undefined-behavior

Mat*_*att

2019-11-24

Votes: 4, Answers: 1, Views: 90

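Avoiding the unnecessary mov ecx, ecx instruction in bzhi(y, tzcnt(x))

I have a bit position (it is never zero), computed with tzcnt, and I want to zero the high bits starting from that position. Here is the C++ code and the disassembly (I'm using MSVC):

auto position = _tzcnt_u64(xxx);
auto masked = _bzhi_u64(yyy, static_cast<uint32_t>(position));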

tzcnt       rcx,rdx  
mov         ecx,ecx  
bzhi        rax,rbx,rcx

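BZHI accepts an unsigned int as its second argument, but only uses bits [7..0] of rcx, so I think this mov instruction is unnecessary.

I use the result to compute a popcount later, so I could also use something like <<(64-position) instead.

The problem is that the two versions have the same execution time, although bzhi should execute faster than sub+shlx, so the mov may be what makes the difference.

Is there a way to avoid it, or is this a compiler thing?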
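As a side note, the combination bzhi(y, tzcnt(x)) keeps exactly the bits of y that lie below the lowest set bit of x, and that mask can also be built without computing a bit index at all, which leaves no 32-bit count to zero-extend. The sketch below shows this in plain C; the function name is hypothetical, and whether a compiler turns it into faster code than the intrinsic pair has not been measured here.

```c
#include <stdint.h>

/* Keep the bits of y strictly below the lowest set bit of x.
 * For x == 0 the mask is all ones, matching bzhi with an index of 64.
 * Hypothetical name, for illustration only. */
static inline uint64_t keep_below_lowest_set(uint64_t y, uint64_t x)
{
    uint64_t below = ~x & (x - 1);  /* ones exactly below x's lowest set bit */
    return y & below;
}
```

For example, with x = 0b1000 the mask is 0b0111, so only the three lowest bits of y survive, the same result as bzhi(y, 3).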

c++ assembly bit-manipulation compiler-optimization visual-c++

Mar*_*rka

2020-05-25

Votes: 4, Answers: 1, Views: 562

Tag statistics

assembly ×6

c++ ×5

c ×4

x86 ×4

bit-manipulation ×3

gcc ×3

intrinsics ×3

optimization ×2

x86-64 ×2

algorithm ×1

avx ×1

caching ×1

compiler-optimization ×1

hardware ×1

intel ×1

micro-optimization ×1

nasm ×1

performance ×1

simd ×1

undefined-behavior ×1

visual-c++ ×1

visual-studio ×1
