Which is faster for x86-64, imm64 or m64?

Question

Spi*_*ngo 2 optimization x86 assembly x86-64 micro-optimization

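I tested about 10 billion times to see whether imm64 is faster than m64 on AMD64; the difference is about 0.1 nanoseconds. m64 seems to be faster, but I don't really understand why. Isn't the address of val_ptr in the following code itself an immediate value?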

# Text section
.section __TEXT,__text,regular,pure_instructions
# 64-bit code
.code64
# Intel syntax
.intel_syntax noprefix
# Target macOS High Sierra
.macosx_version_min 10,13,0

# Make those two test functions global for the C measurer
.globl _test1
.globl _test2

# Test 1, imm64
_test1:
  # Move the immediate value 0xDEADBEEFFEEDFACE to RAX (return value)
  movabs rax, 0xDEADBEEFFEEDFACE
  ret
# Test 2, m64
_test2:
  # Move from the RAM (val_ptr) to RAX (return value)
  mov rax, qword ptr [rip + val_ptr]
  ret
# Data section
.section __DATA,__data
val_ptr:
  .quad 0xDEADBEEFFEEDFACE

The measurement code is:

#include <stdio.h>            // For printf
#include <stdlib.h>           // For EXIT_SUCCESS
#include <math.h>             // For fabs
#include <stdint.h>           // For uint64_t
#include <stddef.h>           // For size_t
#include <string.h>           // For memset
#include <mach/mach_time.h>   // For time stuff

#define FUNCTION_COUNT  2     // Number of functions to test
#define TEST_COUNT      0x10000000  // Number of times to test each function

// Type aliases
typedef uint64_t rettype_t;
typedef rettype_t(*function_t)();

// External test functions (defined in Assembly)
rettype_t test1();
rettype_t test2();

// Program entry point
int main() {

  // Time measurement stuff
  mach_timebase_info_data_t info;
  mach_timebase_info(&info);

  // Sums to divide by the test count to get average
  double sums[FUNCTION_COUNT];

  // Initialize sums to 0
  memset(&sums, 0, FUNCTION_COUNT * sizeof (double));

  // Functions to test
  function_t functions[FUNCTION_COUNT] = {test1, test2};

  // Useless results (should be 0xDEADBEEFFEEDFACE), but good to have
  rettype_t results[FUNCTION_COUNT];

  // Function loop, may get unrolled based on optimization level
  for (size_t test_fn = 0; test_fn < FUNCTION_COUNT; test_fn++) {
    // Test this MANY times
    for (size_t test_num = 0; test_num < TEST_COUNT; test_num++) {
      // Get the nanoseconds before the action
      double nanoseconds = mach_absolute_time();
      // Do the action
      results[test_fn] = functions[test_fn]();
      // Measure the time it took
      nanoseconds = mach_absolute_time() - nanoseconds;

      // Convert it to nanoseconds
      nanoseconds *= info.numer;
      nanoseconds /= info.denom;

      // Add the nanosecond count to the sum
      sums[test_fn] += nanoseconds;
    }
  }
  // Compute the average
  for (size_t i = 0; i < FUNCTION_COUNT; i++) {
    sums[i] /= TEST_COUNT;
  }

  if (FUNCTION_COUNT == 2) {
    // Print some fancy information
    printf("Test 1 took %f nanoseconds average.\n", sums[0]);
    printf("Test 2 took %f nanoseconds average.\n", sums[1]);
    printf("Test %d was faster, with %f nanoseconds difference\n", sums[0] < sums[1] ? 1 : 2, fabs(sums[0] - sums[1]));
  } else {
    // Else, just print something
    for (size_t fn_i = 0; fn_i < FUNCTION_COUNT; fn_i++) {
      printf("Test %zu took %f nanoseconds average.\n", fn_i + 1, sums[fn_i]);
    }
  }

  // Everything went fine!
  return EXIT_SUCCESS;
}

So, which one is really faster, m64 or imm64?

By the way, I'm using an Intel Core i7 Ivy Bridge with DDR3 RAM. I'm running macOS High Sierra.

EDIT: I inserted the ret instructions, and now imm64 turns out to be faster.

Answer 1

Pet*_*des 5

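You don't show the actual loop you tested, and you don't say how you measured time. Apparently you measured wall-clock time, not core clock cycles (using performance counters). So your sources of measurement noise include turbo / power-saving frequency changes, as well as sharing a physical core with another logical thread (on an i7).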
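One way to reduce part of that noise is to time a whole batch of calls with a single pair of timer reads, because a timer call costs far more than the one instruction being measured. The following is a minimal sketch under stated assumptions, not the asker's or the answerer's method: test_imm and test_mem are hypothetical C stand-ins for the assembly routines, average_ns is a made-up helper name, and POSIX clock_gettime replaces mach_absolute_time for portability. It still measures wall-clock time, so turbo and hyperthreading effects remain; counting core clock cycles would require performance counters, which are not shown here.

```c
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime under strict ISO C modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-ins for the assembly routines _test1 and _test2. */
static volatile uint64_t val = 0xDEADBEEFFEEDFACEULL;

uint64_t test_imm(void) { return 0xDEADBEEFFEEDFACEULL; }
uint64_t test_mem(void) { return val; }  /* volatile forces a real load */

/* Monotonic wall-clock time in nanoseconds. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Time `iterations` back-to-back calls with one pair of timer reads and
 * return the average wall-clock cost of one call in nanoseconds. */
double average_ns(uint64_t (*fn)(void), uint64_t iterations,
                  uint64_t *last_result) {
    assert(fn != NULL && iterations > 0);
    volatile uint64_t sink = 0;  /* keeps the call results observable */
    uint64_t start = now_ns();
    for (uint64_t i = 0; i < iterations; i++) {
        sink = fn();
    }
    uint64_t elapsed = now_ns() - start;
    if (last_result != NULL) {
        *last_result = sink;
    }
    return (double)elapsed / (double)iterations;
}
```

Calling average_ns(test_imm, n, &r) and average_ns(test_mem, n, &r) from main gives one average per function. An optimizing compiler may inline these C stand-ins, so linking against the original assembly routines is closer to what the question measures.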

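On Intel Ivy Bridge:

movabs rax, 0xDEADBEEFFEEDFACE is an ALU instruction

Takes 10 bytes of code size (which may or may not matter, depending on the surrounding code).
Decodes to 1 uop for any ALU port (p0, p1, or p5). (Max throughput = 3 per clock.)
Takes 2 entries in the uop cache (because of the 64-bit immediate), and takes 2 cycles to read from the uop cache. (So running from the loop buffer is a significant advantage for front-end throughput, if that is the bottleneck in the code containing it.)

mov rax, [RIP + val_ptr] is a load

Takes 7 bytes (REX + opcode + modrm + rel32).
Decodes to 1 uop for either load port (p2 or p3). (Max throughput = 2 per clock.)
Fits in 1 entry in the uop cache (no immediate, and a 32 or 32small address offset).
Runs a lot slower if the load is split across a page boundary, even on Skylake.
May miss in cache the first time.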
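The two code sizes above can be checked by writing out the encodings byte by byte. This is a minimal sketch for illustration: the helper names encode_movabs_rax and encode_mov_rax_riprel are hypothetical, and the byte values are the standard x86-64 encodings for these two forms with RAX as the destination (REX.W + B8 + imm64, and REX.W + 8B + ModRM + rel32).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode `movabs rax, imm64`; returns the instruction length in bytes. */
size_t encode_movabs_rax(uint8_t out[10], uint64_t imm) {
    assert(out != NULL);
    out[0] = 0x48;  /* REX.W: 64-bit operand size */
    out[1] = 0xB8;  /* opcode B8+rd, rd = 0 selects RAX */
    for (int i = 0; i < 8; i++) {
        out[2 + i] = (uint8_t)(imm >> (8 * i));  /* imm64, little-endian */
    }
    return 10;      /* 1 (REX) + 1 (opcode) + 8 (imm64) */
}

/* Encode `mov rax, [rip + rel32]`; returns the instruction length in bytes. */
size_t encode_mov_rax_riprel(uint8_t out[7], int32_t rel32) {
    assert(out != NULL);
    uint32_t disp = (uint32_t)rel32;
    out[0] = 0x48;  /* REX.W: 64-bit operand size */
    out[1] = 0x8B;  /* opcode: MOV r64, r/m64 */
    out[2] = 0x05;  /* ModRM: mod=00, reg=000 (RAX), rm=101 (RIP + disp32) */
    for (int i = 0; i < 4; i++) {
        out[3 + i] = (uint8_t)(disp >> (8 * i));  /* rel32, little-endian */
    }
    return 7;       /* 1 (REX) + 1 (opcode) + 1 (ModRM) + 4 (rel32) */
}
```

The load is 3 bytes shorter, but the 8 bytes of data it reads live elsewhere, so the total footprint is larger unless the constant is shared by several instructions.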

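Sources: Agner Fog's microarch pdf and instruction tables. See Table 9.1 for the uop-cache details. See also the other performance links in the x86 tag wiki.

Compilers usually choose to generate 64-bit constants with mov r64, imm64. (Related: What are the best instruction sequences to generate vector constants on the fly?, but in practice those sequences never come up for scalar integers, because there is no short single-instruction way to get a 64-bit -1.)

That is generally the right choice, although in a long-running loop where you expect the constant to stay hot in cache, loading it from .rodata can be worthwhile. Especially if that lets you do something like and rax, [constant] instead of movabs r8, imm64 / and rax, r8.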
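The two alternatives in the last paragraph can be tried from C and compared in the compiler's assembly output (for example with -S). This is only a sketch under assumptions: mask_in_memory and the function names are hypothetical, const volatile is used here merely to stop the compiler from folding the load back into an immediate, and which instructions and which section a given compiler actually picks depends on the compiler and its flags.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant kept in memory; `const volatile` prevents the
 * compiler from replacing the load with an immediate. */
static const volatile uint64_t mask_in_memory = 0xDEADBEEFFEEDFACEULL;

/* The mask is a literal: a compiler will typically materialize it with
 * mov r64, imm64 and then AND two registers. */
uint64_t and_with_immediate(uint64_t x) {
    return x & 0xDEADBEEFFEEDFACEULL;
}

/* The mask is loaded from memory: a compiler may use a memory-source AND
 * or a separate load followed by a register AND. */
uint64_t and_with_memory(uint64_t x) {
    return x & mask_in_memory;
}

/* Sanity check that both variants compute the same value. */
void check_masks_agree(uint64_t x) {
    assert(and_with_immediate(x) == and_with_memory(x));
}
```

Both functions return the same result; only the generated instruction sequence differs, which is the trade-off discussed above.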

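If your 64-bit constant is an address, use a RIP-relative lea instead whenever possible: lea rax, [rel my_symbol] in NASM syntax, or lea my_symbol(%rip), %rax in AT&T syntax.

When considering tiny sequences of asm, the surrounding code matters a lot, especially when the alternatives compete for different throughput resources.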
