False sharing and pthreads

Question

I have the following task to demonstrate false sharing, and I wrote a simple program:

#include <sys/times.h>
#include <time.h>
#include <stdio.h> 
#include <pthread.h> 

long long int tmsBegin1,tmsEnd1,tmsBegin2,tmsEnd2,tmsBegin3,tmsEnd3;

int array[100];

void *heavy_loop(void *param) { 
  int   index = *((int*)param);
  int   i;
  for (i = 0; i < 100000000; i++)
    array[index]+=3;
  return NULL;  /* a pthread start routine must return a void* */
}

int main(int argc, char *argv[]) { 
  int       first_elem  = 0;
  int       bad_elem    = 1;
  int       good_elem   = 32;
  long long time1;
  long long time2;
  long long time3;
  pthread_t     thread_1;
  pthread_t     thread_2;

  tmsBegin3 = clock();
  heavy_loop((void*)&first_elem);
  heavy_loop((void*)&bad_elem);
  tmsEnd3 = clock();

  tmsBegin1 = clock();
  pthread_create(&thread_1, NULL, heavy_loop, (void*)&first_elem);
  pthread_create(&thread_2, NULL, heavy_loop, (void*)&bad_elem);
  pthread_join(thread_1, NULL);
  pthread_join(thread_2, NULL);
  tmsEnd1 = clock(); 

  tmsBegin2 = clock();
  pthread_create(&thread_1, NULL, heavy_loop, (void*)&first_elem);
  pthread_create(&thread_2, NULL, heavy_loop, (void*)&good_elem);
  pthread_join(thread_1, NULL);
  pthread_join(thread_2, NULL);
  tmsEnd2 = clock();

  printf("%d %d %d\n", array[first_elem],array[bad_elem],array[good_elem]);
  time1 = (tmsEnd1-tmsBegin1)*1000/CLOCKS_PER_SEC;
  time2 = (tmsEnd2-tmsBegin2)*1000/CLOCKS_PER_SEC;
  time3 = (tmsEnd3-tmsBegin3)*1000/CLOCKS_PER_SEC;
  printf("%lld ms\n", time1);
  printf("%lld ms\n", time2);
  printf("%lld ms\n", time3);

  return 0; 
}

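I was very surprised when I saw the results (I ran it on an i5-430M processor).

With false sharing, it took 1020 ms.
Without false sharing, it took 710 ms, only 30% faster instead of 300% (some sites claim it should be 300-400% faster).
Without pthreads, it took 580 ms.

Please point out my mistake or explain why this happens.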

Answer 1

Jay*_*rod 21

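False sharing is a result of multiple cores with separate caches accessing the same region of physical memory (although not the same address; that would be true sharing).

To understand false sharing, you need to understand caches. In most processors, each core has its own L1 cache, which holds recently accessed data. Caches are organized in "lines", which are aligned chunks of data, usually 32 or 64 bytes long (depending on your processor). When you read from an address that is not in the cache, the whole line is read from main memory (or an L2 cache) into L1. When you write to an address in the cache, the line containing that address is marked "dirty".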

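Here is where the sharing aspect comes in. If multiple cores are reading from the same line, each of them can have a copy of the line in its L1. However, if one copy is marked dirty, it invalidates the line in the other caches. If this didn't happen, writes made on one core might not become visible to other cores until much later. So the next time another core reads from that line, the cache misses and the line has to be fetched again.

False sharing occurs when the cores are reading and writing different addresses on the same line. Even though they are not sharing data, the caches act as if they were, because the addresses are so close together.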

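This effect is highly dependent on the architecture of your processor. If you had a single-core processor, you would not see the effect at all, since there would be no sharing. If your cache lines were longer, you would see the effect in both the "bad" and the "good" case, since the elements would still be close together. If your cores did not share an L2 cache (which I'm guessing they do), you might see the 300-400% difference you mentioned, since they would have to go all the way to main memory on a cache miss.

You might also like to know that it matters that each thread is both reading and writing (+= instead of =). Some processors have write-through caches, which means that if a core writes to an address that is not in the cache, it does not miss and fetch the line from memory. Contrast this with write-back caches, which do miss on writes.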

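@AlexeyMatveev: OK, here is what I would do to improve the test. Get rid of `clock()` (which is a rough approximation) and replace it with a high-precision hardware timer. If you happen to be running on Linux, you can use `clock_gettime` with the `CLOCK_MONOTONIC_RAW` flag (see https://bitbucket.org/Yocto/yocto/src/9cec50caf923/include/yocto/stopwatch.hpp and https://bitbucket.org/Yocto/yocto/src/9cec50caf923/src/stopwatch.cpp for an example).... (2 upvotes)
Disable CPU throttling and come up with an operation that takes at least 1.5 seconds in single-threaded mode (otherwise power saving will mess up your benchmark during the first few seconds). Then make sure the compiler does not apply optimizations where they are not wanted. For example, I would say `i` must be volatile to avoid unrolling, and the same goes for the array, otherwise the compiler may decide to throw your loop away entirely. Finally, you must measure the time inside `heavy_loop` to exclude thread-management overhead. (2 upvotes)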

Answer 2

Jai*_* MJ 4

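A brief note on the clock() function in C: it gives you the number of CPU clocks the program consumed from start to end, not the wall-clock time that elapsed. So when you run two threads in parallel, the count is the clock cycles of CPU1 plus the clock cycles of CPU2.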

I think what you want is a real-time (wall-clock) timer. For that, use

clock_gettime()

and you should get the expected output.

I ran your code with clock_gettime() and got this:

With false sharing: 874.587381 ms
Without false sharing: 331.844278 ms
Sequential computation: 604.160276 ms
