Posts by Mic*_*lli

Why does setting a descriptor on a class overwrite the descriptor?

Simple repro:

class VocalDescriptor(object):
    def __get__(self, obj, objtype):
        print('__get__, obj={}, objtype={}'.format(obj, objtype))
    def __set__(self, obj, val):
        print('__set__')

class B(object):
    v = VocalDescriptor()

B.v # prints "__get__, obj=None, objtype=<class '__main__.B'>"
B.v = 3 # does not print "__set__", evidently does not trigger descriptor
B.v # does not print anything, we overwrote the descriptor


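This question has an effective duplicate, but the duplicate was never answered, so I dug further into the CPython source as a learning exercise. Warning: I went into the weeds. I'm really hoping for help from a captain who knows these waters. I have tried to trace the calls I was looking up as explicitly as possible, for my own future benefit and for the benefit of future readers.

I've seen a lot of ink spilled over the behavior of __getattribute__ applied to descriptors, e.g. lookup precedence. The Python snippet in "Invoking Descriptors", just below "For classes, the machinery is in type.__getattribute__()...", roughly agrees in my mind with what I believe is the corresponding CPython source in type_getattro, which I tracked down by looking at "tp_slots" and then at where tp_getattro is populated. B.v …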
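The behavior in the repro can be checked with a small sketch. It is not taken from the post: it records calls in a list instead of printing, and the `Meta` and `C` names are hypothetical additions. It relies on the fact that attribute assignment looks for a data descriptor on the type of the object being assigned to, so `b.v = 3` consults `B`, while `B.v = 3` consults `type(B)`.

```python
# Record descriptor calls instead of printing, so the behavior can be asserted.
calls = []

class VocalDescriptor:
    def __get__(self, obj, objtype=None):
        calls.append(("get", obj, objtype))

    def __set__(self, obj, val):
        calls.append(("set", obj, val))

class B:
    v = VocalDescriptor()

b = B()
b.v = 3   # 'v' is looked up on type(b) == B: data descriptor found, __set__ runs
B.v = 3   # 'v' is looked up on type(B) == type: nothing found, B.__dict__['v'] is rebound

# Hypothetical metaclass (not in the original post) that owns the descriptor.
class Meta(type):
    v = VocalDescriptor()

class C(metaclass=Meta):
    pass

C.v = 3   # 'v' is looked up on type(C) == Meta: data descriptor found, __set__ runs

print([kind for kind, _, _ in calls])
print(B.__dict__["v"], "v" in C.__dict__)
```

Under this reading, intercepting assignment on a class requires the descriptor to live on the metaclass, in the same way that intercepting assignment on an instance requires it to live on the class.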

python cpython python-descriptors

Mic*_*lli

2019 10-17

10
Votes

1
Answers

137
Views

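How does #pragma simd reduction(<operator>:<variable>) work under the hood?

I would like to understand in more detail how the simd reduction clause used by the Intel compiler works.

In particular, for a loop of the form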

double x = x_initial;
#pragma simd reduction(<operator1>:x)
for( int i = 0; i < N; i++ )
  x <operator2> some_value;


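My naive guess is as follows: the compiler initializes a private copy of x for each vector lane, then steps through the loop one vector width at a time. For example, if the vector width is 4 doubles, this corresponds to N/4 iterations plus a peel loop at the end. At each step of the iteration, each lane's private copy of x is updated using operator2, and at the end the private copies from the 4 vector lanes are combined using operator1. The autovectorization guide does not seem to address this directly.

I ran some experiments and found that some results agree with my expectations while others do not. For example, I tried this case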

double x = 1;
#pragma simd reduction(*:x) assert
for( int i = 0; i < 16; i++ )
  x += a[i];  // All elements of a are equal to 3.0
cout << "x after (*:x), x += a[i] loop:  " << x << endl;


where operator1 is * and operator2 is +=. When I compile for avx2, which has a vector width of 4 doubles, the output is 28561 …
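The naive lane model described above can be simulated in plain Python to see what it predicts for this experiment. This is only a sketch of that guess, not a description of what the Intel compiler actually emits. The function name `simulate_lane_reduction`, the choice to start each lane at the identity of operator1, and running the remainder loop after the lanes are combined are all assumptions.

```python
import operator

def simulate_lane_reduction(x_initial, values, width, combine, update, identity):
    """Model of the guess: per-lane private copies, then a final combine.

    combine  -- operator1 from the reduction clause (used to merge the lanes)
    update   -- operator2 from the loop body (used inside each lane)
    identity -- assumed initial value of each private copy (identity of operator1)
    """
    lanes = [identity] * width
    n_vector = len(values) - len(values) % width
    # Main loop: one vector width per step, each lane updates its own copy.
    for start in range(0, n_vector, width):
        for lane in range(width):
            lanes[lane] = update(lanes[lane], values[start + lane])
    # Combine the private copies with the original value of x.
    result = x_initial
    for lane_value in lanes:
        result = combine(result, lane_value)
    # Remainder loop for elements that do not fill a whole vector (assumed serial).
    for value in values[n_vector:]:
        result = update(result, value)
    return result

a = [3.0] * 16

# reduction(*:x) with x += a[i]: the mismatched case from the experiment.
mixed = simulate_lane_reduction(1.0, a, 4, operator.mul, operator.add, 1.0)

# reduction(+:x) with x += a[i]: the matched case, for comparison.
matched = simulate_lane_reduction(1.0, a, 4, operator.add, operator.add, 0.0)

print(mixed, matched)
```

In this model each of the 4 lanes starts at 1.0 and adds 3.0 four times, giving 13.0 per lane, and multiplying the lanes gives 13^4 = 28561, which agrees with the output quoted above. The matched case gives 49.0, the same as the serial loop.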

intel simd vectorization avx

Mic*_*lli


5
Votes

0
Answers

908
Views

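Strange cudaMemcpyAsync behavior: 1. cudaMemcpyKind makes no difference. 2. The copy fails, but silently

I'm getting familiar with a new cluster equipped with Pascal P100 GPUs + NVLink. I wrote a ping-pong program to test gpu <-> gpu and gpu <-> cpu bandwidth and peer-to-peer access. (I know the cuda samples contain such a program, but I wanted to write it myself to understand it better.) The NVLink bandwidth looks reasonable (about 35 GB/s bidirectional, against a theoretical maximum of 40). However, while debugging the ping-pong I found some strange behavior.

First, cudaMemcpyAsync succeeds no matter which cudaMemcpyKind I specify. For example, if cudaMemcpyAsync is copying memory from host to device, it succeeds even if I pass cudaMemcpyDeviceToHost as the kind.

Second, when the host memory is not page-locked, cudaMemcpyAsync does the following:

Copying memory from host to device appears to succeed (no segfaults or cuda runtime errors, and the data appears to transfer correctly).
Copying memory from device to host fails silently: no segfault occurs, and cudaDeviceSynchronize after the memcpy returns cudaSuccess, but checking the data shows that the data on the gpu was not transferred correctly to the host.

Is this behavior expected? I have included minimal working example code that demonstrates it on my system (the example is not the ping-pong application; all it does is test cudaMemcpyAsync with various parameters).

The P100s have UVA enabled, so it seems plausible to me that cudaMemcpyAsync simply infers the locations of the src and dst pointers and ignores the cudaMemcpyKind argument. However, I'm not sure why cudaMemcpyAsync fails to raise an error for non-page-locked host memory. I was under the impression that this was a strict no-no.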

#include <stdio.h>
#include <cuda_runtime.h>
#include <stdlib.h>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

__global__ void checkDataDevice( int* current, int* next, int expected_current_val, int n )
{
  int tid = threadIdx.x + …


cuda nvlink uva

Mic*_*lli


0
Votes

1
Answers

674
Views