Why does Knuth use this clunky decrement?

Ed *_*ynn · 12 · tags: c, knuth, increment, literate-programming, decrement

I'm looking at some code by Prof. Don Knuth, written in CWEB and converted to C. The specific example is dlx1.w, available from Knuth's website.

At one point, the .len value of struct nd[cc] is decremented, and this is done in a clunky way:

  o,t=nd[cc].len-1;
  o,nd[cc].len=t;

(This is a Knuth-specific question, so perhaps you already know that "o" is a preprocessor macro for incrementing "mems", a running total of the work done, measured in accesses to 64-bit words.) The value left in "t" is definitely not used for anything else. (The example here is at line 665 of dlx1.w, or line 193 of dlx1.c after ctangle.)

My question is: why does Knuth write it this way, rather than

nd[cc].len--;

which he does use elsewhere (line 551 of dlx1.w):

oo,nd[k].len--,nd[k].aux=i-1;

(And "oo" is a similar macro for incrementing "mems" twice. But there is some subtlety here, because .len and .aux are stored in the same 64-bit word, so assigning values to both S.len and S.aux would ordinarily count as only one increment of mems.)

My only theory is that the decrement consists of two memory accesses: first a lookup, then an assignment. (Is that correct?) This way of writing it is a reminder of the two steps. That would be uncharacteristically verbose of Knuth, so perhaps it is an instinctive aide-mémoire rather than didacticism.

For what it's worth, I've searched the CWEB documentation without finding an answer. My question probably relates more to Knuth's standard practices, which I'm gradually picking up. I'd be interested in any resources where these practices are laid out (and perhaps criticized) as a block — but for now, let's concentrate on why Knuth wrote it this way.

Shr*_*saR 6

A preliminary remark: with Knuth-style literate programming (i.e. when reading WEB or CWEB programs) the "real" program, as conceived by Knuth, is neither the "source" .w file nor the generated (tangled) .c file, but the typeset (woven) output. The source .w file is best thought of as a means to produce it (and of course also the .c source that's fed to the compiler). (If you don't have cweave and TeX handy, I've typeset some of these programs here; this program DLX1 is here.)


So in this case, I\'d describe the location in the code as module 25 of DLX1, or subroutine "cover":


The question


Anyway, to return to the actual question: note that this (DLX1) is one of the programs written for The Art of Computer Programming. Because reporting the time taken by a program in "seconds" or "minutes" becomes meaningless from year to year, he reports how long a program took in number of "mems" plus "oops", dominated by the "mems", i.e. the number of memory accesses to 64-bit words (usually). So the book contains statements like "this program finds the answer to this problem in 3.5 gigamems of running time". Further, the statements are intended to be fundamentally about the program/algorithm itself, not the specific code generated by a specific version of a compiler for certain hardware. (Ideally when the details are very important he writes the program in MMIX or MMIXAL and analyses its operations on the MMIX hardware, but this is rare.) Counting the mems (to be reported as above) is the purpose of inserting o and oo instructions into the program. Note that it's more important to get this right for the "inner loop" instructions that are executed many times, such as everything in the subroutine cover in this case.


This is elaborated in Section 1.3.1′ (part of Fascicle 1):


Timing. […] The running time of a program depends not only on the clock rate but also on the number of functional units that can be active simultaneously and the degree to which they are pipelined; it depends on the techniques used to prefetch instructions before they are executed; it depends on the size of the random-access memory that is used to give the illusion of 2⁶⁴ virtual bytes; and it depends on the sizes and allocation strategies of caches and other buffers, etc., etc.


For practical purposes, the running time of an MMIX program can often be estimated satisfactorily by assigning a fixed cost to each operation, based on the approximate running time that would be obtained on a high-performance machine with lots of main memory; so that's what we will do. Each operation will be assumed to take an integer number of υ, where υ (pronounced "oops") is a unit that represents the clock cycle time in a pipelined implementation. Although the value of υ decreases as technology improves, we always keep up with the latest advances because we measure time in units of υ, not in nanoseconds. The running time in our estimates will also be assumed to depend on the number of memory references or mems that a program uses; this is the number of load and store instructions. For example, we will assume that each LDO (load octa) instruction costs µ + υ, where µ is the average cost of a memory reference. The total running time of a program might be reported as, say, 35µ + 1000υ, meaning "35 mems plus 1000 oops." The ratio µ/υ has been increasing steadily for many years; nobody knows for sure whether this trend will continue, but experience has shown that µ and υ deserve to be considered independently.


And he does of course understand the difference from reality:


Even though we will often use the assumptions of Table 1 for seat-of-the-pants estimates of running time, we must remember that the actual running time might be quite sensitive to the ordering of instructions. For example, integer division might cost only one cycle if we can find 60 other things to do between the time we issue the command and the time we need the result. Several LDB (load byte) instructions might need to reference memory only once, if they refer to the same octabyte. Yet the result of a load command is usually not ready for use in the immediately following instruction. Experience has shown that some algorithms work well with cache memory, and others do not; therefore µ is not really constant. Even the location of instructions in memory can have a significant effect on performance, because some instructions can be fetched together with others. […] Only the meta-simulator can be trusted to give reliable information about a program's actual behavior in practice; but such results can be difficult to interpret, because infinitely many configurations are possible. That's why we often resort to the much simpler estimates of Table 1.


Finally, we can use Godbolt's Compiler Explorer to look at the code generated for this by a typical compiler. (Ideally we'd look at MMIX instructions, but as we can't do that, let's take the default, which appears to be x86-64 gcc 8.2.) I removed all the o's and oo's.


For this version of the code:

  /*o*/ t = nd[cc].len - 1;
  /*o*/ nd[cc].len = t;

the code generated for the first line is:

  movsx rax, r13d
  sal rax, 4
  add rax, OFFSET FLAT:nd+8
  mov eax, DWORD PTR [rax]
  lea r14d, [rax-1]

and for the second line:

  movsx rax, r13d
  sal rax, 4
  add rax, OFFSET FLAT:nd+8
  mov DWORD PTR [rax], r14d

And for this version of the code:

  /*o ?*/ nd[cc].len--;

the generated code is:

  movsx rax, r13d
  sal rax, 4
  add rax, OFFSET FLAT:nd+8
  mov eax, DWORD PTR [rax]
  lea edx, [rax-1]
  movsx rax, r13d
  sal rax, 4
  add rax, OFFSET FLAT:nd+8
  mov DWORD PTR [rax], edx

As you can see (even without knowing much x86-64 assembly), this is simply the concatenation of the code generated in the former case (except that it uses register edx instead of r14d), so it's not as if writing the decrement on one line has saved you any mems. In particular, it would not be correct to count it as a single mem, especially in a subroutine like cover that is called a huge number of times in this algorithm (dancing links for exact cover).


So the version as written by Knuth is correct, as it goes with the goal of counting the number of mems. He could also write oo,nd[cc].len--; (counting two mems) as you observed, but in that case it might look like a bug at first sight. (BTW, in your example oo,nd[k].len--,nd[k].aux=i-1; the two mems come from the load and the store in --; not from two stores.)
