从Common Lisp/SBCL中挤出更快的速度

Question

从Common Lisp/SBCL中挤出更快的速度

本文声称使某个Lisp程序的运行速度比其等效的C语言快.试图重现结果,我能够接近(Lisp比C慢50%),但想知道是否有人知道如何从SBCL 1.3.1中挤出更多的信息.

目标问题是为800 x 800单浮点数组中的每个单元添加一个常量单浮点数.方法是用C和Common Lisp编写程序并比较时间.使用这个便携式计时器,C代码如下:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <time.h>

#include "./modules/tictoc/tictoc.h"

const int HORZ = 800;
const int VERT = 800;

#define PERF_REPS 1000

typedef float DATA_T;

struct image_s {
    size_t n;
    size_t w, h;
    DATA_T * data;
};
typedef struct image_s image;

image * make_image (size_t w, size_t h) {
    size_t n = w * h;
    DATA_T * data = (DATA_T *)malloc(sizeof(DATA_T) * n);
    assert (NULL != data);
    image * result = (image *)malloc(sizeof(image));
    assert (NULL != result);
    result->n = n;
    result->w = w;
    result->h = h;
    result->data = data;
    return result;
}

void free_image (image * it) {
    assert (NULL != it);
    assert (NULL != it->data);
    free (it->data);
    free (it);
}

image * init_to_value (image * it, DATA_T val) {
    assert (NULL != it);
    assert (NULL != it->data);
    size_t i;
    const size_t n = it->n;
    for (i = 0; i < n; ++i) {
        it->data[i] = val;
    }
    return it;
}

void add (image * to, image * from, DATA_T val) {
    assert (NULL != to);
    assert (NULL != to->data);
    assert (NULL != from);
    assert (NULL != from->data);
    size_t i;
    const size_t n = to->n;
    for (i = 0; i < n; ++i) {
        to->data[i] = from->data[i] + val;
    }
}

int main (int argc, char ** argv) {
    image * from = init_to_value (make_image (HORZ, VERT), 0.0f);
    image * to = init_to_value (make_image (HORZ, VERT), 0.0f);
    TicTocTimer clock = tic();
    for (size_t i = 0; i < PERF_REPS; ++i)
        add (to, from, 42.0);
    printf("Elapsed time %f seconds.\n",toc(&clock));
    free_image (to);
    free_image (from);
    return 0;
}

Run Code Online (Sandbox Code Playgroud)

我编译并运行代码如下:

gcc -O3 image-add.c ./modules/tictoc/libtictoc.a && ./a.out

Run Code Online (Sandbox Code Playgroud)

我的mac book pro上的典型时间约为0.178秒.挺棒的.

使用我能在Lisp hyperspec中找到的每个选项,在新书Common Lisp Recipes和SBCL用户手册中使用的等效Lisp代码如下所示.评论表明我试过的一些事情并没有什么不同.

;;; None of the commented-out declarations made any difference in speed. 

(declaim (optimize speed (safety 0)))

(defun image-add (to from val)
  (declare (type (simple-array single-float (*))
                 to from))
  (declare (type single-float val))
  (let ((size (array-dimension to 0)))
    ;(declare (type fixnum size))
    (dotimes (i size)
      ;(declare (type fixnum i))
      (setf (aref to i) (+ (aref from i) val)))))

(defparameter HORZ 800)
(defparameter VERT 800)

(defparameter PERF-REPS 1000)

(let ((to (make-array (* HORZ VERT) :element-type 'single-float))
      (fm (make-array (* HORZ VERT) :element-type 'single-float)))
  ;(declare (type fixnum HORZ))
  ;(declare (type fixnum VERT))
  (time (dotimes (_ PERF-REPS)
          ;(declare (type fixnum PERF-REPS))
          ;(declare (type fixnum _))
          ;(declare (inline image-add))
          (image-add to fm 42.0))))

Run Code Online (Sandbox Code Playgroud)

我按如下方式编译并运行它:

sbcl --script image-perf.lisp

Run Code Online (Sandbox Code Playgroud)

典型的运行时间是0.276.不错,但我希望它好多了.当然,练习的重点是Lisp代码更短,但是有没有人知道如何快速或更快地使用它？

Answer 1

Jos*_*lor 9

SBCL提供了大量信息

当我保存你的代码(并将最终的let示例包装到一个单独的函数中),并用SBCL编译它时,我实际上得到了一堆诊断输出,告诉我们一些可以生成更好代码的地方.有很多,但是在这里略读,虽然它都在测试中,所以它可能有用也可能没用.但是,由于测试代码可能会降低速度,因此值得一试.

CL-USER> (compile-file ".../compile.lisp")
; compiling file ".../compile.lisp" (written 25 JAN 2016 01:53:23 PM):
; compiling (DECLAIM (OPTIMIZE SPEED ...))
; compiling (DEFUN IMAGE-ADD ...)
; compiling (DEFPARAMETER HORZ ...)
; compiling (DEFPARAMETER VERT ...)
; compiling (DEFPARAMETER PERF-REPS ...)
; compiling (DEFUN TEST ...)

; file: /home/taylorj/tmp/compile.lisp
; in: DEFUN TEST
;     (* HORZ VERT)
; 
; note: unable to
;   convert x*2^k to shift
; due to type uncertainty:
;   The first argument is a NUMBER, not a INTEGER.
;   The second argument is a NUMBER, not a INTEGER.
; 
; note: unable to
;   convert x*2^k to shift
; due to type uncertainty:
;   The first argument is a NUMBER, not a INTEGER.
;   The second argument is a NUMBER, not a INTEGER.

;     (DOTIMES (_ PERF-REPS) (IMAGE-ADD TO FM 42.0))
; --> DO BLOCK LET TAGBODY UNLESS IF >= IF 
; ==>
;   (< SB-C::X SB-C::Y)
; 
; note: forced to do GENERIC-< (cost 10)
;       unable to do inline fixnum comparison (cost 4) because:
;       The first argument is a UNSIGNED-BYTE, not a FIXNUM.
;       The second argument is a INTEGER, not a FIXNUM.

; --> DO BLOCK LET TAGBODY PSETQ PSETF LET* MULTIPLE-VALUE-BIND LET 1+ 
; ==>
;   (+ _ 1)
; 
; note: forced to do GENERIC-+ (cost 10)
;       unable to do inline fixnum arithmetic (cost 1) because:
;       The first argument is a UNSIGNED-BYTE, not a FIXNUM.
;       The result is a (VALUES (INTEGER 1) &OPTIONAL), not a (VALUES FIXNUM
;                                                                     &REST T).
;       unable to do inline fixnum arithmetic (cost 2) because:
;       The first argument is a UNSIGNED-BYTE, not a FIXNUM.
;       The result is a (VALUES (INTEGER 1) &OPTIONAL), not a (VALUES FIXNUM
;                                                                     &REST T).
;       etc.

;     (* HORZ VERT)
; 
; note: forced to do GENERIC-* (cost 30)
;       unable to do inline fixnum arithmetic (cost 4) because:
;       The first argument is a NUMBER, not a FIXNUM.
;       The second argument is a NUMBER, not a FIXNUM.
;       unable to do inline (signed-byte 64) arithmetic (cost 5) because:
;       The first argument is a NUMBER, not a (SIGNED-BYTE 64).
;       The second argument is a NUMBER, not a (SIGNED-BYTE 64).
;       etc.
; 
; note: forced to do GENERIC-* (cost 30)
;       unable to do inline fixnum arithmetic (cost 4) because:
;       The first argument is a NUMBER, not a FIXNUM.
;       The second argument is a NUMBER, not a FIXNUM.
;       unable to do inline (signed-byte 64) arithmetic (cost 5) because:
;       The first argument is a NUMBER, not a (SIGNED-BYTE 64).
;       The second argument is a NUMBER, not a (SIGNED-BYTE 64).
;       etc.
; 
; compilation unit finished
;   printed 6 notes

; .../compile.fasl written
; compilation finished in 0:00:00.009

Run Code Online (Sandbox Code Playgroud)

这些东西都在测试中,但是既然你正在计算你的dotimes循环,那么至少可以修正那个比较是有意义的.这是带有声明的测试代码,可以使dotimes更快一些:

(defun test ()
  (declare (type fixnum HORZ VERT PERF-REPS))
  (let ((to (make-array (* HORZ VERT) :element-type 'single-float))
        (fm (make-array (* HORZ VERT) :element-type 'single-float)))
    (time (dotimes (_ PERF-REPS)
            (image-add to fm 42.0)))))

Run Code Online (Sandbox Code Playgroud)

之后,您可能希望查看可能的循环展开,缓存数组维度以及考虑数组的内存位置.不过,这些都是非常通用的提示.我不确定具体的东西可以在这里提供更多帮助.

关于"一定要使用优化和安全"的标准答案

编辑:拍,我错过了顶级朗诵这确实参考速度和安全性.仍然值得检查它们是否具有预期的效果,但如果它们是,那么这个答案大多是多余的.

在大多数情况下,允许编译器完全忽略声明.(唯一的例外是特殊声明,它改变了变量的绑定语义.)因此编译器对它们的作用取决于它.类似的声明类型至少可以通过两种不同的方式来使用.如果您正在尝试编译非常安全的代码,请键入声明,让编译器知道可以添加其他检查.当然,这会导致代码变慢,但会更安全.另一方面,如果您尝试生成非常快的代码,那么编译器可以将这些类型声明作为您保证值始终为正确类型的保证,从而生成更快的代码.

看起来你只是添加了类型声明.如果您想要更快(或更安全)的代码,您还需要添加声明来说明这一点.在这种情况下,您可能需要(声明(优化(速度3)(安全0))).例如,看一下只是一个返回其fixnum参数的简单函数的一些解组.首先,只是声明类型,代码最终为18个字节:

(defun int-identity (x)
  (declare (type fixnum x))
  x)

INT-IDENTITY
CL-USER> (disassemble 'int-identity)
; disassembly for INT-IDENTITY
; Size: 18 bytes. Origin: #x100470619A
; 9A:       488BE5           MOV RSP, RBP                     ; no-arg-parsing entry point
; 9D:       F8               CLC
; 9E:       5D               POP RBP
; 9F:       C3               RET
; A0:       CC0A             BREAK 10                         ; error trap
; A2:       02               BYTE #X02
; A3:       19               BYTE #X19                        ; INVALID-ARG-COUNT-ERROR
; A4:       9A               BYTE #X9A                        ; RCX
; A5:       CC0A             BREAK 10                         ; error trap
; A7:       04               BYTE #X04
; A8:       08               BYTE #X08                        ; OBJECT-NOT-FIXNUM-ERROR
; A9:       FE1B01           BYTE #XFE, #X1B, #X01            ; RDX
NIL

Run Code Online (Sandbox Code Playgroud)

现在,如果我们还添加了速度优化,代码大小变了一点点.(但这并不一定是坏事.一些速度优化,例如循环展开或函数内联,将产生更大的代码.)

CL-USER> 
(defun int-identity (x)
  (declare (type fixnum x)
           (optimize (speed 3)))
  x)

STYLE-WARNING: redefining COMMON-LISP-USER::INT-IDENTITY in DEFUN
INT-IDENTITY
CL-USER> (disassemble 'int-identity)
; disassembly for INT-IDENTITY
; Size: 20 bytes. Origin: #x1004A5D23D
; 3D:       488BE5           MOV RSP, RBP                     ; no-arg-parsing entry point
; 40:       F8               CLC
; 41:       5D               POP RBP
; 42:       C3               RET
; 43:       CC0A             BREAK 10                         ; error trap
; 45:       04               BYTE #X04
; 46:       19               BYTE #X19                        ; INVALID-ARG-COUNT-ERROR
; 47:       FE9A01           BYTE #XFE, #X9A, #X01            ; RBX
; 4A:       CC0A             BREAK 10                         ; error trap
; 4C:       04               BYTE #X04
; 4D:       08               BYTE #X08                        ; OBJECT-NOT-FIXNUM-ERROR
; 4E:       FE1B01           BYTE #XFE, #X1B, #X01            ; RDX
NIL

Run Code Online (Sandbox Code Playgroud)

最后,当我们删除安全性时,我们最终得到一些非常短的代码,只有9个字节:

CL-USER> 
(defun int-identity (x)
  (declare (type fixnum x)
           (optimize (speed 3)
                     (safety 0)))
  x)

STYLE-WARNING: redefining COMMON-LISP-USER::INT-IDENTITY in DEFUN
INT-IDENTITY
CL-USER> (disassemble 'int-identity)
; disassembly for INT-IDENTITY
; Size: 9 bytes. Origin: #x1004AFF3E2
; 2:       488BD1           MOV RDX, RCX                      ; no-arg-parsing entry point
; 5:       488BE5           MOV RSP, RBP
; 8:       F8               CLC
; 9:       5D               POP RBP
; A:       C3               RET

Run Code Online (Sandbox Code Playgroud)

Answer 2

cor*_*ump 9

作为参考,这里是一些稍微修改过的版本的结果.

C版

C版本平均为0.197s.

Lisp版本

(declaim (optimize (speed 3) (debug 0) (safety 0)))

(defconstant HORZ 800)
(defconstant VERT 800)
(defconstant PERF-REPS 1000)

(defun test ()
  (let ((target #1=(make-array (* HORZ VERT)
                               :element-type 'single-float
                               :initial-element 0f0))
        (source #1#))
    (declare (type (simple-array single-float (*)) target source))
    (time 
      (dotimes (_ PERF-REPS)
        (map-into target
                  (lambda (x)
                    (declare (single-float x))
                    (the single-float (+ x 42f0)))
                  source)))))

Run Code Online (Sandbox Code Playgroud)

这是输出:

Evaluation took:                                                                                                 
  0.372 seconds of real time                                                                                     
  0.372024 seconds of total run time (0.368023 user, 0.004001 system)                                            
  100.00% CPU                                                                                                    
  965,075,988 processor cycles                                                                                   
  0 bytes consed

Run Code Online (Sandbox Code Playgroud)

替换map-into为lparallel:pmap-into,最短的时间是由4名工人组成的内核获得,并给出:

Evaluation took:                                                                                                 
 0.122 seconds of real time                                                                                     
 0.496031 seconds of total run time (0.492030 user, 0.004001 system)                                            
 406.56% CPU                                                                                                    
 316,445,789 processor cycles                                                                                   
 753,280 bytes consed

Run Code Online (Sandbox Code Playgroud)

请注意内存使用量的差异.

Answer 3

bpe*_*sek 6

这是一篇相当老的文章，但您可能仍然感兴趣。

如果我追求速度，我会这样做！

当我们比较 C 和 Common Lisp 的速度时，我们需要考虑 C 编译器在高优化设置下可以做什么。

由于带 -O3 的 C 会执行自动矢量化操作，而 SBCL 不会执行自动矢量化操作，因此我们在进行比较时必须小心。

正如我们在下面看到的那样，通过矢量化，SBCL 应该足够快。

请先将sb-simd安装到 ~/quicklisp/local-projects。

然后运行 (ql:quickload :sb-simd) 来构建它。

您需要 sbcl-2.1.11 才能运行最新的 sb-simd。

生成的矢量化 SBCL 可执行映像比同时进行矢量化和最有可能使用 -O3 编译器标志进行循环展开的 C 语言更快。

编辑：我刚刚检查了生成的程序集，它是矢量化的。使用 --march=native --mavx2 ，速度确实略有提高。

(declaim (optimize speed (safety 0)))
(ql:quickload :sb-simd :silent t)
(use-package :sb-simd)

(declaim (ftype (function (f32vec f32vec f32) null) image-add))
(defun image-add (to from val)
  (do-vectorized (i 0 (array-total-size to))
    (:unroll 2)
    (setf (f32-aref to i) (f32+ (f32-aref from i) val))))

(defparameter *HORZ* 800)
(defparameter *VERT* 800)

(defparameter *PERF-REPS* 1000)

(defun main ()
  (declare (type fixnum *HORZ* *VERT* *PERF-REPS*))  
  (let ((to   (make-array (* *HORZ* *VERT*) :element-type 'f32))
        (from (make-array (* *HORZ* *VERT*) :element-type 'f32)))
    (time (dotimes (_ *PERF-REPS*)
            (image-add to from 42.0)))))

(save-lisp-and-die "image-add" :executable t :toplevel #'main)

Run Code Online (Sandbox Code Playgroud)

编辑：在我的 i7-7700HQ CPU 上我得到：

$ gcc -O3 image-add.c ./tictoc/build/libtictoc.a
$ ./a.out 
Elapsed time 0.124956 seconds.

$ gcc -O3 -march=native -mavx2 image-add.c ./tictoc/build/libtictoc.a
$ ./a.out
Elapsed time 0.116989 seconds.

$ ./image-add 
Evaluation took:
  0.072 seconds of real time
  0.070440 seconds of total run time (0.070236 user, 0.000204 system)
  97.22% CPU
  199,079,678 processor cycles
  0 bytes consed

Run Code Online (Sandbox Code Playgroud)

虽然这是一篇相当老的帖子，但我认为你@Red.Cabin 可能仍然感兴趣。 (2认同)

Answer 4

Reb*_*bin 5

使用:lparallel@coredump 的建议，我能够获得 0.125 秒的一致运行，明显快于 gcc -O3 的 0.175。编译文件和内联函数的注释中建议的各种技术image-add并没有产生明显的加速。这是迄今为止最快的代码。

(load "~/quicklisp/setup.lisp")
(ql:quickload :lparallel)

(declaim (optimize (speed 3) (safety 0)))

(defparameter HORZ 800)
(defparameter VERT 800)

(defparameter PERF-REPS 1000)

(setf lparallel:*kernel* (lparallel:make-kernel 4))

(defun test ()
  (declare (type fixnum HORZ VERT PERF-REPS))
  (let ((to (make-array (* HORZ VERT) :element-type 'single-float))
        (fm (make-array (* HORZ VERT) :element-type 'single-float)))
    (time (dotimes (_ PERF-REPS)
            (lparallel:pmap-into to (lambda (x) (+ x 42f0)) fm)))))

(test)

Run Code Online (Sandbox Code Playgroud)

编辑：我会注意到这并不完全公平：我向 Lisp 代码添加了显式并行性，但没有向 C 代码添加了显式并行性。然而，值得注意的是，使用 Lisp 来做到这一点是多么容易。因为在这种情况下 Lisp 相对于 C 的主要优点是代码简洁以及添加并行性等功能相对容易，所以这种权衡是朝着正确的方向（说明 Lisp 的相对灵活性）。我怀疑并行 C 代码（如果我能抽出时间来实现它）将再次比 Lisp 代码更快。

归档时间：	10 年前
查看次数：	3000 次
最近记录：	10 年前