Should recursive file-system algorithms be approached imperatively?

Question

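I just finished reading "Programming Concurrency on the JVM" by Venkat Subramaniam. As one of his examples, the author totals the file sizes in a directory tree. He shows implementations that use no concurrency, queues, a latch, and Scala actors. On my system, all of the concurrent implementations (queues, latch, and Scala actors) run in under 9 seconds when traversing my /usr directory (OS X 10.6.8, Core Duo 2 GHz, Intel G1 SSD 160 GB).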

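I'm learning Clojure and decided to port the Scala actor version to Clojure using agents. Unfortunately, I was averaging 11-12 seconds, which is significantly slower than the others. After spending DAYS pulling my hair out, I found that the following code was the culprit (processFile is a function that I send to the file-processing agents):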

(defn processFile
  [fileProcessor collectorAgent ^String fileName]
  (let [^File file-obj (File. ^String fileName)
        fileTotals (transient {:files 0, :bytes 0})]
    (cond
      (.isDirectory file-obj)
        (do
          (doseq [^File dir (.listFiles file-obj) :when (.isDirectory dir)]
            (send collectorAgent addFileToProcess (.getPath dir)))
          (send collectorAgent tallyResult *agent*)
          (reduce (fn [currentTotal newItem] (assoc! currentTotal :files (inc (:files currentTotal))
                                                                  :bytes (+ (:bytes currentTotal) newItem)))
                  fileTotals
                  (map #(.length ^File %) (filter #(.isFile ^File %) (.listFiles file-obj))))
          (persistent! fileTotals))

      (.isFile file-obj) (do (send collectorAgent tallyResult *agent*) {:files 1, :bytes (.length file-obj)}))))

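You'll notice that I tried using type hints and a transient to improve performance, but to no avail. I replaced the code above with the following: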

(defn processChildren
  [children]
  (loop [entries children files 0 bytes 0 dirs '()]
    (let [^File child (first entries)]
      (cond
        (not (seq entries)) {:files files, :bytes bytes, :dirs dirs}
        (.isFile child) (recur (rest entries) (inc files) (+ bytes (.length child)) dirs)
        (.isDirectory child) (recur (rest entries) files bytes (conj dirs child))
        :else (recur (rest entries) files bytes dirs)))))

(defn processFile
  [fileProcessor collectorAgent ^String fileName]
  (let [{files :files, bytes :bytes, dirs :dirs} (processChildren (.listFiles (File. fileName)))]
    (doseq [^File dir dirs]
      (send collectorAgent addFileToProcess (.getPath dir)))
    (send collectorAgent tallyResult *agent*)
    {:files files, :bytes bytes}))

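This version performs on par with the Scala version, if not faster, and uses almost exactly the same algorithm as the Scala version. I had simply assumed that a functional approach to the algorithm would work just as well.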

So... this long-winded question boils down to the following: why is the second version faster?

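My hypothesis is that although the first version, which uses map/filter/reduce over the directory contents, is more "functional" than the second version's fairly imperative processing of the directory, it is much less efficient because the directory contents are iterated over multiple times. Since file-system I/O is slow, the whole program suffers.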

Assuming I'm right, is it safe to say that any recursive file-system algorithm should prefer an imperative approach for performance?
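For comparison, here is a minimal sketch of a single-pass version that stays functional. The first version above calls .listFiles twice (once in the doseq, once in the filter), whereas a single reduce can classify each entry once without loop/recur. The name summarize-children is hypothetical and not part of the original code, and this sketch makes no claim about which factor actually dominates the timings.

```clojure
(import 'java.io.File)

;; Hypothetical helper (not in the original code): summarize one directory
;; listing in a single pass, returning the same shape as processChildren.
(defn summarize-children
  [children]
  (reduce (fn [acc ^File child]
            (cond
              ;; Regular file: count it and add its length.
              (.isFile child)      (-> acc
                                       (update-in [:files] inc)
                                       (update-in [:bytes] + (.length child)))
              ;; Directory: remember it so the caller can queue it up.
              (.isDirectory child) (update-in acc [:dirs] conj child)
              ;; Anything else is ignored.
              :else                acc))
          {:files 0, :bytes 0, :dirs []}
          children))

;; Usage: .listFiles is called exactly once per directory.
(summarize-children (.listFiles (File. (System/getProperty "user.dir"))))
```

Because .listFiles returns nil for a path that is not a directory, and reduce over nil returns the initial value, the sketch yields zero totals in that case.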

I'm a total beginner with Clojure, so feel free to rip my code to shreds if I'm doing something stupid or non-idiomatic.

Answer 1

ama*_*loy 4

I edited the first version to make it more readable. I have a few comments, but nothing conclusively helpful:

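You added transients and type hints with no real evidence of what was slowing things down. It is entirely possible to slow things down significantly by applying these carelessly, so it is a good idea to profile and find out what is actually slow. Your choices seem reasonable, but I removed the type hints that plainly had no effect (for example, the compiler does not need a hint to know that (File. ...) yields a File object).
Clojure (indeed, any lisp) strongly prefers some-agent to someAgent. Prefix syntax means there is no worry that - will be parsed as subtraction by a clueless compiler, so we can afford the more nicely spaced names.
You include calls to a bunch of functions that are not defined here at all, such as tallyResult and addFileToProcess. Presumably they perform fine, since you use them in the fast version, but by leaving them out you make it hard for anyone else to poke at the code and see what speeds it up.
For I/O-bound operations, consider using send-off instead of send: send uses a bounded thread pool to avoid swamping your processor. Here it probably does not matter, since you are using only one agent and it serializes its actions, but in the future you will run into cases where it does.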
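The send versus send-off point can be sketched as follows. This is a minimal illustration, not the asker's program: size-agent and add-file-size are hypothetical names, and the temporary file exists only so the example runs on its own.

```clojure
(import 'java.io.File)

;; Hypothetical agent holding running totals (illustration only).
(def size-agent (agent {:files 0, :bytes 0}))

(defn add-file-size
  "Agent action: reads a file's length (blocking I/O) and adds it to the totals."
  [totals ^String path]
  (let [f (File. path)]
    (if (.isFile f)
      (-> totals
          (update-in [:files] inc)
          (update-in [:bytes] + (.length f)))
      totals)))

;; send-off runs the action on the growable pool intended for blocking work;
;; send would run it on the fixed-size pool intended for CPU-bound work.
(let [tmp (File/createTempFile "send-off-sketch" ".txt")]
  (spit tmp "hello")
  (send-off size-agent add-file-size (.getPath tmp))
  (await size-agent)            ; block until the queued action has run
  (println @size-agent)
  (.delete tmp))
```

Swapping send-off for send changes only which thread pool executes the action; the agent still applies its actions one at a time.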

Anyway, as promised, here is a more legible rewrite of your first version:

(defn process-file
  [_ collector-agent ^String file-name]
  (let [file-obj (File. file-name)
        file-totals (transient {:files 0, :bytes 0})]
    (cond (.isDirectory file-obj)
          (do
            (doseq [^File dir (.listFiles file-obj)
                    :when (.isDirectory dir)]
              (send collector-agent addFileToProcess
                    (.getPath dir)))
            (send collector-agent tallyResult *agent*)
            (reduce (fn [current-total new-item]
                      (assoc! current-total
                              :files (inc (:files current-total))
                              :bytes (+ (:bytes current-total) new-item)))
                    file-totals
                    (map #(.length ^File %)
                         (filter #(.isFile ^File %)
                                 (.listFiles file-obj))))
            (persistent! file-totals))

          (.isFile file-obj)
          (do (send collector-agent tallyResult *agent*)
              {:files 1, :bytes (.length file-obj)}))))

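Edit: You are using transients incorrectly, throwing away the result of the reduce. (assoc! m k v) is allowed to modify and return the m object, but it may return a different object if that is more convenient or efficient. So you need something more like (persistent! (reduce ...)).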
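A minimal sketch of the pattern described in the edit above, using plain numbers in place of file lengths so that it runs on its own. The name tally-sizes is hypothetical and not part of the original code.

```clojure
;; Hypothetical helper (illustration only): tally a sequence of sizes.
(defn tally-sizes
  [sizes]
  (persistent!
    ;; The value returned by reduce is what gets made persistent;
    ;; the initial transient is never reused after being handed to reduce.
    (reduce (fn [totals size]
              ;; Always continue with whatever assoc! returns.
              (assoc! totals
                      :files (inc (:files totals))
                      :bytes (+ (:bytes totals) size)))
            (transient {:files 0, :bytes 0})
            sizes)))

(tally-sizes [10 20 30])
;; => {:files 3, :bytes 60}
```

In the directory case, sizes would be the (map #(.length ^File %) ...) sequence from the code above.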

Archived: 14 years, 2 months ago
Views: 261
Last activity: 13 years, 1 month ago