Why does this Clojure code run out of memory?

use*_*464 3 clojure

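I have a sorted text file with 20 million lines. It contains many duplicate lines. I have some Clojure code that counts the number of instances of each unique line, i.e. it produces output like: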

alpha 20
beta 17
gamma 3
delta 4
...

The code works on smaller files, but on larger ones it runs out of memory. What am I doing wrong? I assume that somewhere I am holding on to the head.

(require '[clojure.java.io :as io])

(def bi-grams (line-seq (io/reader "the-big-input-file.txt")))

(defn quick-process [input-list filename]
  (with-open [out (io/writer filename)] ;; e.g. "train/2gram-freq.txt"
    (binding [*out* out]
      ;; each group is a run of identical adjacent lines:
      ;; print the line once, followed by the size of its run
      (dorun (map (fn [group] (println (first group) "\t" (count group)))
                  (partition-by identity input-list))))))

(quick-process bi-grams "output.txt")

man*_*nge 7

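Your bi-grams var is holding on to the head of the line-seq. Because the top-level def keeps a reference to the first element, every line that has already been realized stays reachable and cannot be garbage collected.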

Try (quick-process (line-seq (io/reader "the-big-input-file.txt")) "output.txt") instead.
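
A minimal sketch of the same idea, assuming the goal is to count runs of identical adjacent lines. The function name count-runs is hypothetical and not from the original post. It creates the line-seq inside the function, so no var refers to its head, and it opens the reader in with-open so the input file is closed as well:

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper (not part of the original question):
;; streams in-file and writes "line<TAB>count" for each run of
;; identical adjacent lines to out-file.
(defn count-runs [in-file out-file]
  (with-open [in  (io/reader in-file)
              out (io/writer out-file)]
    ;; The lazy seq is passed straight to partition-by and consumed by
    ;; doseq; nothing else refers to its head, so lines that have been
    ;; processed can be garbage collected.
    (doseq [group (partition-by identity (line-seq in))]
      (.write out (str (first group) "\t" (count group) "\n")))))
```

Called as (count-runs "the-big-input-file.txt" "output.txt"), this should only need to keep one run of identical lines in memory at a time.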

  • Holding on to the head of a seq in Clojure literally makes you hold your head (5 upvotes)