Clojure: getting "OutOfMemoryError Java heap space" when parsing a large log file

Jac*_*Liu 3 clojure out-of-memory

Hi all,
I want to parse a large log file with Clojure.
Each line is one record with the structure "UserID,Latitude,Longitude,Timestamp".
My implementation steps are:
----> Read the log file and build the top-n user list
----> Find each top-n user's records and store them in a separate log file (UserID.log).

Source code:

;======================================================
;; Requires assumed by this snippet; MAIN-PATH (the output
;; directory for the per-user log files) is defined elsewhere.
(require '[clojure.java.io :as io]
         '[clojure.string :as string])
(defn parse-file
  "Count records per user, sort users by record count, then write the
  top-n users' records to separate files."
  [file n]
  (with-open [rdr (io/reader file)]
    (println "001 begin with open ")
    (let [lines  (line-seq rdr)
          res    (parse-recur lines)
          ;; sort user ids by descending record count
          sorted (into (sorted-map-by (fn [key1 key2]
                                        (compare [(get res key2) key2]
                                                 [(get res key1) key1])))
                       res)]
      (println "Statistic result : " res)
      (println "Top-N User List : " sorted)
      (find-write-recur lines sorted n))))

(defn parse-recur
  "Walk the line seq and build a map of UserID -> record count."
  [lines]
  (loop [ls  lines
         res {}]
    (if ls
      (recur (next ls)
             (update-res res (first ls)))
      res)))

(defn update-res
  "Increment the record count for the user id found on this line."
  [res line]
  (let [params (string/split line #",")
        id     (if (> (count params) 1) (params 0) "0")]
    (if (res id)
      (update-in res [id] inc)
      (assoc res id 1))))

(defn find-write-recur
  "Get each user's records and store them in a separate log file."
  [lines sorted n]
  (loop [x  n
         sd sorted
         id (first (keys sd))]
    (when (and (> x 0) sd)
      (create-write-file id
                         (find-recur lines id))
      (recur (dec x)
             (rest sd)
             ;; `second` returns nil when fewer than two keys remain,
             ;; where (nth (keys sd) 1) would throw
             (second (keys sd))))))

(defn find-recur
  "Collect every line that belongs to the given user id."
  [lines id]
  (loop [ls  lines
         res []]
    (if ls
      (recur (next ls)
             (update-vec res id (first ls)))
      res)))

(defn update-vec
  "Append the line to res if its user id matches id, else return res unchanged."
  [res id line]
  (let [params (string/split line #",")
        id_    (if (> (count params) 1) (params 0) "0")]
    (if (= id id_)
      (conj res line)
      res)))

(defn create-write-file
  "Create a new file and write information into the file."
  ([file info-lines]
   (with-open [wr (io/writer (str MAIN-PATH file))]
     (doseq [line info-lines] (.write wr (str line "\n")))))
  ([file info-lines append?]
   (with-open [wr (io/writer (str MAIN-PATH file) :append append?)]
     (doseq [line info-lines] (.write wr (str line "\n"))))))
;======================================================

I tested this clj in the REPL with (parse-file "./DATA/log.log" 3) and got these results:

Records ----- Size ----- Time ------ Result
1,000 ------- 42KB ----- <1s ------- OK
10,000 ------ 420KB ---- <1s ------- OK
100,000 ----- 4.3MB ---- 3s -------- OK
1,000,000 --- 43MB ----- 15s ------- OK
6,000,000 --- 258MB ---- >20min ---- "OutOfMemoryError Java heap space java.lang.String.substring(String.java:1913)"

======================================================
Here are my questions:
1. How can I fix the error when parsing a big log file, e.g. larger than 200MB?
2. How can I optimize the functions to run faster?
3. How should the function handle logs larger than 1GB?

I am still new to Clojure; any suggestions or solutions will be appreciated~
Thanks

小智 7

As a direct answer to your questions, from a bit of Clojure experience:

  1. The quick and dirty fix for running out of memory comes down to giving the JVM more memory. You can try adding this to your project.clj:

    :jvm-opts ["-Xmx1G"] ;; or more
    

    That will make Leiningen launch the JVM with a higher memory cap.
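
    For context, here is a minimal sketch of a full project.clj with that option in place (the project name, version, and dependency version are placeholders, not from your question):

    (defproject log-parser "0.1.0-SNAPSHOT"   ; hypothetical project name
      :dependencies [[org.clojure/clojure "1.6.0"]]
      ;; -Xmx raises the heap ceiling; -Xms presizes the heap so the
      ;; JVM does not have to grow it incrementally while parsing.
      :jvm-opts ["-Xms512m" "-Xmx1G"])

    You can also pass the same flag when running without Leiningen, e.g. java -Xmx1G -cp ... , since it is a plain JVM option rather than anything Leiningen-specific.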

  2. No matter how you do it, this kind of job is going to use a lot of memory. @Vidya's suggestion to use a library is definitely worth considering. However, there is one optimization you can make that should help.

    Whenever you work with your (line-seq ...) object (a lazy sequence), you should make sure it stays lazy. Calling next on it forces more of the sequence to be realized, because next has to look ahead to decide whether the seq is empty; rest does not, so use rest instead. Take a look at the clojure site, especially the section on laziness:

    (rest aseq) - returns a possibly empty seq, never nil

    [snip]

    (possibly) delays the path to the remaining items (if any)

    You may even want to traverse the log twice - once to pull only each line's user id as a lazy seq, and again to filter out those users' records; see the sketch just below. This minimizes how much of the file you hold onto at any one time.
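
    A minimal sketch of that two-pass idea, assuming the "UserID,Latitude,Longitude,Timestamp" line format from your question (count-users, write-user-log, and out-path are hypothetical names, not part of your code):

    (require '[clojure.java.io :as io]
             '[clojure.string :as string])

    (defn count-users
      "Pass 1: stream the file once, counting records per user id.
      reduce consumes the line seq step by step without holding its head."
      [file]
      (with-open [rdr (io/reader file)]
        (reduce (fn [counts line]
                  (let [id (first (string/split line #","))]
                    (update-in counts [id] (fnil inc 0))))
                {}
                (line-seq rdr))))

    (defn write-user-log
      "Pass 2: stream the file again, writing only the lines that
      belong to id; nothing but the current line is retained."
      [file id out-path]
      (with-open [rdr (io/reader file)
                  wr  (io/writer out-path)]
        (doseq [line (line-seq rdr)
                :when (= id (first (string/split line #",")))]
          (.write wr (str line "\n")))))

    Because each pass re-reads the file from disk and consumes the lines as they arrive, peak memory stays roughly constant regardless of how big the log grows.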

  3. Making sure your functions are lazy should cut the sheer overhead of materializing the file as an in-memory sequence. Whether that alone is enough to parse a 1G file, I don't think I can say.