Clojure: reading a large text file and counting occurrences

Rob*_*ler 4 clojure

I am trying to read a large text file and count the occurrences of specific errors. For example, for the following sample text

something
bla
error123
foo
test
error123
line
junk
error55
more
stuff

I would like to end up with the following (I don't care which data structure is used, although I am thinking of a map)

error123 - 2
error55 - 1

This is what I have tried so far

(require '[clojure.java.io :as io])

(defn find-error [line]
  (if (re-find #"error" line)
    line))

(defn read-big-file [func, filename]
  (with-open [rdr (io/reader filename)]
    (doall (map func (line-seq rdr)))))

Calling it like this

(read-big-file find-error "sample.txt")

yields:

(nil nil "error123" nil nil "error123" nil nil "error55" nil nil)

Next, I tried removing the nil values and grouping the items

(group-by identity (remove #(= nil %) (read-big-file find-error "sample.txt")))

which returns

{"error123" ["error123" "error123"], "error55" ["error55"]}

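This is close to the desired output, although it may not be efficient. How can I now get the counts? Also, as someone new to Clojure and functional programming, I would appreciate any suggestions on how to improve this. Thanks!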

man*_*nge 7

I think you might be looking for the frequencies function:

user=> (doc frequencies)
-------------------------
clojure.core/frequencies
([coll])
  Returns a map from distinct items in coll to the number of times
  they appear.
nil

So, this should give you what you want:

(frequencies (remove nil? (read-big-file find-error "sample.txt")))
;;=> {"error123" 2, "error55" 1}
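
If you would rather keep the group-by result from the question, the counts can also be derived from it by replacing each vector with its size. A minimal sketch, assuming the grouped map shown in the question; `grouped` and `group-counts` are names made up for this illustration:

```clojure
;; The map that group-by returned in the question.
(def grouped
  {"error123" ["error123" "error123"], "error55" ["error55"]})

;; Hypothetical helper: turn each [key items] entry into [key number-of-items].
(defn group-counts [groups]
  (into {} (map (fn [[k vs]] [k (count vs)]) groups)))

(group-counts grouped)
;;=> {"error123" 2, "error55" 1}
```

This does the same job as frequencies, only in two steps, so frequencies is still the shorter route.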

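However, if your text file is really large, I would recommend doing the counting directly on the line-seq, inside with-open, to make sure you don't run out of memory: read-big-file uses doall, so it builds a list with one entry per line of the file before anything is counted. This way you can also use filter rather than map and remove.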

(defn count-lines [pred, filename]
  (with-open [rdr (io/reader filename)]
    (frequencies (filter pred (line-seq rdr)))))

(defn is-error-line? [line]
  (re-find #"error" line))

(count-lines is-error-line? "sample.txt")
;; => {"error123" 2, "error55" 1}
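
Note that count-lines uses the whole line as the key, which works for the sample because each error sits alone on its line. If real log lines carry extra text around the error code, one possible variation is to count the matched text instead. This is only a sketch under assumptions: the pattern `error\d+`, the helper name `count-matches`, and the in-memory sample are all made up for illustration, and only the first match on each line is counted.

```clojure
;; Hypothetical helper: counts each distinct match of `pattern`,
;; taking at most one match per line. `rdr` must be a java.io.BufferedReader.
(defn count-matches [pattern rdr]
  ;; keep drops the nil results for lines that do not match;
  ;; frequencies is eager, so every line is read before this returns.
  (frequencies (keep #(re-find pattern %) (line-seq rdr))))

;; An in-memory reader stands in for the file here;
;; for a real file use (io/reader filename) instead.
(with-open [rdr (java.io.BufferedReader.
                 (java.io.StringReader.
                  "x error123 y\nfoo\nerror55\nerror123 again\n"))]
  (count-matches #"error\d+" rdr))
;;=> {"error123" 2, "error55" 1}
```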