How can I read/parse the following text using Clojure?

chu*_*nsj 1 parsing clojure

The text is structured like this:

Tag001
 0.1, 0.2, 0.3, 0.4
 0.5, 0.6, 0.7, 0.8
 ...
Tag002
 1.1, 1.2, 1.3, 1.4
 1.5, 1.6, 1.7, 1.8
 ...

The file can contain any number of TagXXX entries, and each tag can have any number of lines of CSV values.

==== PPPS. (Sorry about all these addenda :-)

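Further improvement: 31,842 lines of data now take about 1 second on my Atom notebook, which is 7 times faster than the original code. However, the C version is still 20 times faster than this.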

(defn add-parsed-code [accu code]
  (if (empty? code)
    accu
    (conj accu code)))

(defn add-values [code comps]
  (let [values comps
        old-values (:values code)
        new-values (if old-values
                     (conj old-values values)
                     [values])]
    (assoc code :values new-values)))

(defn read-line-components [file]
  (map (fn [line] (clojure.string/split line #","))
       (with-open [rdr (clojure.java.io/reader file)]
         (doall (line-seq rdr)))))

(defn parse-file [file]
  (let [line-comps (read-line-components file)]
    (loop [line-comps line-comps
           accu []
           curr {}]
      (if line-comps
        (let [comps (first line-comps)]
          (if (= (count comps) 1) ;; code line?
            (recur (next line-comps)
                   (add-parsed-code accu curr)
                   {:code (first comps)})
            (recur (next line-comps)
                   accu
                   (add-values curr comps))))
        (add-parsed-code accu curr)))))

==== PPS.

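I don't understand why the first version is 10 times faster than the second, but map and with-open do make the reading faster than slurp. The total read/processing time has not dropped much, though (from 7 seconds to 6 seconds).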

(time
 (let [lines (map (fn [line] line)
                  (with-open [rdr (clojure.java.io/reader
                                   "DATA.txt")]
                    (doall (line-seq rdr))))]
   (println (last lines))))

(time (let [lines
            (clojure.string/split-lines
             (slurp "DATA.txt"))]
        (println (last lines))))
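Both timings above materialize every line before any parsing happens. For comparison, here is a minimal sketch that parses while reading, so the full list of lines is never held at once. It assumes the same layout (a line without a comma starts a new record) and the names step and parse-stream are made up for illustration. Unlike the code above, a record with no value lines gets an empty :values vector. Whether this is actually faster on the real data would need measuring.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Hypothetical reducing step: a line without a comma starts a new record,
;; any other line is appended to the :values of the most recent record.
(defn step [accu line]
  (let [comps (str/split line #",")]
    (cond
      (= (count comps) 1) (conj accu {:code (first comps) :values []})
      ;; value line before any code line: keep it in a record without :code
      (empty? accu)       [{:values [comps]}]
      :else               (update-in accu [(dec (count accu)) :values] conj comps))))

(defn parse-stream [file]
  (with-open [rdr (io/reader file)]
    ;; reduce is eager, so the reader is fully consumed before it is closed
    (reduce step [] (line-seq rdr))))
```

It could be timed the same way as the two snippets above, e.g. (time (count (parse-stream "DATA.txt"))).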

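==== PS. skuro's solution does work. But parsing is not that fast, so I have to use a C-based parser for reading the files and building the DB, and use Clojure only for the statistical analysis part. The C parser reads 400 files in 1~3 seconds, whereas Clojure takes 1~4 seconds for a single file (yes, the files are rather large).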

sku*_*uro 5

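The following parses the file above, keeping each line of values separate. If that is not what you want, you can change the add-values function. The parsing state is held in the curr variable, while accu holds the previously parsed tags (i.e. all the lines that appeared before a "TagXXX" was found). It allows values without a tag: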

Update: the side effects are now encapsulated in a dedicated load-file function

(defn tag? [line]
  (re-matches #"Tag[0-9]*" line))

; potentially unsafe, you might want to change this:
(defn parse-values [line]
  (read-string (str "[" line "]")))

(defn add-parsed-tag [accu tag]
  (if (empty? tag)
      accu
      (conj accu tag)))

(defn add-values [tag line]
  (let [values (parse-values line)
        old-values (:values tag)
        new-values (if old-values
                       (conj old-values values)
                       [values])]
    (assoc tag :values new-values)))

; note: this shadows clojure.core/load-file in the current namespace
(defn load-file [path]
  (slurp path))

(defn parse-file [file]
  (let [lines (clojure.string/split-lines file)]
    (loop [lines lines ; remaining lines 
           accu []     ; already parsed tags
           curr {}]    ; current tag being parsed
          (if lines
              (let [line (first lines)]
                (if (tag? line)
                    ; we recur after starting a new tag
                    ; if curr is empty we don't add it to the accu (e.g. first iteration)
                    (recur (next lines)
                           (add-parsed-tag accu curr)
                           {:tag line})
                    ; we're parsing values for the current tag
                    (recur (next lines)
                           accu
                           (add-values curr line))))
              ; if we were parsing a tag, we need to add it to the final result
              (add-parsed-tag accu curr)))))
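The parse-values above is flagged as potentially unsafe because clojure.core/read-string can evaluate code embedded in its input (via the #= reader macro when *read-eval* is enabled). One possible replacement, sketched here under the assumption that every field is a plain decimal number, splits on commas and parses each field with Double/parseDouble. The name parse-values-safe is made up for illustration.

```clojure
(require '[clojure.string :as str])

;; Hypothetical drop-in for parse-values: no reader involved, so nothing in
;; the input can be evaluated. Throws NumberFormatException on a field that
;; is not a number.
(defn parse-values-safe [line]
  (->> (str/split line #",")
       (map str/trim)
       (remove str/blank?)
       (mapv #(Double/parseDouble ^String %))))

(parse-values-safe " 0.1, 0.2, 0.3, 0.4")
;; => [0.1 0.2 0.3 0.4]
```

If the values may be arbitrary Clojure literals rather than just numbers, clojure.edn/read-string is another option that reads data without evaluating it.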

I'm not too excited about the code above, but it gets the job done. Given a file like the following:

Tag001
 0.1, 0.2, 0.3, 0.4
 0.5, 0.6, 0.7, 0.8
Tag002
 1.1, 1.2, 1.3, 1.4
 1.5, 1.6, 1.7, 1.8
Tag003
 1.1, 1.2, 1.3, 1.4
 1.1, 1.2, 1.3, 1.4
 1.5, 1.6, 1.7, 1.8
 1.5, 1.6, 1.7, 1.8

It produces the following result:

user=> (clojure.pprint/print-table [:tag :values] (parse-file (load-file "tags.txt")))
================================================================
:tag   | :values
================================================================
Tag001 | [[0.1 0.2 0.3 0.4] [0.5 0.6 0.7 0.8]]
Tag002 | [[1.1 1.2 1.3 1.4] [1.5 1.6 1.7 1.8]]
Tag003 | [[1.1 1.2 1.3 1.4] [1.1 1.2 1.3 1.4] [1.5 1.6 1.7 1.8] [1.5 1.6 1.7 1.8]]
================================================================