Parsing data with Clojure, interval problem

And*_*sio 10 clojure

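I'm writing a small parser in Clojure for learning purposes. It is basically a parser for a TSV file whose contents need to go into a database, but I added a complication: the same file contains multiple intervals. The file looks like this: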

###andreadipersio 2010-03-19 16:10:00###                                                                                
USER     COMM               PID  PPID  %CPU %MEM      TIME  
root     launchd              1     0   0.0  0.0   2:46.97  
root     DirectoryService    11     1   0.0  0.2   0:34.59  
root     notifyd             12     1   0.0  0.0   0:20.83  
root     diskarbitrationd    13     1   0.0  0.0   0:02.84
....

###andreadipersio 2010-03-19 16:20:00###                                                                                
USER     COMM               PID  PPID  %CPU %MEM      TIME  
root     launchd              1     0   0.0  0.0   2:46.97  
root     DirectoryService    11     1   0.0  0.2   0:34.59  
root     notifyd             12     1   0.0  0.0   0:20.83  
root     diskarbitrationd    13     1   0.0  0.0   0:02.84

I ended up with this code:

(defn is-header?
  "Return true if the line is a header (starts with ###)."
  [line]
  (> (count (re-find #"^\#{3}" line)) 0))

(defn extract-fields
  "Return regex matches"
  [line pattern]
  (rest (re-find pattern line)))

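;; process-line is defined first so that process-lines can refer to it
(defn process-line
  "Extract header fields from a header line, data fields from any other line."
  [line]
  (if (is-header? line)
    (extract-fields line header-pattern)
    (extract-fields line data-pattern)))

(defn process-lines
  [lines]
  (map process-line lines))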

My idea is that in 'process-line' the interval needs to be merged with the data, so that I get something like this:

('andreadipersio', '2010-03-19', '16:10:00', 'root', 'launchd', 1, 0, 0.0, 0.0, '2:46.97')

for every line until the next interval, but I can't figure out how to make that happen.

I tried something like this:

(def process-line
  [line]
  (if is-header? line)
    (def header-data (extract-fields line header-pattern)))
  (cons header-data (extract-fields line data-pattern)))

But this doesn't work as expected.

Any hints?

Thanks!

Mic*_*zyk 6

One possible approach:

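  1. Split the input into lines with line-seq. (If you want to test this on a string, you can get a line-seq over it with (line-seq (java.io.BufferedReader. (java.io.StringReader. test-string))).)

  2. Partition it into sub-sequences, each containing either a single header line or some number of "process lines", with (clojure.contrib.seq/partition-by is-header? your-seq-of-lines). (As of Clojure 1.2, partition-by is available in clojure.core.)

  3. Assuming each header is followed by at least one process line, (partition 2 *2) (where *2 is the sequence obtained in step 2 above) returns a sequence of roughly this form: (((header-1) (process-line-1 process-line-2)) ((header-2) (process-line-3 process-line-4))). If the input may contain header lines that are not followed by any data lines, the result could instead look like (((header-1a header-1b) (process-line-1 process-line-2)) ...).

  4. Finally, transform the output of step 3 (*3) with the following function: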


(defn extract-fields-add-headers
  [[headers process-lines]]
  (let [header-fields (extract-fields (last headers) header-pattern)]
    (map #(concat header-fields (extract-fields % data-pattern))
         process-lines)))

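(A word on (last headers): the only case in which we get more than one header here is when some of them have no data lines of their own; the one actually attached to the data lines is the last one.)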
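To make the intermediate shapes concrete, here is a minimal sketch that uses keywords as stand-ins for real lines. The names sample-lines and header-line? are made up for this illustration and are not part of the code above; it assumes a Clojure version where partition-by is in clojure.core.

```clojure
;; Illustrative input (assumption): keywords stand in for lines.
;; :h1 is a header with two data lines; :h2a is a header with no data
;; lines of its own, immediately followed by header :h2b.
(def sample-lines [:h1 :a :b :h2a :h2b :c])

;; Stand-in for is-header?: a "header" is any keyword whose name starts with h.
(defn header-line? [x]
  (= \h (first (name x))))

;; Step 2: consecutive items with the same predicate result are grouped.
(def grouped (partition-by header-line? sample-lines))
;; grouped => ((:h1) (:a :b) (:h2a :h2b) (:c))

;; Step 3: pair each header group with the data group that follows it.
(def paired (partition 2 grouped))
;; paired => (((:h1) (:a :b)) ((:h2a :h2b) (:c)))

;; (last headers) picks the header that is actually attached to the data.
(def attached
  (map (fn [[headers data]] [(last headers) data]) paired))
;; attached => ([:h1 (:a :b)] [:h2b (:c)])
```

Note that (partition 2 ...) silently drops a trailing header group that has no data group after it, which is usually what you want here.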


With these example patterns:

(def data-pattern #"(\w+)\s+(\w+)\s+(\d+)\s+(\d+)\s+([0-9.]+)\s+([0-9.]+)\s+([0-9:.]+)")
(def header-pattern #"###(\w+)\s+([0-9-]+)\s+([0-9:]+)###")
;; we'll need to throw out the "USER  COMM  ..." lines,
;; empty lines and the "..." line which I haven't bothered
;; to remove from your sample input
(def discard-pattern #"^USER\s+COMM|^$|^\.\.\.")

The whole 'pipeline' might look like this:

;; partition-by is in clojure.core as of Clojure 1.2. On older versions,
;; pull it from clojure.contrib (normally you'd put this in an ns form):
;; (use '[clojure.contrib.seq :only (partition-by)])

(->> (line-seq (java.io.BufferedReader. (java.io.StringReader. test-data)))
     (remove #(re-find discard-pattern %)) ; throw out "USER  COMM ..."
     (partition-by is-header?)
     (partition 2)
     ;; mapcat performs a map, then concatenates results
     (mapcat extract-fields-add-headers))

(In the final program, line-seq would presumably take its input from a different source.)
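For instance, the lines could come from a file. The sketch below is one way to do that with clojure.java.io; read-all-lines is a hypothetical helper name, not something from the answer. Because line-seq is lazy, the lines have to be realized before with-open closes the reader.

```clojure
(require '[clojure.java.io :as io])

(defn read-all-lines
  "Hypothetical helper: returns every line of the file at `path` as a
  fully realized sequence, closing the reader before returning."
  [path]
  (with-open [rdr (io/reader path)]
    ;; doall forces the lazy line-seq while the reader is still open;
    ;; without it, reading after the reader is closed throws an exception
    (doall (line-seq rdr))))
```

The result can be fed to the pipeline in place of the StringReader-based line-seq. For large files, an alternative is to run the whole pipeline inside with-open and force only its final result.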

With your sample input, the above produces output like this (line breaks added for clarity):

(("andreadipersio" "2010-03-19" "16:10:00" "root" "launchd" "1" "0" "0.0" "0.0" "2:46.97")
 ("andreadipersio" "2010-03-19" "16:10:00" "root" "DirectoryService" "11" "1" "0.0" "0.2" "0:34.59")
 ("andreadipersio" "2010-03-19" "16:10:00" "root" "notifyd" "12" "1" "0.0" "0.0" "0:20.83")
 ("andreadipersio" "2010-03-19" "16:10:00" "root" "diskarbitrationd" "13" "1" "0.0" "0.0" "0:02.84")
 ("andreadipersio" "2010-03-19" "16:20:00" "root" "launchd" "1" "0" "0.0" "0.0" "2:46.97")
 ("andreadipersio" "2010-03-19" "16:20:00" "root" "DirectoryService" "11" "1" "0.0" "0.2" "0:34.59")
 ("andreadipersio" "2010-03-19" "16:20:00" "root" "notifyd" "12" "1" "0.0" "0.0" "0:20.83")
 ("andreadipersio" "2010-03-19" "16:20:00" "root" "diskarbitrationd" "13" "1" "0.0" "0.0" "0:02.84"))
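Every field above is still a string, whereas the row shown in the question has numeric PID, PPID, %CPU and %MEM values. If that matters for the database insert, a conversion step could be mapped over the rows. This is only a sketch: coerce-row and the field names in its argument vector are assumptions made for illustration.

```clojure
(defn coerce-row
  "Hypothetical helper: converts the numeric fields of one 10-element row
  to numbers and leaves the remaining fields as strings."
  [[source date clock user command pid ppid cpu mem cpu-time]]
  [source date clock user command
   (Long/parseLong pid)      ; PID
   (Long/parseLong ppid)     ; PPID
   (Double/parseDouble cpu)  ; %CPU
   (Double/parseDouble mem)  ; %MEM
   cpu-time])                ; TIME stays a string such as "2:46.97"

(coerce-row ["andreadipersio" "2010-03-19" "16:10:00" "root" "launchd"
             "1" "0" "0.0" "0.0" "2:46.97"])
;; => ["andreadipersio" "2010-03-19" "16:10:00" "root" "launchd" 1 0 0.0 0.0 "2:46.97"]
```

Appending (map coerce-row) as a last step of the ->> pipeline would apply it to every row.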


Bri*_*per 4

You're doing (> (count (re-find #"^\#{3}" line)) 0), but you can just do (re-find #"^\#{3}" line) and use the result as a boolean. re-find returns nil if the match fails.
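A quick illustration of that behaviour (the name header? is only used for this sketch; wrapping the result in boolean is optional and only needed if you want a strict true/false):

```clojure
;; A pattern without groups returns the matched string on success ...
(def hit  (re-find #"^#{3}" "###andreadipersio 2010-03-19 16:10:00###"))
;; hit => "###"

;; ... and nil on failure. nil is falsey, so the result works directly in if/when.
(def miss (re-find #"^#{3}" "root     launchd              1     0"))
;; miss => nil

(defn header?
  "True if the line starts with three # characters."
  [line]
  (boolean (re-find #"^#{3}" line)))
```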

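If you're iterating over the items in a collection and want to skip some items, or combine two or more items of the original into a single item of the result, then 99% of the time you want reduce. This usually ends up being very straightforward.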

;; These two libs are called "io" and "string" in bleeding-edge clojure-contrib
;; and some of the function names are different.
(require '(clojure.contrib [str-utils :as s]
                           [duck-streams :as io])) ; SO's syntax-highlighter still sucks

(defn clean [line]
  (s/re-gsub #"^###|###\s*$" "" line))

(defn interval? [line]
  (re-find #"^#{3}" line))

(defn skip? [line]
  (or (empty? line)
      (re-find #"^USER" line)))

(defn parse-line [line]
  (s/re-split #"\s+" (clean line)))

(defn parse [file]
  (first
   (reduce
    (fn [[data interval] line]
      (cond
       (interval? line) [data (parse-line line)]
       (skip? line)     [data interval]
       :else            [(conj data (concat interval (parse-line line))) interval]))
    [[] nil]
    (io/read-lines file))))
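The clojure.contrib libraries used above are no longer shipped with current Clojure releases. As a sketch of how the same reduce could look with only clojure.string, assuming the lines are already available as a sequence of strings (parse-lines is a name introduced here, and the extra trim is my addition to cope with trailing whitespace):

```clojure
(require '[clojure.string :as str])

(defn clean [line]
  ;; strip the ### markers at the start and end of an interval line
  (str/replace line #"^###|###\s*$" ""))

(defn interval? [line]
  (re-find #"^#{3}" line))

(defn skip? [line]
  (or (str/blank? line)
      (re-find #"^USER" line)))

(defn parse-line [line]
  ;; trim first so leading whitespace does not yield an empty first field
  (str/split (str/trim (clean line)) #"\s+"))

(defn parse-lines
  "Carries [rows current-interval] through the reduce and returns the rows."
  [lines]
  (first
   (reduce
    (fn [[data interval] line]
      (cond
        (interval? line) [data (parse-line line)]
        (skip? line)     [data interval]
        :else            [(conj data (vec (concat interval (parse-line line))))
                          interval]))
    [[] nil]
    lines)))
```

Since reduce is eager, calling (parse-lines (line-seq rdr)) inside a with-open around a clojure.java.io/reader is safe: all lines are consumed before the reader is closed.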

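  • This may or may not matter for the example at hand, but I disagree with the claim that "reduce" is well suited to this kind of task. In Clojure, "reduce" is always strict: it materializes the entire result in memory before any part of it becomes available for processing (because Clojure's "reduce" is a left fold). This contrasts with an approach in which lazy transformations are layered on top of one another (with the input sequence at the bottom of the stack), where the result can be produced in chunks. (2 upvotes)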
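As a rough sketch of what such a lazy variant could look like (this is not code from either answer; header?, fields and rows-with-headers are names invented here), the current header can be threaded through a recursive lazy-seq, so that rows are produced only as they are consumed:

```clojure
(require '[clojure.string :as str])

(defn header? [line]
  (boolean (re-find #"^#{3}" line)))

(defn fields [line]
  ;; drop the ### markers, trim, and split on runs of whitespace
  (str/split (str/trim (str/replace line #"^###|###\s*$" "")) #"\s+"))

(defn rows-with-headers
  "Lazily yields one row per data line, prefixed with the fields of the
  most recent header line. Assumes blank and column-title lines have
  already been removed from `lines`."
  [lines]
  (letfn [(step [header lines]
            (lazy-seq
             (when-let [[line & more] (seq lines)]
               (if (header? line)
                 (step (fields line) more)              ; remember the new header
                 (cons (concat header (fields line))    ; emit one row
                       (step header more))))))]
    (step nil lines)))

;; Works even on an endless input, because only the requested rows are realized.
(def first-rows
  (take 3 (rows-with-headers
           (cycle ["###host 2010-03-19 16:10:00###" "root launchd 1"]))))
```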