How do I find the differences between 2 data sets?

5 clojure

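For example, suppose I have 2 pipe-delimited files containing bookmark data. How can I read in the data and then determine the differences between the two sets?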

Input set #1: bookmarks.csv

2|www.cnn.com|News|This is CNN
3|www.msnbc.com|Search|
4|news.ycombinator.com|News|Tech news
5|bing.com|Search|Competitor

Input set #2: bookmarks2.csv

1|www.google.com|Search|King of search
2|www.cnn.com|News|This is CNN
3|www.msnbc.com|Search|New comment
4|news.ycombinator.com|News|Tech news

Output

Id #1 is missing in set #1
Id #5 is missing in set #2
Id #3 is different:
  -> www.msnbc.com|Search|
  -> www.msnbc.com|Search|New comment

Bri*_*per 5

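;; Relies on the legacy monolithic clojure.contrib library (pre-1.3 Clojure).
(use '[clojure.contrib str-utils duck-streams pprint]
     '[clojure set])

;; Read a file into a map of id -> rest of the line.
;; Each line is split on the first "|" only, so the value keeps its own pipes.
(defn read-bookmarks [filename]
  (apply hash-map
         (mapcat #(re-split #"\|" % 2)
                 (read-lines filename))))

;; Compare the two files by id and return a formatted report string.
(defn diff-bookmarks [filename1 filename2]
  (let [f1 (read-bookmarks filename1)
        f2 (read-bookmarks filename2)
        k1 (set (keys f1))
        k2 (set (keys f2))
        ;; ids that appear in only one of the two files
        missing-in-1 (difference k2 k1)
        missing-in-2 (difference k1 k2)
        ;; ids in both files whose remaining fields differ
        present-but-different (filter #(not= (f1 %) (f2 %))
                                      (intersection k1 k2))]
    (cl-format nil "~{Id #~a is missing in set #1~%~}~{Id #~a is missing in set #2~%~}~{~{Id #~a is different~%  -> ~a~%  -> ~a~%~}~}"
               missing-in-1
               missing-in-2
               (map #(list % (f1 %) (f2 %))
                    present-but-different))))

(print (diff-bookmarks "bookmarks.csv" "bookmarks2.csv"))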
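
The code above depends on clojure.contrib, which is no longer shipped with current Clojure releases. As a rough sketch of the same idea using only namespaces bundled with Clojure (clojure.string, clojure.set, clojure.java.io), something like the following should work on Clojure 1.7 or later. The names parse-bookmarks and report, and the choice to return a data map before formatting, are assumptions made for this illustration and are not part of the original answer.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.set :as set]
         '[clojure.string :as str])

(defn parse-bookmarks
  "Turn a seq of lines such as \"2|www.cnn.com|News|This is CNN\" into a map
  of id -> rest of the line. Blank lines are skipped. (Hypothetical helper.)"
  [lines]
  (into {}
        (comp (remove str/blank?)
              ;; split on the first pipe only, like the limit of 2 above
              (map #(str/split % #"\|" 2))
              (map (fn [[id more]] [(str/trim id) (str/trim (or more ""))])))
        lines))

(defn read-bookmarks
  "Read a pipe-delimited bookmark file into an id -> line map."
  [filename]
  (with-open [r (io/reader filename)]
    ;; `into` is eager, so the whole file is consumed before the reader closes
    (parse-bookmarks (line-seq r))))

(defn diff-bookmarks
  "Compare two id -> line maps and return the differences as data."
  [m1 m2]
  (let [k1 (set (keys m1))
        k2 (set (keys m2))]
    {:missing-in-1 (sort (set/difference k2 k1))
     :missing-in-2 (sort (set/difference k1 k2))
     :different    (for [k (sort (set/intersection k1 k2))
                         :when (not= (m1 k) (m2 k))]
                     [k (m1 k) (m2 k)])}))

(defn report
  "Format the result of diff-bookmarks as text. (Hypothetical helper.)"
  [{:keys [missing-in-1 missing-in-2 different]}]
  (str/join
   "\n"
   (concat
    (map #(str "Id #" % " is missing in set #1") missing-in-1)
    (map #(str "Id #" % " is missing in set #2") missing-in-2)
    (mapcat (fn [[k a b]]
              [(str "Id #" k " is different:")
               (str "  -> " a)
               (str "  -> " b)])
            different))))
```

Assuming the two files from the question are in the working directory, it could be called as `(println (report (diff-bookmarks (read-bookmarks "bookmarks.csv") (read-bookmarks "bookmarks2.csv"))))`. Keeping the comparison separate from the formatting makes the diff easy to check at the REPL, and ids are sorted as strings here, so "10" would sort before "2".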