clojure hashmap markov
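If I have a vector of words, e.g. ["john" "said" ... "john" "walked" ...], I want to build a hash map from each word to the number of occurrences of each word that follows it, e.g. {"john" {"said" 1 "walked" 1 "kicked" 3}}.
The best solution I have come up with is to recurse through the list by index and use assoc to keep updating the hash map, but this seems really messy. Is there a more idiomatic way to do this?
Given that you have the words: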
(def words ["john" "said" "lara" "chased" "john" "walked" "lara" "chased"])
Use this transform fn:
(defn transform
  [words]
  (->> words
       (partition 2 1)
       (reduce (fn [acc [w next-w]]
                 ;; could be shortened to #(update-in %1 %2 (fnil inc 0))
                 (update-in acc
                            [w next-w]
                            (fnil inc 0)))
               {})))
(transform words)
;; {"walked" {"lara" 1}, "chased" {"john" 1}, "lara" {"chased" 2}, "said" {"lara" 1}, "john" {"walked" 1, "said" 1}}
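To see why this works, it can help to look at the two building blocks on their own: (partition 2 1 ...) yields overlapping pairs of adjacent words, and (fnil inc 0) treats a missing count as 0 before incrementing. A minimal REPL sketch follows; the name sample is only for illustration and is not part of the answer.

```clojure
;; Illustrative only: `sample` is a hypothetical three-word input.
(def sample ["john" "said" "lara"])

;; Overlapping pairs of adjacent words (window of 2, step of 1).
(partition 2 1 sample)
;; => (("john" "said") ("said" "lara"))

;; fnil substitutes 0 for nil, so a missing key starts counting at 1.
(update-in {} ["john" "said"] (fnil inc 0))
;; => {"john" {"said" 1}}

;; An existing count is simply incremented.
(update-in {"john" {"said" 1}} ["john" "said"] (fnil inc 0))
;; => {"john" {"said" 2}}
```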
Edit: you can gain performance by using transient hash maps, like this:
(defn transform-fast
  [words]
  (->> (map vector words (next words))
       (reduce (fn [acc [w1 w2]]
                 (let [c-map (get acc w1 (transient {}))]
                   (assoc! acc w1 (assoc! c-map w2
                                          (inc (get c-map w2 0))))))
               (transient {}))
       persistent!
       (reduce-kv (fn [acc w1 c-map]
                    (assoc! acc w1 (persistent! c-map)))
                  (transient {}))
       persistent!))
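Obviously the resulting source code is less readable, so this kind of optimization should only be applied when performance is critical.
(Criterium says it beats Michał Marczyk's transform* by roughly a factor of two on King Lear.)
(Update: see below for a solution using java.util.HashMap for intermediate products, with the final result still fully persistent. It is the fastest yet, with a factor of 2.35 speed advantage over transform-fast in the King Lear benchmark.)
merge-with-based solution
Here is a solution faster than transform, by a factor of roughly 1.7 on words taken from King Lear (see below for the exact methodology) and almost 3x on the sample words: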
(defn transform* [words]
  (apply merge-with
         #(merge-with + %1 %2)
         (map (fn [w nw] {w {nw 1}})
              words
              (next words))))
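The nested merge-with may be easier to follow on two single-pair maps. A small sketch, where the name inner-merge is only illustrative and not taken from the answer:

```clojure
;; The same combining function transform* passes to the outer merge-with:
;; when two inner maps share a key, their counts are added.
(def inner-merge #(merge-with + %1 %2))

;; Same first word, same follower: the counts add up.
(merge-with inner-merge {"lara" {"chased" 1}} {"lara" {"chased" 1}})
;; => {"lara" {"chased" 2}}

;; Same first word, different followers: the inner maps are combined.
(merge-with inner-merge {"john" {"said" 1}} {"john" {"walked" 1}})
;; => {"john" {"said" 1, "walked" 1}}
```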
The function passed to map could alternatively be written as
#(array-map %1 (array-map %2 1)),
although the timings with this approach are not quite as good. (I still include this version in the benchmarks below as transform**.)
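The definition of transform** is not shown in the answer. Presumably it is transform* with the array-map variant substituted for the anonymous function; a sketch under that assumption:

```clojure
;; Assumed definition, not confirmed by the original text: transform*
;; with the mapping function replaced by the array-map variant.
(defn transform** [words]
  (apply merge-with
         #(merge-with + %1 %2)
         (map #(array-map %1 (array-map %2 1))
              words
              (next words))))
```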
First, a sanity check:
;; same input
(def words ["john" "said" "lara" "chased" "john"
"walked" "lara" "chased"])
(= (transform words) (transform* words) (transform** words))
;= true
Criterium benchmark using the test input (OpenJDK 1.7 with -XX:+UseConcMarkSweepGC):
;; c is assumed to be an alias for criterium.core
(require '[criterium.core :as c])
(do (c/bench (transform words))
    (flush)
    (c/bench (transform* words))
    (flush)
    (c/bench (transform** words)))
Evaluation count : 4345080 in 60 samples of 72418 calls.
Execution time mean : 13.945669 µs
Execution time std-deviation : 158.808075 ns
Execution time lower quantile : 13.696874 µs ( 2.5%)
Execution time upper quantile : 14.295440 µs (97.5%)
Overhead used : 1.612143 ns
Found 2 outliers in 60 samples (3.3333 %)
low-severe 2 (3.3333 %)
Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
Evaluation count : 12998220 in 60 samples of 216637 calls.
Execution time mean : 4.705608 µs
Execution time std-deviation : 63.133406 ns
Execution time lower quantile : 4.605234 µs ( 2.5%)
Execution time upper quantile : 4.830540 µs (97.5%)
Overhead used : 1.612143 ns
Found 1 outliers in 60 samples (1.6667 %)
low-severe 1 (1.6667 %)
Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
Evaluation count : 10847220 in 60 samples of 180787 calls.
Execution time mean : 5.706852 µs
Execution time std-deviation : 73.589941 ns
Execution time lower quantile : 5.560404 µs ( 2.5%)
Execution time upper quantile : 5.828209 µs (97.5%)
Overhead used : 1.612143 ns
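Finally, the more interesting benchmark, using King Lear as found on Project Gutenberg (I did not bother stripping out the legal notices and such before processing):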
;; namespace aliases used below
(require '[clojure.java.io :as io]
         '[clojure.string :as string])
(def king-lear (slurp (io/file "/path/to/pg1128.txt")))
(def king-lear-words
  (-> king-lear
      (string/lower-case)
      (string/replace #"[^a-z]" " ")
      (string/trim)
      (string/split #"\s+")))
(do (c/bench (transform king-lear-words))
    (flush)
    (c/bench (transform* king-lear-words))
    (flush)
    (c/bench (transform** king-lear-words)))
Evaluation count : 720 in 60 samples of 12 calls.
Execution time mean : 87.012898 ms
Execution time std-deviation : 833.381589 µs
Execution time lower quantile : 85.772832 ms ( 2.5%)
Execution time upper quantile : 88.585741 ms (97.5%)
Overhead used : 1.612143 ns
Evaluation count : 1200 in 60 samples of 20 calls.
Execution time mean : 51.786860 ms
Execution time std-deviation : 587.029829 µs
Execution time lower quantile : 50.854355 ms ( 2.5%)
Execution time upper quantile : 52.940274 ms (97.5%)
Overhead used : 1.612143 ns
Evaluation count : 1020 in 60 samples of 17 calls.
Execution time mean : 61.287369 ms
Execution time std-deviation : 720.816107 µs
Execution time lower quantile : 60.131219 ms ( 2.5%)
Execution time upper quantile : 62.960647 ms (97.5%)
Overhead used : 1.612143 ns
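java.util.HashMap-based solution
Going all out, it is possible to do better still by using mutable hash maps for intermediate state and loop/recur to avoid consing when looping over pairs of words: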
(defn t9 [words]
  (let [m (java.util.HashMap.)]
    (loop [ws  words
           nws (next words)]
      (if nws
        (let [w  (first ws)
              nw (first nws)]
          (if-let [im ^java.util.HashMap (.get m w)]
            (.put im nw (inc (or (.get im nw) 0)))
            (let [im (java.util.HashMap.)]
              (.put im nw 1)
              (.put m w im)))
          (recur (next ws) (next nws)))
        (persistent!
         (reduce (fn [out k]
                   (assoc! out k
                           (clojure.lang.PersistentHashMap/create
                            ^java.util.HashMap (.get m k))))
                 (transient {})
                 (iterator-seq (.iterator (.keySet m)))))))))
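clojure.lang.PersistentHashMap/create is a static method in the PHM class and is admittedly an implementation detail. (It is not likely to change in the near future, though: currently all map creation in Clojure for the built-in map types goes through static methods like this.)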
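If relying on that implementation detail is a concern, the final conversion can also be written with public functions only, possibly at some cost in speed (not measured here). A sketch, where hashmap->persistent is a hypothetical helper name and not part of the answer:

```clojure
;; Hypothetical helper: converts a java.util.HashMap whose values are
;; themselves java.util.HashMaps into nested persistent maps, using only
;; public functions (into, reduce, transients).
(defn hashmap->persistent [^java.util.HashMap m]
  (persistent!
   (reduce (fn [out ^java.util.Map$Entry e]
             ;; into copies the inner mutable map into a persistent one
             (assoc! out (.getKey e) (into {} (.getValue e))))
           (transient {})
           (.entrySet m))))
```

In t9 this would stand in for the final persistent!/reduce form, called as (hashmap->persistent m).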
Sanity check:
(= (transform king-lear-words) (t9 king-lear-words))
;= true
Benchmark results:
(c/bench (transform-fast king-lear-words))
Evaluation count : 2100 in 60 samples of 35 calls.
Execution time mean : 28.560527 ms
Execution time std-deviation : 262.483916 µs
Execution time lower quantile : 28.117982 ms ( 2.5%)
Execution time upper quantile : 29.104784 ms (97.5%)
Overhead used : 1.898836 ns
(c/bench (t9 king-lear-words))
Evaluation count : 4980 in 60 samples of 83 calls.
Execution time mean : 12.153898 ms
Execution time std-deviation : 119.028100 µs
Execution time lower quantile : 11.953013 ms ( 2.5%)
Execution time upper quantile : 12.411588 ms (97.5%)
Overhead used : 1.898836 ns
Found 1 outliers in 60 samples (1.6667 %)
low-severe 1 (1.6667 %)
Variance from outliers : 1.6389 % Variance is slightly inflated by outliers