lui*_*iel 8 text-processing clojure lazy-sequences
我正在Clojure中编写一个简单的桌面搜索引擎,以此来了解有关该语言的更多信息.到目前为止,我的程序文本处理阶段的表现非常糟糕.
在文本处理期间,我要:
这是代码:
(ns txt-processing.core
(:require [clojure.java.io :as cjio])
(:require [clojure.string :as cjstr])
(:gen-class))
(defn all-files [path]
(let [entries (file-seq (cjio/file path))]
(filter (memfn isFile) entries)))
(def char-val
(let [value #(Character/getNumericValue %)]
{:a (value \a) :z (value \z)
:A (value \A) :Z (value \Z)
:0 (value \0) :9 (value \9)}))
(defn is-ascii-alpha-num [c]
(let [n (Character/getNumericValue c)]
(or (and (>= n (char-val :a)) (<= n (char-val :z)))
(and (>= n (char-val :A)) (<= n (char-val :Z)))
(and (>= n (char-val :0)) (<= n (char-val :9))))))
(defn is-valid [c]
(or (is-ascii-alpha-num c)
(Character/isSpaceChar c)
(.equals (str \newline) (str c))))
(defn lower-and-replace [c]
(if (.equals (str \newline) (str c)) \space (Character/toLowerCase c)))
(defn tokenize [content]
(let [filtered (filter is-valid content)
lowered (map lower-and-replace filtered)]
(cjstr/split (apply str lowered) #"\s+")))
(defn process-content [content]
(let [words (tokenize content)]
(loop [ws words i 0 hmap (hash-map)]
(if (empty? ws)
hmap
(recur (rest ws) (+ i 1) (update-in hmap [(first ws)] #(conj % i)))))))
(defn -main [& args]
(doseq [file (all-files (first args))]
(let [content (slurp file)
oc-list (process-content content)]
(println "File:" (.getPath file)
"| Words to be indexed:" (count oc-list )))))
Run Code Online (Sandbox Code Playgroud)
由于我在Haskell中有另一个实现此问题的方法,因此我在以下输出中进行了比较.
Clojure版本:
$ lein uberjar
Compiling txt-processing.core
Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT.jar
Including txt-processing-0.1.0-SNAPSHOT.jar
Including clojure-1.5.1.jar
Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT-standalone.jar
$ time java -jar target/txt-processing-0.1.0-SNAPSHOT-standalone.jar ../data
File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033
File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028
File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562
File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754
File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418
File: ../data/.directory | Words to be indexed: 3
File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191
File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378
File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451
File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049
File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721
File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494
File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642
real 2m2.164s
user 2m3.868s
sys 0m0.978s
Run Code Online (Sandbox Code Playgroud)
Haskell版本:
$ ghc -rtsopts --make txt-processing.hs
[1 of 1] Compiling Main ( txt-processing.hs, txt-processing.o )
Linking txt-processing ...
$ time ./txt-processing ../data/ +RTS -K12m
File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033
File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028
File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562
File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754
File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418
File: ../data/.directory | Words to be indexed: 3
File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191
File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378
File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451
File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049
File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721
File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494
File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642
real 0m9.086s
user 0m8.591s
sys 0m0.463s
Run Code Online (Sandbox Code Playgroud)
我认为Clojure实现中的(字符串 - > 延迟序列)转换会破坏性能.我怎样才能改进它?
PS:这些测试中使用的所有代码和数据都可以在这里下载.
您可以做的一些事情可能会加快此代码的速度:
1)无需将您的映射chars
到char-val
,只需在字符之间进行直接值比较即可。其速度更快的原因与 Java 中速度更快的原因相同。
2) 您重复使用str
将单字符值转换为完整的字符串。再次考虑直接使用字符值。同样,对象创建速度很慢,与 Java 中一样。
3)您应该替换process-content
为clojure.core/frequencies
. 也许检查frequencies
源代码看看它是如何更快的。
4) 如果必须(hash-map)
在循环中更新 a,请使用transient
. 请参阅: http: //clojuredocs.org/clojure_core/clojure.core/transient
另请注意,(hash-map)
返回 a PersistentArrayMap
,因此您每次调用都会创建新实例update-in
- 因此速度很慢,以及为什么您应该使用瞬态。
5) 这是你的朋友:(set! *warn-on-reflection* true)
- 你有相当多的反思可以从类型提示中受益
Reflection warning, scratch.clj:10:13 - call to isFile can't be resolved.
Reflection warning, scratch.clj:13:16 - call to getNumericValue can't be resolved.
Reflection warning, scratch.clj:19:11 - call to getNumericValue can't be resolved.
Reflection warning, scratch.clj:26:9 - call to isSpaceChar can't be resolved.
Reflection warning, scratch.clj:30:47 - call to toLowerCase can't be resolved.
Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved.
Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved.
Run Code Online (Sandbox Code Playgroud)
归档时间: |
|
查看次数: |
1410 次 |
最近记录: |