Fer*_*enc 3 html web-scraping julia
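Edit: With the kind help of @Ismael VC, parsing is now fast. The solution was to wipe my Julia v0.4 installation, reinstall it from the most recent nightly build, and then do a certain amount of package juggling: Pkg.init(), then Pkg.add("Gumbo"). Adding Gumbo initially produces a build error: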
INFO: Installing Gumbo v0.1.0
INFO: Building Gumbo
WARNING: deprecated syntax "[a=>b, ...]" at /Users/szalmaf/.julia/v0.4/Gumbo/deps/build.jl:19.
Use "Dict(a=>b, ...)" instead.
INFO: Attempting to Create directory /Users/szalmaf/.julia/v0.4/Gumbo/deps/downloads
INFO: Downloading file http://jamesporter.me/static/julia/gumbo-1.0.tar.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
curl: (22) The requested URL returned error: 404 Not Found
================================[ ERROR: Gumbo ]================================
LoadError: failed process: Process(`curl -f -o /Users/szalmaf/.julia/v0.4/Gumbo/deps/downloads/gumbo-1.0.tar.gz -L http://jamesporter.me/static/julia/gumbo-1.0.tar.gz`, ProcessExited(22)) [22]
while loading /Users/szalmaf/.julia/v0.4/Gumbo/deps/build.jl, in expression starting on line 19
================================================================================
================================[ BUILD ERRORS ]================================
WARNING: Gumbo had build errors.
- packages with build errors remain installed in /Users/szalmaf/.julia/v0.4
- build the package(s) and all dependencies with `Pkg.build("Gumbo")`
- build a single package by running its `deps/build.jl` script
================================================================================
INFO: Package database updated
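So one needs to check out the latest Gumbo from the master branch, followed by Pkg.update() and Pkg.build("Gumbo"); this in turn yields a Gumbo whose parsehtml is extremely fast.
Note: the problem was not what one commenter (who did not read the earlier comments carefully) suggested, namely that the JIT compiler made "it" slow. If you read the back-and-forth between me and @Ismael VC, you can see that I ran his exact test code the way he did and got the results reported in the first two comments; my original installation really was too slow. In any case, the important thing is that, thanks to Ismael's help in our private chat, parsehtml is now as fast as it should be. Thanks again!
Original post:
Julia 0.4.0-dev+7053: HTML parsing extremely slow?
While the Julia language is sold as being fast in many respects, it looks slow at basic everyday tasks such as parsing a web page.
Consider parsing the http://julialang.org web page, which shows how fast Julia is compared with C, Fortran, R, Matlab, etc.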
# using HTTPClient, Gumbo
julia_url = "http://julialang.org"
println(" scrape start: ", Dates.unix2datetime(time()))
julia_pageg = julia_url |> get
println(" scrape end: ", Dates.unix2datetime(time()))
julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml
println(" parsed: ", Dates.unix2datetime(time()))
gives
scrape start: 2015-09-05T16:47:03.843
scrape end: 2015-09-05T16:47:04.044
parsed: 2015-09-05T16:47:04.41
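This shows that fetching the page takes about 200 ms, which is reasonable for my wifi connection; parsing this simple page, however, takes about 400 ms, which sounds slow by today's standards.
Running the same test on a more complex web page,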
julia_url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
println(" scrape start: ", Dates.unix2datetime(time()))
julia_pageg = julia_url |> get
println(" scrape end: ", Dates.unix2datetime(time()))
julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml
println(" parsed: ", Dates.unix2datetime(time()))
gives
scrape start: 2015-09-05T16:57:52.054
scrape end: 2015-09-05T16:57:52.736
parsed: 2015-09-05T16:57:53.699
Parsing takes almost a full second.
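To make the arithmetic behind these figures explicit, here is a minimal sketch that subtracts the timestamps printed in the two runs. It is written for current Julia (1.x), where Dates is a standard library that must be loaded; the helper name elapsed_ms and the short labels are made up for illustration, and the timestamp printed as "04.41" is written out as "04.410".

```julia
using Dates

# Hypothetical helper: milliseconds elapsed between two ISO 8601 timestamps.
elapsed_ms(a::AbstractString, b::AbstractString) =
    Dates.value(DateTime(b) - DateTime(a))

# (label, scrape start, scrape end, parsed), copied from the output of both runs.
runs = [
    ("julialang.org", "2015-09-05T16:47:03.843", "2015-09-05T16:47:04.044", "2015-09-05T16:47:04.410"),
    ("quora.com",     "2015-09-05T16:57:52.054", "2015-09-05T16:57:52.736", "2015-09-05T16:57:53.699"),
]

for (label, started, fetched, parsed) in runs
    println(label, ": fetch ", elapsed_ms(started, fetched),
            " ms, parse ", elapsed_ms(fetched, parsed), " ms")
end
```

Subtracting timestamps this way only has millisecond resolution and says nothing about memory allocation.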
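I may be missing something, but is there a better/faster way in Julia to parse a web page or extract HTML elements from it? If so, how?
First of all, have you read the performance tips in the manual? Which Julia version are you using? (versioninfo())
You could start by reading them and putting your code inside a function, as the documentation suggests. There is also a @time macro, which additionally reports memory allocation, like this:
Tested at: https://juliabox.org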
using HTTPClient, Gumbo
function test(url::String)
@show url
print("Scraping: ")
@time page = get(url)
print("Parsing: ")
@time page = parsehtml(bytestring(page.body))
end
let
gc_disable()
url = "http://julialang.org"
println("First run:")
test(url) # first run JITed
println("\nSecond run:")
test(url)
url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
println("\nThird run:")
test(url)
println("\nFourth run:")
test(url)
gc_enable()
end
First run:
url => "http://julialang.org"
Scraping: elapsed time: 0.248092469 seconds (3971912 bytes allocated)
Parsing: elapsed time: 0.850927483 seconds (27207516 bytes allocated)
Second run:
url => "http://julialang.org"
Scraping: elapsed time: 0.055722638 seconds (73952 bytes allocated)
Parsing: elapsed time: 0.005446998 seconds (821800 bytes allocated)
Third run:
url => "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping: elapsed time: 0.282382774 seconds (619324 bytes allocated)
Parsing: elapsed time: 0.227427243 seconds (9728620 bytes allocated)
Fourth run:
url => "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping: elapsed time: 0.288903961 seconds (400272 bytes allocated)
Parsing: elapsed time: 0.017787089 seconds (1516560 bytes allocated)
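In the output above, the first call on each path is much slower than the repeat call, which is consistent with one-time JIT compilation being included in the first measurement (the test code marks the first run as "JITed"). The effect can be sketched in isolation with Base only, written here for current Julia. The function count_tags is a hypothetical stand-in that merely counts '<' characters; it is not Gumbo's parsehtml, and the actual timings will vary by machine.

```julia
# Hypothetical stand-in for a library call such as parsehtml: it only counts
# '<' characters, so the sketch runs without any packages.
count_tags(html::AbstractString) = count(c -> c == '<', html)

html = "<html><body><p>hello</p></body></html>"

# Timed at top level: the first call typically also pays for compiling count_tags.
first_run = @elapsed count_tags(html)
# The second call reuses the already compiled code.
second_run = @elapsed count_tags(html)

println("first run:  ", first_run, " s")
println("second run: ", second_run, " s")
```

A warm-up call before timing separates compilation cost from steady-state cost, but it cannot explain a slowdown that persists on repeated calls.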
Using @time alone:
julia_url = "http://julialang.org"
@time julia_pageg = julia_url |> get
@time julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml
First run:
elapsed time: 0.361194892 seconds (11108960 bytes allocated)
elapsed time: 0.996812988 seconds (34546156 bytes allocated, 4.04% gc time)
Second run:
elapsed time: 0.018920084 seconds (77952 bytes allocated)
elapsed time: 0.006632215 seconds (823256 bytes allocated)
julia_url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
@time julia_pageg = julia_url |> get
@time julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml
First run:
elapsed time: 0.33795947 seconds (535916 bytes allocated)
elapsed time: 0.224386491 seconds (9729852 bytes allocated)
Second run:
elapsed time: 0.276848452 seconds (584944 bytes allocated)
elapsed time: 0.018806686 seconds (1517856 bytes allocated)
On 0.4+, make sure you first run Pkg.checkout("Gumbo") to get the latest commits, and then Pkg.build("Gumbo"). Doing that on JuliaBox gives the following:
http://nbviewer.ipython.org/gist/Ismael-VC/4c241228f04ed54c70e2
First run:
url = "http://julialang.org"
Scraping: 0.227681 seconds (85.11 k allocations: 3.585 MB)
Parsing: 0.696063 seconds (799.12 k allocations: 29.450 MB)
Second run:
url = "http://julialang.org"
Scraping: 0.018953 seconds (571 allocations: 69.344 KB)
Parsing: 0.007132 seconds (15.91 k allocations: 916.313 KB)
Third run:
url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping: 0.313128 seconds (4.86 k allocations: 608.850 KB)
Parsing: 0.196110 seconds (270.17 k allocations: 10.356 MB)
Fourth run:
url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping: 0.307949 seconds (1.41 k allocations: 470.953 KB)
Parsing: 0.019801 seconds (23.82 k allocations: 1.627 MB)