Julia 0.4.0-dev+7053 HTML parsing is blazingly fast

Fer*_*enc 3 html web-scraping julia

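EDIT: ... OK, it became fast with the kind help of @Ismael VC. The solution was to first wipe my Julia v0.4, reinstall it from a recent nightly, and then do a certain amount of package juggling: Pkg.init(), Pkg.add("Gumbo"). Adding Gumbo first produces a build error: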

INFO: Installing Gumbo v0.1.0
INFO: Building Gumbo

WARNING: deprecated syntax "[a=>b, ...]" at /Users/szalmaf/.julia/v0.4/Gumbo/deps/build.jl:19.
Use "Dict(a=>b, ...)" instead.
INFO: Attempting to Create directory /Users/szalmaf/.julia/v0.4/Gumbo/deps/downloads
INFO: Downloading file http://jamesporter.me/static/julia/gumbo-1.0.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (22) The requested URL returned error: 404 Not Found
================================[ ERROR: Gumbo ]================================

LoadError: failed process: Process(`curl -f -o /Users/szalmaf/.julia/v0.4/Gumbo/deps/downloads/gumbo-1.0.tar.gz -L http://jamesporter.me/static/julia/gumbo-1.0.tar.gz`, ProcessExited(22)) [22]
while loading /Users/szalmaf/.julia/v0.4/Gumbo/deps/build.jl, in expression starting on line 19

================================================================================

================================[ BUILD ERRORS ]================================

WARNING: Gumbo had build errors.

 - packages with build errors remain installed in /Users/szalmaf/.julia/v0.4
 - build the package(s) and all dependencies with `Pkg.build("Gumbo")`
 - build a single package by running its `deps/build.jl` script

================================================================================
INFO: Package database updated

So one needs to check out the latest Gumbo from the master branch (Pkg.update(), Pkg.build("Gumbo")), which in turn yields a Gumbo whose parsehtml is extremely fast.

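Note: the problem was not what the commenter (who did not read the earlier comments carefully) suggested, namely that the JIT compiler made "it" slow. If you read the back-and-forth between me and @Ismael VC, you can see that I ran his exact test code the way he did and posted the results in the first two comments, and my original installation really was too slow. Anyway, the important thing is that, with Ismael's help in our private chat, parsehtml is now as fast as it gets. Thanks again!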


Original post:

Julia 0.4.0-dev+7053 HTML parsing is extremely slow?

Although the Julia language is marketed as fast in many respects, it looks slow at basic everyday tasks such as parsing a web page.

I analyzed the http://julialang.org web page, which shows how fast Julia is relative to C, Fortran, R, Matlab, etc.

using HTTPClient, Gumbo   # provides get and parsehtml
julia_url = "http://julialang.org"
println("  scrape start: ", Dates.unix2datetime(time()))
julia_pageg = julia_url |> get
println("  scrape end: ", Dates.unix2datetime(time()))
julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml
println("  parsed: ", Dates.unix2datetime(time()))

scrape start: 2015-09-05T16:47:03.843
scrape end: 2015-09-05T16:47:04.044
parsed: 2015-09-05T16:47:04.41

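This shows that fetching the page takes about 200 ms, which is reasonable for my wifi connection. Parsing this simple page, however, takes about 400 ms, which sounds like a lot by today's standards.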

Running the same test on a more complex web page:

julia_url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
println("  scrape start: ", Dates.unix2datetime(time()))
julia_pageg = julia_url |> get
println("  scrape end: ", Dates.unix2datetime(time()))
julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml
println("  parsed: ", Dates.unix2datetime(time()))

scrape start: 2015-09-05T16:57:52.054
scrape end: 2015-09-05T16:57:52.736
parsed: 2015-09-05T16:57:53.699

Parsing takes almost a full second.

I may be missing something, but is there a better/faster way in Julia to parse a web page or to get HTML elements out of it? If so, how?

Sal*_*apa 5

First of all, have you read the performance tips in the manual? Which Julia version are you using? (versioninfo())

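You can start by reading them and putting your code inside a function, as the documentation suggests. There is also a @time macro, which additionally reports memory allocation, like this: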

Julia v0.3.11

Tested at: https://juliabox.org

using HTTPClient, Gumbo

function test(url::String)
    @show url

    print("Scraping: ")
    @time page = get(url)

    print("Parsing: ")
    @time page = parsehtml(bytestring(page.body))
end

let
    gc_disable()

    url =  "http://julialang.org"

    println("First run:")
    test(url)    # first run JITed

    println("\nSecond run:")
    test(url)

    url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"    

    println("\nThird run:")
    test(url)

    println("\nFourth run:")
    test(url)

    gc_enable()
end

First run:
url => "http://julialang.org"
Scraping: elapsed time: 0.248092469 seconds (3971912 bytes allocated)
Parsing: elapsed time: 0.850927483 seconds (27207516 bytes allocated)

Second run:
url => "http://julialang.org"
Scraping: elapsed time: 0.055722638 seconds (73952 bytes allocated)
Parsing: elapsed time: 0.005446998 seconds (821800 bytes allocated)

Third run:
url => "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping: elapsed time: 0.282382774 seconds (619324 bytes allocated)
Parsing: elapsed time: 0.227427243 seconds (9728620 bytes allocated)

Fourth run:
url => "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping: elapsed time: 0.288903961 seconds (400272 bytes allocated)
Parsing: elapsed time: 0.017787089 seconds (1516560 bytes allocated)

Here are the timings of your code with @time:

julia_url = "http://julialang.org"
@time julia_pageg = julia_url |> get
@time julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml

First run:

elapsed time: 0.361194892 seconds (11108960 bytes allocated)
elapsed time: 0.996812988 seconds (34546156 bytes allocated, 4.04% gc time)

Second run:

elapsed time: 0.018920084 seconds (77952 bytes allocated)
elapsed time: 0.006632215 seconds (823256 bytes allocated)

julia_url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
@time julia_pageg = julia_url |> get
@time julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml

First run:

elapsed time: 0.33795947 seconds (535916 bytes allocated)
elapsed time: 0.224386491 seconds (9729852 bytes allocated)

Second run:

elapsed time: 0.276848452 seconds (584944 bytes allocated)
elapsed time: 0.018806686 seconds (1517856 bytes allocated)

Edit: v0.4.0-dev+7053

On 0.4+, make sure to first run Pkg.checkout("Gumbo") to get the latest commits, and then run Pkg.build("Gumbo"). Doing that in JuliaBox gives the following:

http://nbviewer.ipython.org/gist/Ismael-VC/4c241228f04ed54c70e2

First run:
url = "http://julialang.org"
Scraping:   0.227681 seconds (85.11 k allocations: 3.585 MB)
Parsing:   0.696063 seconds (799.12 k allocations: 29.450 MB)

Second run:
url = "http://julialang.org"
Scraping:   0.018953 seconds (571 allocations: 69.344 KB)
Parsing:   0.007132 seconds (15.91 k allocations: 916.313 KB)

Third run:
url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping:   0.313128 seconds (4.86 k allocations: 608.850 KB)
Parsing:   0.196110 seconds (270.17 k allocations: 10.356 MB)

Fourth run:
url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping:  0.307949 seconds (1.41 k allocations: 470.953 KB)
Parsing:   0.019801 seconds (23.82 k allocations: 1.627 MB)

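  • Sfz, it looks like you are restarting Julia. Every time you restart, Julia JIT-compiles the code again. If you instead run both sets of timings in the REPL or in IJulia, you should see timings similar to @Ismael VC's. (3 upvotes)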
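
The warm-up effect described in this comment can be illustrated with a minimal sketch. It uses nothing from HTTPClient or Gumbo: `count_tags` is a hypothetical stand-in for a parsing workload, not Gumbo's `parsehtml`. The two calls are timed at top level within one session, so only the first one should include JIT compilation. Actual timings depend on the machine and the Julia version.

```julia
# Hypothetical stand-in for a parsing workload (NOT Gumbo's parsehtml):
# counts the '<' characters in an HTML string.
function count_tags(html)
    n = 0
    for c in html
        if c == '<'
            n += 1
        end
    end
    return n
end

html = "<html><body><p>hello</p></body></html>"

# First call in this session: the measured time includes compiling count_tags.
first_run = @elapsed count_tags(html)

# Second call in the same session: the compiled code is reused.
second_run = @elapsed count_tags(html)

println("first call:  ", first_run, " s")
println("second call: ", second_run, " s")
```

In a fresh session the first call is typically much slower than the second. Restarting Julia between measurements would make every measurement behave like the first one.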