Expert F# web crawler example

Question from Ben*_*ins (tags: f#, ctp):

I'm trying to work through the samples in Expert F#, which are based on v1.9.2, but the CTP releases since then have changed enough that some of the samples no longer compile.

I'm having some trouble with Listing 13-13. Here is a snippet of the urlCollector object definition:

let urlCollector =
    MailboxProcessor.Start(fun self ->
        let rec waitForUrl (visited : Set<string>) =
            async { if visited.Count < limit then
                        let! url = self.Receive()
                        if not (visited.Contains(url)) then
                            do! Async.Start
                                (async { let! links = collectLinks url
                                         for link in links do
                                         do self <-- link })

                        return! waitForUrl(visited.Add(url)) }

            waitForUrl(Set.Empty))

I'm compiling with version 1.9.6.16, and the compiler complains as follows:

  1. Incomplete structured construct at or before this point in expression [after the last paren]
  2. Error in the return expression for this 'let'. Possible incorrect indentation [referring to the let that defines waitForUrl]

Can anyone spot what's going wrong here?

Answer from Bri*_*ian:

It looks like the last line needs to be un-indented by four spaces.
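
That is, the call that kicks off the recursion belongs at the same indentation as the let rec binding itself, outside the async block. A minimal sketch of the shape the compiler expects (my own illustration, not from the book):

let echoAgent =
    MailboxProcessor.Start(fun inbox ->
        let rec loop (count : int) =
            async { // Wait for the next message
                    let! msg = inbox.Receive()
                    printfn "message %d: %s" count msg
                    // Recurse inside the async block...
                    return! loop (count + 1) }

        // ...but start the loop at the same indentation as 'let rec'
        loop 0)

echoAgent.Post("hello")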

Edit: Actually, it looks like there is more going on here. Assuming this is the same as the sample found here, this is a version I just modified to sync up with the 1.9.6.16 release:

open System.Collections.Generic
open System.Net
open System.IO
open System.Threading
open System.Text.RegularExpressions

let limit = 10    

let linkPat = "href=\s*\"[^\"h]*(http://[^&\"]*)\""
let getLinks (txt:string) =
    [ for m in Regex.Matches(txt,linkPat)  -> m.Groups.Item(1).Value ]

let (<--) (mp: MailboxProcessor<_>) x = mp.Post(x)

// A type that helps limit the number of active web requests
type RequestGate(n:int) =
    let semaphore = new Semaphore(initialCount=n,maximumCount=n)
    member x.AcquireAsync(?timeout) =
        async { let! ok = semaphore.AsyncWaitOne(?millisecondsTimeout=timeout)
                if ok then
                   return
                     { new System.IDisposable with
                         member x.Dispose() =
                             semaphore.Release() |> ignore }
                else
                   return! failwith "couldn't acquire a semaphore" }

// Gate the number of active web requests
let webRequestGate = RequestGate(5)

// Fetch the URL, and post the results to the urlCollector.
let collectLinks (url:string) =
    async { // An Async web request with a global gate
            let! html =
                async { // Acquire an entry in the webRequestGate. Release
                        // it when 'holder' goes out of scope
                        use! holder = webRequestGate.AcquireAsync()

                        // Wait for the WebResponse
                        let req = WebRequest.Create(url,Timeout=5)

                        use! response = req.AsyncGetResponse()

                        // Get the response stream
                        use reader = new StreamReader(
                            response.GetResponseStream())

                        // Read the response stream
                        return! reader.AsyncReadToEnd()  }

            // Compute the links, synchronously
            let links = getLinks html

            // Report, synchronously
            do printfn "finished reading %s, got %d links" 
                    url (List.length links)

            // We're done
            return links }

let urlCollector =
    MailboxProcessor.Start(fun self ->
        let rec waitForUrl (visited : Set<string>) =
            async { if visited.Count < limit then
                        let! url = self.Receive()
                        if not (visited.Contains(url)) then
                            Async.Start 
                                (async { let! links = collectLinks url
                                         for link in links do
                                             do self <-- link })
                        return! waitForUrl(visited.Add(url)) }

        waitForUrl(Set.Empty))

urlCollector <-- "http://news.google.com"
// wait for keypress to end program
System.Console.ReadKey() |> ignore
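
Two notes on the changes. First, besides the indentation fix on the last line, the original listing's "do! Async.Start (...)" has become a plain "Async.Start (...)": Async.Start returns unit, while do! expects an expression of type Async<unit>, so the do! form would not have typechecked even with the indentation fixed.

Second, semaphore.AsyncWaitOne was an extension method shipped with the F# library of that era (it later moved to the F# PowerPack) and is gone from current F#. On a modern runtime the same gating pattern can be written with SemaphoreSlim; the following is a sketch under that assumption (RequestGateSlim is my name, not part of the original answer):

open System
open System.Threading

// Same idea as RequestGate above, but using SemaphoreSlim.WaitAsync
// in place of the old AsyncWaitOne extension.
type RequestGateSlim(n : int) =
    let semaphore = new SemaphoreSlim(initialCount = n, maxCount = n)
    member x.AcquireAsync() =
        async { // Wait asynchronously for a free slot in the gate
                do! semaphore.WaitAsync() |> Async.AwaitTask
                // Hand back a disposable that releases the slot
                return
                  { new System.IDisposable with
                      member x.Dispose() =
                          semaphore.Release() |> ignore } }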