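I am trying to scrape a website with Scalpel, but I get a "Not in scope" error when using their own example code. The example can be found on their GitHub page, in the section "My Scraping Target Doesn't Return The Markup I Expected".
I am using the ghc-8.6.4 Haskell compiler.
My packages.yaml dependencies are: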
dependencies:
- base >= 4.7 && < 5
- http-conduit
- http-client
- http-client-tls
- http-types
- scalpel
Code:
{-# LANGUAGE NamedFieldPuns #-}
{-# LANGUAGE OverloadedStrings #-}

module Example where

import Text.HTML.Scalpel
import qualified Network.HTTP.Client as HTTP
import qualified Network.HTTP.Client.TLS as HTTP
import qualified Network.HTTP.Types.Header as HTTP

-- Create a new manager settings based on the default TLS manager that updates
-- the request headers to include a custom user agent.
managerSettings :: HTTP.ManagerSettings
managerSettings = HTTP.tlsManagerSettings {
  HTTP.managerModifyRequest = \req -> do
    req' <- HTTP.managerModifyRequest HTTP.tlsManagerSettings req
    return $ req' {
      HTTP.requestHeaders = (HTTP.hUserAgent, "My Custom UA")
                          : HTTP.requestHeaders req'
    }
}

main = do
    manager <- Just <$> HTTP.newManager managerSettings
    html <- scrapeURLWithConfig (def { manager }) url $ htmls anySelector
    maybe printError printHtml html
    where
        url = "https://www.google.com"
        printError = putStrLn "Failed"
        printHtml = mapM_ putStrLn
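As you can see in the code example, the manager constant sits right next to the def function. Yet manager somehow seems to be hidden, and I can't put my finger on what is wrong.
The entire console output of the stack build command, including the reported error: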
jroyer$ stack build
my-okr-haskeller-0.1.0.0: build (lib + exe)
Preprocessing library for my-okr-haskeller-0.1.0.0..
Building library for my-okr-haskeller-0.1.0.0..
[2 of 3] Compiling Example ( src/Example.hs, .stack-work/dist/x86_64-osx/Cabal-2.4.0.1/build/Example.o )
/Users/jroyer/Projects/bizgithub/my-okr-haskeller/src/Example.hs:26:40: error: Not in scope: ‘manager’
   |
26 |     html <- scrapeURLWithConfig (def { manager }) url $ htmls anySelector
   |                                        ^^^^^^^
-- While building package my-okr-haskeller-0.1.0.0 using:
/Users/jroyer/.stack/setup-exe-cache/x86_64-osx/Cabal-simple_mPHDZzAJ_2.4.0.1_ghc-8.6.4 --builddir=.stack-work/dist/x86_64-osx/Cabal-2.4.0.1 build lib:my-okr-haskeller exe:my-okr-haskeller-exe --ghc-options " -ddump-hi -ddump-to-file -fdiagnostics-color=always"
Process exited with code: ExitFailure 1
EDIT: I can reproduce the asker's problem with the older version of scalpel that the asker mentioned they are using:
[1 of 1] Compiling Example ( Main.hs, /var/folders/m7/_2kqsz4n4c3ck8050glq4ggr0000gn/T/cabal-repl.-26184/dist-newstyle/build/x86_64-osx/ghc-8.6.4/fake-package-0/x/script/build/script/script-tmp/Example.o )
Main.hs:34:40: error: Not in scope: ‘manager’
   |
34 |     html <- scrapeURLWithConfig (def { manager }) url $ htmls anySelector
   |                                        ^^^^^^^
./so.hs 16.94s user 3.89s system 114% cpu 18.155 total
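This is a suboptimal error message. It appears to be caused by using a named field pun with a variable whose name is not a field name. That is, in that version of scalpel, Config has no manager field. We can reproduce the issue in a smaller example: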
% cat test.hs
{-# LANGUAGE NamedFieldPuns #-}
data Foo = Foo { bar :: Int } deriving (Show)
main :: IO ()
main = print (Foo { zar})
    where zar = 23 :: Int
% ghc test.hs
...snipt...
test.hs:4:21: error:
Not in scope: ‘zar’
Perhaps you meant ‘bar’ (line 3)
|
4 | main = print (Foo { zar})
So the solution is to update scalpel to a newer version.
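For contrast with the failing test.hs, here is a minimal sketch of a pun that does resolve. It reuses the Foo record from that example; the names punned and explicit are hypothetical and only serve the illustration. The pun compiles because the variable in scope has the same name as a field of the record.

```haskell
{-# LANGUAGE NamedFieldPuns #-}
module Main where

-- Same shape as the Foo record in the small reproduction.
data Foo = Foo { bar :: Int } deriving (Show, Eq)

-- With NamedFieldPuns, `Foo { bar }` is shorthand for `Foo { bar = bar }`.
-- It needs a field called `bar` AND a variable called `bar` in scope.
punned :: Foo
punned = Foo { bar }
  where bar = 23 :: Int

-- The same value written without the pun (hypothetical name).
explicit :: Foo
explicit = Foo { bar = 23 }

main :: IO ()
main = print (punned == explicit)
```

Renaming the local variable to zar while keeping the field as bar brings back the "Not in scope" error, which is the situation in the older scalpel where Config had no manager field.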
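html <- scrapeURLWithConfig (def { manager }) url $ htmls anySelector
I don't know what this is supposed to be, specifically the (def { manager }) part. It isn't any syntax I'm familiar with.
If you have a manager, there should be a field for it. For example: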
def { someField = someValue }
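Not what you have, def { someValue }, which doesn't make sense.
Ah, NamedFieldPuns. To be honest, I have never used them, and looking at them I find I am more of a RecordWildCards user. Moving on.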
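To make the two spellings concrete, here is a minimal sketch of a record update written with and without a pun. Config and defaultConfig are hypothetical stand-ins for scalpel's Config and def (the real manager field holds an HTTP manager, not a String), so treat this as an illustration of the syntax rather than of the library's API.

```haskell
{-# LANGUAGE NamedFieldPuns #-}
module Main where

-- Hypothetical config record; NOT scalpel's real Config type.
data Config = Config
  { manager :: Maybe String
  , retries :: Int
  } deriving (Show, Eq)

-- Hypothetical stand-in for `def`.
defaultConfig :: Config
defaultConfig = Config { manager = Nothing, retries = 3 }

-- Plain record update: field name on the left, value on the right.
explicitUpdate :: Maybe String -> Config
explicitUpdate m = defaultConfig { manager = m }

-- Punned record update: `{ manager }` means `{ manager = manager }`,
-- which works only because the argument is named like the field.
punnedUpdate :: Maybe String -> Config
punnedUpdate manager = defaultConfig { manager }

main :: IO ()
main = print (explicitUpdate (Just "tls") == punnedUpdate (Just "tls"))
```

Both functions build the same value; the pun saves repeating the name, at the cost of a confusing error when the record has no field of that name.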
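Looking at the haddocks, the field name is manager, so you have a manager field and a value named manager for the field pun. I needed to add an import for def. While I was at it, I took the liberty of using cabal and a shebang to state all the packages explicitly: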
#! /usr/bin/env cabal
{- cabal:
build-depends:
    base >= 4
  , scalpel == 0.6.0
  , http-types == 0.12.3
  , http-client-tls == 0.3.5.3
  , http-client == 0.6.4
  , data-default == 0.7.1.1
-}

{-# LANGUAGE NamedFieldPuns #-}
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Data.Default
import Text.HTML.Scalpel
import qualified Network.HTTP.Client as HTTP
import qualified Network.HTTP.Client.TLS as HTTP
import qualified Network.HTTP.Types.Header as HTTP

-- Create a new manager settings based on the default TLS manager that updates
-- the request headers to include a custom user agent.
managerSettings :: HTTP.ManagerSettings
managerSettings = HTTP.tlsManagerSettings {
  HTTP.managerModifyRequest = \req -> do
    req' <- HTTP.managerModifyRequest HTTP.tlsManagerSettings req
    return $ req' {
      HTTP.requestHeaders = (HTTP.hUserAgent, "My Custom UA")
                          : HTTP.requestHeaders req'
    }
}

main = do
    manager <- Just <$> HTTP.newManager managerSettings
    html <- scrapeURLWithConfig (def { manager = manager }) url $ htmls anySelector
    maybe printError printHtml html
    where
        url = "https://www.google.com"
        printError = putStrLn "Failed"
        printHtml = mapM_ putStrLn
This seems to run fine. Note that the module containing main should itself be named Main.