I am trying to scrape Google Scholar search results and get the BibTeX format of every result that matches the search. Right now I have a Scrapy crawler with Splash. I have a Lua script that clicks the "Cite" link and loads the modal window before grabbing the href of the BibTeX-format citation. But since there are multiple search results, and therefore multiple "Cite" links, I need to click them all and load the individual BibTeX pages.
Here is what I have:
import scrapy
from scrapy_splash import SplashRequest


class CiteSpider(scrapy.Spider):
    name = "cite"
    allowed_domains = ["scholar.google.com", "scholar.google.ae"]
    start_urls = [
        'https://scholar.google.ae/scholar?q="thermodynamics"&hl=en'
    ]

    script = """
    function main(splash)
        local url = splash.args.url
        assert(splash:go(url))
        assert(splash:wait(0.5))
        splash:runjs('document.querySelectorAll("a.gs_nph[aria-controls=gs_cit]")[0].click()')
        splash:wait(3)
        local href = splash:evaljs('document.querySelectorAll(".gs_citi")[0].href')
        assert(splash:go(href))
        return {
            html = splash:html(),
            png = splash:png(),
            href=href,
        }
    end
    """

    def parse(self, response):
        yield SplashRequest(self.start_urls[0], self.parse_bib,
                            endpoint="execute",
                            args={"lua_source": self.script})

    def parse_bib(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.css("body …
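Since there are several "Cite" links per results page, one option is to loop inside the Lua script and collect every BibTeX href in a single Splash call, then request each href from the Scrapy callback. A rough, untested sketch; the waits and the modal close-button selector are assumptions:

    function main(splash)
        assert(splash:go(splash.args.url))
        assert(splash:wait(0.5))
        -- how many "Cite" links are on this results page?
        local count = splash:evaljs(
            'document.querySelectorAll("a.gs_nph[aria-controls=gs_cit]").length')
        local hrefs = {}
        for i = 0, count - 1 do
            -- open the i-th citation modal and read the BibTeX link out of it
            splash:runjs(string.format(
                'document.querySelectorAll("a.gs_nph[aria-controls=gs_cit]")[%d].click()', i))
            splash:wait(1)
            hrefs[#hrefs + 1] = splash:evaljs(
                'document.querySelectorAll(".gs_citi")[0].href')
            -- close the modal before moving on (selector is a guess)
            splash:runjs('document.querySelector("#gs_cit-x").click()')
            splash:wait(0.5)
        end
        return {hrefs = hrefs}
    end

With scrapy-splash's execute endpoint the returned table should come back as JSON, so the callback could read response.data['hrefs'] and yield a plain scrapy.Request for each BibTeX URL.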
I build my spiders with scrapy-splash. What I need now is to maintain the session, so I use scrapy.downloadermiddlewares.cookies.CookiesMiddleware and it handles the set-cookie headers. I know it handles set-cookie because I set COOKIES_DEBUG = True, which makes CookiesMiddleware print out every set-cookie header it sees.
The problem: when I also add Splash into the picture, the set-cookie printouts disappear, and the response headers I actually get are {'Date': ['Sun, 25 Sep 2016 12:09:55 GMT'], 'Content-Type': ['text/html; charset=utf-8'], 'Server': ['TwistedWeb/16.1.1']}, which come from the Splash rendering engine (built on TwistedWeb) rather than the target site.
Is there any directive to tell Splash to also give me the original response headers?
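For reference, a Lua script on Splash's execute endpoint can hand back the remote site's own headers and cookies via splash:history(); a minimal sketch based on that API (untested here):

    function main(splash)
        splash:init_cookies(splash.args.cookies)
        assert(splash:go(splash.args.url))
        assert(splash:wait(0.5))
        -- the last history entry holds the response from the target site,
        -- including its original headers and any set-cookie values
        local entries = splash:history()
        local last_response = entries[#entries].response
        return {
            html = splash:html(),
            headers = last_response.headers,
            cookies = splash:get_cookies(),
        }
    end

scrapy-splash also ships a SplashCookiesMiddleware which, as far as I understand, picks the cookies field out of such a response to keep the session going across requests.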
I want to reverse-engineer the content that gets generated by scrolling down in the web page. The problem is with the url https://www.crowdfunder.com/user/following_page/80159?user_id=80159&limit=0&per_page=20&screwrand=933. The screwrand part doesn't seem to follow any pattern, so reversing the url doesn't work. I'm considering using Splash for automatic rendering. How can I scroll the browser with Splash? Thanks a lot! Here is the code for the two requests:
request1 = scrapy_splash.SplashRequest('https://www.crowdfunder.com/user/following/{}'.format(user_id),
                                       self.parse_follow_relationship,
                                       args={'wait': 2},
                                       meta={'user_id': user_id, 'action': 'following'},
                                       endpoint='http://192.168.99.100:8050/render.html')
yield request1

request2 = scrapy_splash.SplashRequest('https://www.crowdfunder.com/user/following_user/80159?user_id=80159&limit=0&per_page=20&screwrand=76',
                                       self.parse_tmp,
                                       meta={'user_id': user_id, 'action': 'following'},
                                       endpoint='http://192.168.99.100:8050/render.html')
yield request2
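For the scrolling itself, the usual trick is to run a small Lua script on the execute endpoint that scrolls to the bottom in a loop and waits for the lazy-loaded content; the scroll count and wait times below are assumptions:

    function main(splash)
        assert(splash:go(splash.args.url))
        assert(splash:wait(2))
        -- scroll to the bottom a few times so the next batches get loaded
        for _ = 1, 5 do
            splash:runjs('window.scrollTo(0, document.body.scrollHeight)')
            assert(splash:wait(1))
        end
        return {html = splash:html()}
    end

The SplashRequest would then pass endpoint='execute' and args={'lua_source': script}. Note also that endpoint in scrapy-splash is just the endpoint name ('render.html', 'execute', ...); the Splash address itself belongs in the SPLASH_URL setting, not in the endpoint argument.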
I'm doing some crawling against Splash with Scrapy, using the scrapinghub/splash Docker container, but the container exits with code 139 after running for a while. I'm running the scraper on an AWS EC2 instance with 1 GB of swap allocated.
I also tried running it in the background and checking the logs afterwards, but there is no error indicating why it exited, it just stops.
As far as I understand, exit code 139 means a segmentation fault (SIGSEGV) on UNIX. Is there any way to check or log which memory was being accessed or which code was executing, so I can debug this?
Or can I increase the container's memory or the swap size to avoid it?
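On the memory side: Splash is known to grow in memory over long crawls, and it exposes a --maxrss option; Docker can additionally cap memory and restart the container when it dies. A hedged sketch using the docker Python SDK (the exact limits are assumptions; the equivalent docker run flags work just as well):

    import docker

    client = docker.from_env()
    # run Splash with an RSS cap (in MB) and let Docker restart it after a crash
    client.containers.run(
        "scrapinghub/splash",
        command="--maxrss 600",
        ports={"8050/tcp": 8050},
        mem_limit="1g",
        restart_policy={"Name": "always"},
        detach=True,
    )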
I am trying to log in to a website with the following code (slightly modified for this post):
import scrapy
from scrapy_splash import SplashRequest
from scrapy.crawler import CrawlerProcess


class Login_me(scrapy.Spider):
    name = 'espn'
    allowed_domains = ['games.espn.com']
    start_urls = ['http://games.espn.com/ffl/leaguerosters?leagueId=774630']

    def start_requests(self):
        script = """
        function main(splash)
            local url = splash.args.url
            assert(splash:go(url))
            assert(splash:wait(10))
            local search_input = splash:select('input[type=email]')
            search_input:send_text("user email")
            local search_input = splash:select('input[type=password]')
            search_input:send_text("user password!")
            assert(splash:wait(10))
            local submit_button = splash:select('input[type=submit]')
            submit_button:click()
            assert(splash:wait(10))
            return {html = splash:html()}
        end
        """
        yield SplashRequest(
            'http://games.espn.com/ffl/leaguerosters?leagueId=774630',
            callback=self.after_login,
            endpoint='execute',
            args={'lua_source': script}
        )

    def after_login(self, response):
        table = response.xpath('//table[@id="playertable_0"]')
        for player in table.css('tr[id]'):
            item = …
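One small debugging aid worth noting for scripts like the one above: splash:select returns nil when the selector matches nothing (for example if the login form lives in an iframe or has not rendered yet), so asserting on the result makes the failure visible instead of silent. A sketch:

    local search_input = splash:select('input[type=email]')
    -- splash:select yields nil when nothing matches; fail loudly instead
    assert(search_input, "email input not found")
    search_input:send_text("user email")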
My spider.py file looks like this:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url,
            self.parse,
            headers={'My-Custom-Header':'Custom-Header-Content'},
            meta={
                'splash': {
                    'args': {
                        'html': 1,
                        'wait': 5,
                    },
                }
            },
        )
And my parse is defined as follows:
def parse(self, response):
    print(response.request.headers)
When I run the spider, the following is printed as the headers:
{
b'Content-Type': [b'application/json'],
b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'],
b'Accept-Language': [b'en'],
b'User-Agent': [b'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.2309.372 Safari/537.36'],
b'Accept-Encoding': [b'gzip,deflate']
}
As you can see, it does not include the custom header I added to the Scrapy request.
Can someone help me get a custom header value onto this request? Thanks in advance.
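If I understand the Splash HTTP API correctly, the render endpoints accept a headers argument of their own, so one option is to pass the custom header inside the Splash args instead of on the outer Scrapy request. A rough sketch with SplashRequest (untested; the header name and values are taken from the question):

    from scrapy_splash import SplashRequest

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                args={
                    'html': 1,
                    'wait': 5,
                    # headers Splash will send on its outgoing request
                    'headers': {'My-Custom-Header': 'Custom-Header-Content'},
                },
            )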
I am trying to deploy the latest scrapinghub/splash.
I'm using git-bash on Win10, and I forked the repo to https://github.com/kc1/splash/blob/master.
I have been trying to modify the Dockerfile following "using docker, scrapy splash on Heroku".
After cloning the repo with Linux line endings:
git clone --config core.eol=lf <repository>
I was able to push the whole repo using the Heroku command line:
$ heroku container:push web --app MYAPP
After then doing:
$ heroku container:release web --app MYAPP
I got the screenshot. The logs show:
2019-05-26T15:38:29.843665+00:00 app[api]: Initial release by user myemail@gmail.com
2019-05-26T15:38:29.843665+00:00 app[api]: Release v1 created by user myemail@gmail.com
2019-05-26T15:38:29.961092+00:00 app[api]: Enable Logplex by user myemail@gmail.com
2019-05-26T15:38:29.961092+00:00 app[api]: Release v2 created by user myemail@gmail.com
2019-05-31T00:33:50.361100+00:00 app[api]: Deployed web …
I've built a simple scrapy spider running on scrapinghub:
class ExtractionSpider(scrapy.Spider):
    name = "extraction"
    allowed_domains = ['domain']
    start_urls = ['http://somedomainstart']
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"

    def parse(self, response):
        urls = response.css('a.offer-details__title-link::attr(href)').extract()
        print(urls)
        for url in urls:
            url = response.urljoin(url)
            yield SplashRequest(url=url, callback=self.parse_details)

        multiple_locs_urls = response.css('a.offer-regions__label::attr(href)').extract()
        print(multiple_locs_urls)
        for url in multiple_locs_urls:
            url = response.urljoin(url)
            yield SplashRequest(url=url, callback=self.parse_details)

        next_page_url = response.css('li.pagination_element--next > a.pagination_trigger::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield SplashRequest(url=next_page_url, callback=self.parse)

    def parse_details(self, …
I'm running into a problem where my Lua script refuses to execute. The response returned from the ScrapyRequest call seems to be the HTML body, while I'm expecting the document title. I'm assuming the Lua script is never being called, since it appears to have no observable effect on the response. I've dug through the documentation quite a bit and can't figure out what is missing here. Does anyone have any suggestions?
from urlparse import urljoin

import scrapy
from scrapy_splash import SplashRequest

GOOGLE_BASE_URL = 'https://www.google.com/'
GOOGLE_QUERY_PARAMETERS = '#q={query}'
GOOGLE_SEARCH_URL = urljoin(GOOGLE_BASE_URL, GOOGLE_QUERY_PARAMETERS)
GOOGLE_SEARCH_QUERY = 'example search query'

LUA_SCRIPT = """
function main(splash)
    assert(splash:go(splash.args.url))
    return splash:evaljs("document.title")
end
"""

SCRAPY_CRAWLER_NAME = 'google_crawler'
SCRAPY_SPLASH_ENDPOINT = 'render.html'
SCRAPY_ARGS = {
    'lua_source': LUA_SCRIPT
}


def get_search_url(query):
    return GOOGLE_SEARCH_URL.format(query=query)


class GoogleCrawler(scrapy.Spider):
    name = SCRAPY_CRAWLER_NAME
    search_url = get_search_url(GOOGLE_SEARCH_QUERY)

    def start_requests(self):
        response = SplashRequest(self.search_url,
                                 self.parse, endpoint=SPLASH_ENDPOINT, args=SCRAPY_ARGS)
        yield response

    def parse(self, response):
        doc_title = response.body_as_unicode()
        print doc_title
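One thing that stands out in the snippet above: lua_source is only honoured by Splash's execute endpoint, while render.html simply renders the page and ignores the script, which would explain getting the raw HTML body back (the request also references SPLASH_ENDPOINT although the constant is defined as SCRAPY_SPLASH_ENDPOINT). A hedged sketch of the request I would expect to run the script:

    yield SplashRequest(self.search_url,
                        self.parse,
                        endpoint='execute',  # render.html ignores lua_source
                        args={'lua_source': LUA_SCRIPT})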
I'm trying to access the cookies after making a request with Splash. Below is how I build the request.
script = """
function main(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
        http_method=splash.args.http_method,
        body=splash.args.body,
    })
    assert(splash:wait(0.5))

    local entries = splash:history()
    local last_response = entries[#entries].response
    return {
        url = splash:url(),
        headers = last_response.headers,
        http_status = last_response.status,
        cookies = splash:get_cookies(),
        html = splash:html(),
    }
end
"""

req = SplashRequest(
    url,
    self.parse_page,
    args={
        'wait': 0.5,
        'lua_source': script,
        'endpoint': 'execute'
    }
)
The script is an exact copy of the one in the Splash documentation.
So I'm trying to access the cookies that were set on the web page. When I'm not using Splash, the code below works as I expect, but not when using Splash.
self.logger.debug('Cookies: %s', response.headers.get('Set-Cookie'))
When using Splash it returns:
2017-01-03 12:12:37 [spider] DEBUG: Cookies: None
When I'm not using Splash, this code works and returns the cookies provided by the web page …
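For what it's worth, when the script actually runs on the execute endpoint (note that in the snippet above 'endpoint': 'execute' sits inside args rather than being passed as a SplashRequest keyword argument), scrapy-splash wraps the returned JSON in a SplashJsonResponse, so the cookies end up in response.data rather than in a Set-Cookie header. A sketch of reading them in the callback, assuming the execute endpoint was used and the script returned the cookies field:

    def parse_page(self, response):
        # the table returned by the Lua script is decoded into response.data
        cookies = response.data.get('cookies')
        self.logger.debug('Cookies: %s', cookies)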
Tags: splash-js-render ×10 · scrapy ×9 · python ×6 · docker ×2 · amazon-ec2 ×1 · heroku ×1 · linux ×1 · python-3.x ×1 · scrapinghub ×1 · web-crawler ×1 · web-scraping ×1