I am learning how to use scrapy + splash. I have created a project in a virtual environment and am now working through this tutorial: https://github.com/scrapy-plugins/scrapy-splash.
I ran:
$ docker run -p 8050:8050 scrapinghub/splash
which produced:
2017-01-12 09:18:50+0000 [-] Log opened.
2017-01-12 09:18:50.225754 [-] Splash version: 2.3
2017-01-12 09:18:50.227033 [-] Qt 5.5.1, PyQt 5.5.1, WebKit 538.1, sip 4.17, Twisted 16.1.1, Lua 5.2
2017-01-12 09:18:50.227201 [-] Python 3.4.3 (default, Nov 17 2016, 01:08:31) [GCC 4.8.4]
2017-01-12 09:18:50.227645 [-] Open files limit: 1048576
2017-01-12 09:18:50.227882 [-] Can't bump open files limit
2017-01-12 09:18:50.333978 [-] Xvfb is started: ['Xvfb', ':1', '-screen', '0', '1024x768x24']
2017-01-12 …
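For context, the scrapy-splash README that the tutorial link points to then wires Splash into the Scrapy project's settings roughly as below. This is only a sketch based on that README, and it assumes Splash is reachable on localhost:8050 as started by the docker command above:

    # settings.py of the Scrapy project (per the scrapy-splash README)
    SPLASH_URL = 'http://localhost:8050'

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

    # In a spider, requests are then routed through Splash:
    # from scrapy_splash import SplashRequest
    # yield SplashRequest(url, self.parse_result, args={'wait': 0.5})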
This question has been asked before, but the answer that always comes up is to use DjangoItem. However, its GitHub page states that it:

is usually not a good choice for write-intensive applications (such as a web crawler) ... may not scale well
This is the crux of my problem: I want to use and interact with my Django models in the same way I can when I run python manage.py shell and do from myapp.models import Model1, using queries as shown there.
I have tried relative imports and moving my entire Scrapy project into the Django app, all to no avail.
Where should I move my Scrapy project for this to work? How can I recreate/use all the methods that are available to me in the shell from inside a Scrapy pipeline?
Thanks in advance.
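One common pattern, sketched below, is to bootstrap Django from the Scrapy project's settings.py and then use the ORM in a pipeline exactly as in manage.py shell. The filesystem path and the settings module name are placeholders; myapp.models.Model1 is the name used in the question:

    # At the top of the Scrapy project's settings.py (sketch; paths are hypothetical)
    import os
    import sys
    import django

    sys.path.append('/path/to/django/project')                              # hypothetical location
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')   # hypothetical module
    django.setup()

    # pipelines.py -- the ORM is now usable just like in `python manage.py shell`
    from myapp.models import Model1   # model from the question

    class DjangoWriterPipeline(object):
        def process_item(self, item, spider):
            Model1.objects.create(title=item.get('title', ''))   # hypothetical field
            return item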
I am trying to scrape data from a web page. The page is just a bulleted list of 2500 URLs. Scrapy fetches each URL, goes to it, and collects some data......
Here is my code:
from bs4 import BeautifulSoup
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider

from myproject.items import NewsFields   # adjust to the project's actual items module


class MySpider(CrawlSpider):
    name = 'dknews'
    start_urls = ['http://www.example.org/uat-area/scrapy/all-news-listing']
    allowed_domains = ['example.org']

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        # Page metadata lives in <meta name="..."> tags
        ptype = soup.find_all(attrs={"name": "dkpagetype"})
        ptitle = soup.find_all(attrs={"name": "dkpagetitle"})
        pturl = soup.find_all(attrs={"name": "dkpageurl"})
        ptdate = soup.find_all(attrs={"name": "dkpagedate"})
        ptdesc = soup.find_all(attrs={"name": "dkpagedescription"})
        ptbody = ''   # initialised so the assignments below never hit an undefined name
        for node in soup.find_all("div", class_="module_content-panel-sidebar-content"):
            ptbody = ''.join(node.find_all(text=True))
            ptbody = ' '.join(ptbody.split())
        nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
        nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
        nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
        nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
        nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
        nf['bodytext'] = ptbody.encode('ascii', 'ignore')
        yield nf
        for …

I can run my Scrapy spider locally without any problem. However, when I try to run the job from Scrapinghub I get the following error (it connects to MongoDB Atlas in the cloud):
exceptions.ImportError: No module named pymodm
I import it with:
import pymodm
Any help is much appreciated.
Cheers
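A hedged note on the usual cause: Scrapy Cloud only installs packages the project declares. One common fix (the exact keys depend on the shub version, so treat this as an assumption) is to list pymodm in a requirements.txt and point scrapinghub.yml at it:

    # requirements.txt
    pymodm

    # scrapinghub.yml (the project id 12345 is a placeholder)
    projects:
      default: 12345
    requirements:
      file: requirements.txt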
I am trying to install Scrappy. I have Python 3.6 installed and I am on Windows.
I have tried:
py -3.5-32 -m pip install Scrappy
but I get the following:
Collecting Scrappy
Using cached Scrappy-0.3.0.alpha.4.tar.gz
Collecting guessit (from Scrappy)
Using cached guessit-2.1.2.tar.gz
Collecting tvdb_api (from Scrappy)
Using cached tvdb_api-1.10.tar.gz
Collecting hachoir-metadata (from Scrappy)
Using cached hachoir-metadata-1.3.3.tar.gz
Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\USER~1.NAME\AppData\Local\Temp\pip-build-1tu2hkos\hachoir-metadata\setup.py", line 65
        except OSError, err:
                      ^
    SyntaxError: invalid syntax
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in C:\Users\USER~1.NAME\AppData\Local\Temp\pip-build-1tu2hkos\hachoir-metadata\
Is it something to do with setuptools? I upgraded it and then tried to install Scrappy again, but got the same error message. …
I have been working through the Scrapy documentation today, trying to get a working version of the first-spider tutorial - https://docs.scrapy.org/en/latest/intro/tutorial.html#our-first-spider - on a real-world example. My example is slightly different in that it has two levels of next pages, i.e.:
start_url > city page > unit page
It is the unit pages I want to grab the data from.
My code:
import scrapy
class QuotesSpider(scrapy.Spider):
name = "quotes"
start_urls = [
'http://www.unitestudents.com/',
]
def parse(self, response):
for quote in response.css('div.property-body'):
yield {
'name': quote.xpath('//span/a/text()').extract(),
'type': quote.xpath('//div/h4/text()').extract(),
'price_amens': quote.xpath('//div/p/text()').extract(),
'distance_beds': quote.xpath('//li/p/text()').extract()
}
# Purpose is to crawl links of cities
next_page = response.css('a.listing-item__link::attr(href)').extract_first()
if next_page is not None:
next_page = response.urljoin(next_page)
yield scrapy.Request(next_page, callback=self.parse)
# Purpose is to crawl links of units
next_unit_page = response.css(response.css('a.text-highlight__inner::attr(href)').extract_first())
if next_unit_page is not None:
next_unit_page = response.urljoin(next_unit_page)
yield scrapy.Request(next_unit_page, callback=self.parse) …Run Code Online (Sandbox Code Playgroud) 我可以使用以下命令获得所需的跨度列表:
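Since the goal is start_url > city page > unit page, one sketch of the usual pattern is to give each level its own callback instead of reusing self.parse for everything. The CSS selectors below are copied from the snippet above; the callback names are made up for illustration:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ['http://www.unitestudents.com/']

        def parse(self, response):
            # Level 1: follow each city link
            for href in response.css('a.listing-item__link::attr(href)').extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse_city)

        def parse_city(self, response):
            # Level 2: follow each unit link on the city page
            for href in response.css('a.text-highlight__inner::attr(href)').extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse_unit)

        def parse_unit(self, response):
            # Level 3: scrape the unit page itself; './/' keeps each XPath
            # relative to its property block
            for quote in response.css('div.property-body'):
                yield {
                    'name': quote.xpath('.//span/a/text()').extract(),
                    'type': quote.xpath('.//div/h4/text()').extract(),
                    'price_amens': quote.xpath('.//div/p/text()').extract(),
                    'distance_beds': quote.xpath('.//li/p/text()').extract(),
                }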
I can get the list of spans I need using:

attrs = soup.find_all("span")
This returns a list of spans, as keys and values:
[
<span>back camera resolution</span>,
<span class="even">12 MP</span>
]
[
<span>front camera resolution</span>,
<span class="even">16 MP</span>
]
[
<span>video resolution</span>,
<span class="even"><a class="__cf_email__" data-cfemail="b98b888f89c9f98a89dfc9ca" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="4677767e7636067576203635" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="5067626010616260362023" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script>
</span>
]
The original HTML is:
Why is the "video resolution" value transformed like this?
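For what it is worth, the data-cfemail attributes in that span look like Cloudflare's email obfuscation (the value contains an @, so it gets treated as an address). Assuming that is what is happening here, the encoding is a simple XOR with the first byte as key, so a sketch of a decoder would be:

    def decode_cfemail(cfemail):
        """Decode a Cloudflare data-cfemail value; the first hex byte is the XOR key."""
        key = int(cfemail[:2], 16)
        return ''.join(
            chr(int(cfemail[i:i + 2], 16) ^ key)
            for i in range(2, len(cfemail), 2)
        )

    # Applied to every protected node in the soup:
    # for tag in soup.select('a.__cf_email__'):
    #     tag.replace_with(decode_cfemail(tag['data-cfemail']))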
I am trying to scrape some data from a website using Python. It worked fine in the early stages, but recently the site has started blocking the script's requests as well as the browser on my system. I know the target site has some blocking mechanism, but I can browse the same site from an incognito window without any problem.
How can I get around this? I tried using some of the header values used by the incognito browser, but with no result.
Any help appreciated.
Edit: adding the header values I tried:
"Accept":" / ","Accept-Encoding":"gzip,deflate,br","Accept-Language":"en-US,en; q = 0.8","Cache-Control":"no-cache" ","Connection":"keep-alive","Content-Length":"8","Content-Type":"application/x-www-form-urlencoded","Cookie":"JSESSIONID = 6 + b5vN7wfvBUHfQOK0d7bls ; TS01747e58 = 01d69c8eb5156bae15c47b3d2578bc88361c69fb48d9ec815b7e3e48aaab4afb974a4d8f5b448e558bfcd1da01f6246b460e8d88a2f87a126f095a23ccdd3d50439c61ecc9; BIGipServerjboss = 759271946.20480.0000; TS011968e6_28 = 01fabe97068921f1b57e70731e79cb34f9d73bcf98d7d1f65c7eb46ba87d3e6e751dec2ee2109c7bc65b7e3cdb05d397b47bdaf21e; TS011968e6 = 01d69c8eb5a3b1ea223ea72b0b4ace9a0ac39268b9d9ec815b7e3e48aaab4afb974a4d8f5b4d619ddc6882f5ecbd3007321d57f557b77bb39ff7ab95e2310bfa4be41364ef " "主机": "abc.co.in", "产地":" https://abc.co.in ""杂注":"no-cache","Referer":" https://abc.jsp ","User-Agent":"Mozilla/5.0(X11; Linux x86_64)AppleWebKit/537.36(KHTML,与Gecko一样)Chrome/52.0 .2743.116 Safari/537.36","X-Requested-With":"XMLHttpRequest","X-TS-A JAX-请求":"真",
I am trying to crawl a website with Scrapy. However, the site is sometimes very slow: the first request can take around 15-20 seconds to get a response in the browser. Anyway, sometimes when I try to crawl the site with Scrapy, I keep getting a TCP timeout error, even though the site opens fine in my browser. Here is the message:
2017-09-05 17:34:41 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.hosane.com/result/specialList> (failed 16 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
I have even overridden the USER_AGENT setting as a test. I don't think the DOWNLOAD_TIMEOUT setting applies here, because it defaults to 180 seconds and Scrapy doesn't even take 20-30 seconds before raising the TCP timeout error.
Any idea what is causing this problem? Is there a way to set a TCP timeout in Scrapy?
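For reference, the relevant knobs in Scrapy are ordinary project or per-spider settings; the values below are picked only for illustration. (The 10060 error itself comes from the operating system's TCP connect timeout, which is why DOWNLOAD_TIMEOUT alone may appear not to matter.)

    # settings.py (or per spider via custom_settings) -- illustrative values
    DOWNLOAD_TIMEOUT = 60        # seconds Scrapy waits for a response
    RETRY_ENABLED = True
    RETRY_TIMES = 5              # extra attempts after the first failure
    CONCURRENT_REQUESTS = 4      # go easier on a slow host

    # Equivalent per-spider override:
    # class SlowSiteSpider(scrapy.Spider):
    #     custom_settings = {'DOWNLOAD_TIMEOUT': 60, 'RETRY_TIMES': 5}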
What I actually want is to store all the data related to a particular website (text, references, images) in one folder. For that I need to pass the path of that folder to all the different parse functions, so I want to pass this path to scrapy.Request() as extra keyword arguments, like this:
yield scrapy.Request(url=url,dont_filter=True, callback=self.parse,errback = self.errback_function,kwargs={'path': '/path/to_folder'})
But it gives the error TypeError: __init__() got an unexpected keyword argument 'kwargs'.
How can I pass this path on to the next function?
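A sketch of the two usual ways to carry per-request data into a callback (the URL and path values are placeholders): Request.meta works in any Scrapy version, and cb_kwargs is available in newer releases (1.7+), where the extra values arrive as ordinary keyword arguments.

    import scrapy

    class FolderSpider(scrapy.Spider):
        name = 'folders'
        start_urls = ['http://www.example.org/']   # placeholder

        def parse(self, response):
            # Option 1: meta (works in every Scrapy version)
            yield scrapy.Request(
                url=response.url,
                dont_filter=True,
                callback=self.parse_page,
                meta={'path': '/path/to_folder'},
            )
            # Option 2: cb_kwargs (Scrapy 1.7+), delivered as a keyword argument
            yield scrapy.Request(
                url=response.url,
                dont_filter=True,
                callback=self.parse_page_kw,
                cb_kwargs={'path': '/path/to_folder'},
            )

        def parse_page(self, response):
            folder = response.meta['path']          # read it back in the callback
            self.logger.info('saving into %s', folder)

        def parse_page_kw(self, response, path):
            self.logger.info('saving into %s', path)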