I am working on a web-crawling project in Python with the Scrapy framework. It scrapes roughly 10k web pages from e-commerce shopping sites. The whole project works correctly, but before moving the code from the test server to the production server I want to choose a better proxy-IP provider service, so that I don't have to worry about my IPs being blocked or my spiders being denied access to the sites.
So far I have been using a middleware in Scrapy to rotate IPs manually, taken from free proxy-IP lists published on various websites.
Now I am confused about which option I should choose:
Use TOR
Use a VPN service such as http://www.hotspotshield.com/
Any option better than the above two?
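The manual rotation I have been doing boils down to cycling through a pool of proxy URLs. A minimal sketch (the class name and the proxy addresses are made up; in a real project this would sit inside a downloader middleware registered in DOWNLOADER_MIDDLEWARES, assigning the result to request.meta['proxy'] in process_request()):

```python
import itertools

class RotatingProxyPool(object):
    """Cycle through a fixed list of proxy URLs in round-robin order."""

    def __init__(self, proxies):
        self.proxies = list(proxies)
        self._cycle = itertools.cycle(self.proxies)

    def next_proxy(self):
        # Each call returns the next proxy, wrapping around at the end.
        return next(self._cycle)

# Hypothetical pool; a Scrapy middleware would call next_proxy()
# once per outgoing request.
pool = RotatingProxyPool([
    'http://10.0.0.1:8080',
    'http://10.0.0.2:8080',
])
```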
I am using MongoDB to store the raw HTML data of the web pages scraped with the Scrapy framework. After one day of web scraping, 25 GB of disk space was filled. Is there a way to store the raw data in a compressed format?
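One possible approach (a sketch, independent of any particular schema): compress each page with zlib before inserting it and decompress on read; raw HTML is highly repetitive and compresses very well. With pymongo, the compressed bytes would be wrapped in bson.Binary before being stored.

```python
import zlib

def compress_html(html):
    # Level 9 trades CPU time for maximum disk savings.
    return zlib.compress(html.encode('utf-8'), 9)

def decompress_html(blob):
    return zlib.decompress(blob).decode('utf-8')

# With pymongo (assumed, not shown running here) the insert would look like:
#   collection.insert_one({'url': url, 'html': bson.Binary(compress_html(html))})
```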
I am using the images pipeline to download all the images from different websites.
All the images are downloaded successfully into the folder I defined, but before they are saved to the hard disk I am not able to give the downloaded images a name of my choosing.
Here is my code:
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.http import Request

class jellyImagesPipeline(ImagesPipeline):

    def image_key(self, url, item):
        name = item['image_name']
        return 'full/%s.jpg' % (name)

    def get_media_requests(self, item, info):
        print 'Entered get_media_request'
        for image_url in item['image_urls']:
            yield Request(image_url)
Image_spider.py
    def getImage(self, response):
        item = JellyfishItem()
        item['image_urls'] = [response.url]
        item['image_name'] = response.meta['image_name']
        return item
What changes do I need to make in my code?
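The usual pattern in this situation is to carry the desired name through the request's meta dict and build the storage key from it, rather than from the item. Sketched here with a plain dict standing in for scrapy.http.Request (the helper names are made up), since the key-building logic is the part that matters:

```python
def make_image_request(image_url, image_name):
    # In real Scrapy this would be:
    #   Request(image_url, meta={'image_name': image_name})
    return {'url': image_url, 'meta': {'image_name': image_name}}

def image_key_for(request):
    # Counterpart of image_key(): derive the on-disk path from the
    # name carried in the request's meta.
    return 'full/%s.jpg' % request['meta']['image_name']
```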
Update 1
pipelines.py
class jellyImagesPipeline(ImagesPipeline):

    def image_custom_key(self, response):
        print '\n\n image_custom_key \n\n'
        name = response.meta['image_name'][0]
        img_key = 'full/%s.jpg' % (name)
        print "custom image key:", img_key
        return img_key

    def get_images(self, response, request, info):
        print "\n\n get_images \n\n"
        for key, …

Is there any GUI or console that can be used with a Redis database, the way the MySQL database has MySQL Workbench and the mysql console? So from a redis console we could save values for keys and retrieve all the keys saved so far.
I have installed Redis Desktop Manager, but I cannot get it to work because my IP keeps changing, and each changed IP needs to be granted permission.
How can I use Python to sort a list in this format
format=["12 sheet","4 sheet","48 sheet","6 sheet", "busrear", "phonebox","train"]
into this:
format =["4 sheet", "6 sheet", "12 sheet", "48 sheet", "busrear", "phonebox", "train"]
But if the array is a list of lists, how can we do the same thing for this:
format=[[1, '12 sheet', 0], [2, '4 sheet', 0], [3, '48 sheet', 0], [4, '6 sheet', 0], [5, 'Busrear', 0], [6, 'phonebox', 0], [7, 'train', 0]]
I need a result like this:
format=[[2, '4 sheet', 0],[4, '6 sheet', 0],[1, '12 sheet', 0],[3, '48 sheet', 0],[5, 'Busrear', 0], [6, 'phonebox', 0], [7, 'train', 0]]
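One way to get both orderings (a sketch; the helper name sheet_key is made up): sort with a key function that puts numerically-prefixed entries first, ordered by their number, and everything else after them in alphabetical order. For the list-of-lists case, the same key is applied to the name column.

```python
import re

def sheet_key(name):
    """'12 sheet'-style entries sort numerically and come first;
    plain names ('busrear', 'phonebox') sort alphabetically after them."""
    m = re.match(r'\s*(\d+)', name)
    if m:
        return (0, int(m.group(1)))
    return (1, name.lower())

fmt = ["12 sheet", "4 sheet", "48 sheet", "6 sheet", "busrear", "phonebox", "train"]
fmt_sorted = sorted(fmt, key=sheet_key)

rows = [[1, '12 sheet', 0], [2, '4 sheet', 0], [3, '48 sheet', 0],
        [4, '6 sheet', 0], [5, 'Busrear', 0], [6, 'phonebox', 0], [7, 'train', 0]]
# For nested lists, sort on the name in column 1.
rows_sorted = sorted(rows, key=lambda row: sheet_key(row[1]))
```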
I wrote Python code that fetches the web page corresponding to a given URL and parses all the links on that page into a repository of links. Next, it fetches the content of any URL from the repository just created, parses the links from this new content into the repository, and continues this process for all the links in the repository until stopped, or until a given number of links has been fetched.
Here is the code:
import BeautifulSoup
import urllib2
import itertools
import random

class Crawler(object):
    """docstring for Crawler"""

    def __init__(self):
        self.soup = None                               # Beautiful Soup object
        self.current_page = "http://www.python.org/"   # Current page's address
        self.links = set()                             # Queue with every link fetched
        self.visited_links = set()
        self.counter = 0                               # Simple counter for debug purposes

    def open(self):
        # Open url
        print self.counter, ":", self.current_page
        res = urllib2.urlopen(self.current_page)
        html_code = res.read()
        self.visited_links.add(self.current_page)

        # Fetch every link
        self.soup = BeautifulSoup.BeautifulSoup(html_code)
        page_links = []
        try …

When a user logs in successfully and lands on the home page, there is a "Change Password" link for changing the password. It shows a form for changing the password, with three input boxes: old password, new password, and confirm new password.
Here is my code.
forms.py
class reset_form(forms.Form):
    oldpassword = forms.CharField(max_length=20, widget=forms.TextInput(attrs={'type': 'password', 'placeholder': 'your old Password', 'class': 'span'}))
    newpassword1 = forms.CharField(max_length=20, widget=forms.TextInput(attrs={'type': 'password', 'placeholder': 'New Password', 'class': 'span'}))
    newpassword2 = forms.CharField(max_length=20, widget=forms.TextInput(attrs={'type': 'password', 'placeholder': 'Confirm New Password', 'class': 'span'}))

    def clean(self):
        if 'newpassword1' in self.cleaned_data and 'newpassword2' in self.cleaned_data:
            if self.cleaned_data['newpassword1'] != self.cleaned_data['newpassword2']:
                raise forms.ValidationError(_("The two password fields did not match."))
        return self.cleaned_data
views.py
def change_password(request):
    if request.method == 'POST':
        form = reset_form(request.POST)
        if form.is_valid():
            newpassword = form.cleaned_data['newpassword1'],
            username = request.user.username
            password = request.user.password
            user …

Hi, in my project I need to use the index keyword in a MySQL query, in the ORDER BY clause.
My query looks like this:
SELECT asset.id, day_part.category, asset.site_id, site.name,
environment.category_1, environment.category_2, county.town, county.county,
site.site_id as siteid, media_owner.contractor_name,
IF (audience_category.impact IS NULL, 0, audience_category.impact) as impact,
tv_region.id as tv_region_id,
metropolitan.id as metropolitan_id,
IF (
price.price_site = -1,
IF(
price.price_tv_region = -1,
price.price_nation,
price.price_tv_region
),
price.price_site
) AS price,
format.name AS format,
format.id AS format_id
FROM asset
JOIN site ON asset.site_id = site.id
JOIN day_part ON asset.day_part_id = day_part.id
JOIN media_owner ON site.media_owner_id = media_owner.id
JOIN area ON site.area_id = area.id
JOIN environment …

I have a situation where my sign-up page is served at a URL like http://localhost:8000/signUp?qid=ca1480f4, and after successful registration I want to redirect the user to the login page with the same query string, ?qid=ca1480f4.
views.py
class SignUp(FormView):
    success_url = '/login'

    def post(self, request, *args, **kwargs):
        form = self.form_class(request.POST)
        if form.is_valid():
            form.save()
            return HttpResponseRedirect(self.get_success_url())
        else:
            return self.form_invalid(form)
What changes do I need to make? I can get the query string with self.request.META['QUERY_STRING'].
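One way to do it, sketched here without the Django plumbing (the helper name success_url_with_query is made up): build the redirect target by appending the incoming query string, then have the view's get_success_url() return it.

```python
def success_url_with_query(base_url, query_string):
    """Append the incoming query string (e.g. 'qid=ca1480f4') to the target URL."""
    if query_string:
        return '%s?%s' % (base_url, query_string)
    return base_url

# In the FormView this would be wired up roughly as (sketch):
#   def get_success_url(self):
#       return success_url_with_query(self.success_url,
#                                     self.request.META['QUERY_STRING'])
```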
Tags: python ×5 · django ×2 · list ×2 · scrapy ×2 · compression ×1 · django-forms ×1 · django-views ×1 · image ×1 · mongodb ×1 · mysql ×1 · proxy ×1 · redis ×1 · sorting ×1 · sql ×1 · tor ×1 · web-crawler ×1