Tags: scrapy, web-scraping, scrapy-splash
I want to enter a value into a text input field, submit the form, and then scrape the new data shown on the page after the form has been submitted. How can this be done?

Here is the HTML form on the page. I want to change the input value from 10 to 100 and submit the form:
```html
<form action="https://de.iss.fst.com/ba-u6-72-nbr-902-112-x-140-x-13-12-mm-simmerringr-ba-a-mit-feder-fst-40411416#product-offers-anchor" method="post" _lpchecked="1">
  <div class="fieldset">
    <div class="field qty">
      <div class="control">
        <label class="label" for="qty-2">
          <span>Preise für</span>
        </label>
        <input type="text" name="pieces" class="validate-length maximum-length-10 qty" maxlength="12" id="qty-2" value="10">
        <label class="label" for="qty-2">
          <span>Teile</span>
        </label>
        <span class="actions">
          <button type="submit" title="Absenden" class="action">
            <span>Absenden</span>
          </button>
        </span>
      </div>
    </div>
  </div>
</form>
```

Update! New working code:
```python
import scrapy
import pymongo
from scrapy_splash import SplashRequest, SplashFormRequest
from issfst.items import IssfstItem


class IssSpider(scrapy.Spider):
    name = "issfst_spider"
    start_urls = ["https://de.iss.fst.com/dichtungen/radialwellendichtringe/rwdr-mit-geschlossenem-kafig/ba"]
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["imgurl",
                               "Produktdatenblatt",
                               "Materialdatenblatt"],
    }

    def parse(self, response):
        self.log("I just visited: " + response.url)
        urls = response.css('.details-button > a::attr(href)').extract()

        for url in urls:
            formdata = {'pieces': '200'}
            yield SplashFormRequest.from_response(
                response,
                url=url,
                formdata=formdata,
                callback=self.parse_details,
                args={'wait': 3}
            )

        # follow pagination link
        next_page_url = response.css('li.item > a.next::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        item = IssfstItem()
        # scrape image url (no trailing commas here: they would wrap
        # each value in a one-element tuple)
        item['imgurl'] = response.css('img.fotorama__img::attr(src)').extract()
        # scrape download pdf links
        item['Produktdatenblatt'] = response.css('a.action[data-group="productdatasheet"]::attr(href)').extract_first()
        item['Materialdatenblatt'] = response.css('a.action[data-group="materialdatasheet"]::attr(href)').extract_first()
        item['Beschreibung'] = response.css('.description > p::text').extract_first()
        yield item
```
You should not rely on the HTML source to learn the parameter names of the POST request. Instead, use your browser's developer tools and watch the Network tab (with "preserve log" enabled) while submitting the form.
So you are looking at the URL https://de.iss.fst.com/ba-72-nbr-902-155-x-174-x-12-0-mm-simmerringr-ba-a-mit-feder-fst-40411424#product-offers-anchor and making a POST request with the parameters pieces and form_key.
If you set the form data with the wrong name, 'value', while the site expects 'pieces', the request will fail.
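The key name matters because the POST body is nothing more than URL-encoded key/value pairs, and the server looks fields up by the input's name attribute. A quick stdlib sketch (the form_key value here is a placeholder for the real token scraped from the page):

```python
from urllib.parse import urlencode

# FormRequest serializes formdata as a plain URL-encoded body; the server
# finds each field by the <input>'s name attribute, so keys must match.
correct = urlencode({"form_key": "abc123", "pieces": "100"})
wrong = urlencode({"value": "100"})  # no input named "value" exists in the form

print(correct)  # form_key=abc123&pieces=100
print(wrong)    # value=100
```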
Now, as a demonstration in a scrapy shell session:
```python
scrapy shell "https://de.iss.fst.com/ba-72-nbr-902-155-x-174-x-12-0-mm-simmerringr-ba-a-mit-feder-fst-40411424"
...
from scrapy import FormRequest

## SET THE POST PARAMETERS
form_key = response.css('[name="form_key"]::attr(value)').get()
# note: response.xpath('input[@name="form_key"]/@value') returns nothing
# because the path is relative to the document root; for a hidden element
# like this, the CSS selector above is the simplest approach
pieces = "100"
form_data = {'form_key': form_key, 'pieces': pieces}  # with the correct names

## POST THE REQUEST
fetch(
    FormRequest(
        'https://de.iss.fst.com/ba-72-nbr-902-155-x-174-x-12-0-mm-simmerringr-ba-a-mit-feder-fst-40411424#product-offers-anchor',
        formdata=form_data)
)  # note the '#product-offers-anchor' appended to the URL; without it this won't work
view(response)  # to see the page in your default browser
```
Now you can adapt the above to your own code.
Views: 5674