Sly*_*per 7 python scrapy python-2.7
Using Scrapy, how do I get the value of a JavaScript variable?
This is the markup I'm working with:
<script rel="bmc-data">
var match = 'yes';
var country = 'uk';
var tmData = {
"googleExperimentVariation": "1",
"pageTitle": "Child Care",
"page_type": "claimed",
"company_state": "wyostate",
"company_city": "mycity"
};
</script>
I want to check the value of the page_type variable. If it is "claimed", process the page; otherwise, move on.
I tried this:
pattern = r'page_type = "(\w+)",'
response.xpath('//script[@rel="bmc-data"]').re(pattern)
But of course this doesn't work, presumably because my regex is wrong.
I would suggest using js2xml for this (disclaimer: I wrote js2xml):
>>> import scrapy
>>> import js2xml
>>> html = '''<script rel="bmc-data">
... var match = 'yes';
... var country = 'uk';
... var tmData = {
... "googleExperimentVariation": "1",
... "pageTitle": "Child Care",
... "page_type": "claimed",
... "company_state": "wyostate",
... "company_city": "mycity"
... };
... </script>'''
>>> selector = scrapy.Selector(text=html)
>>> selector.xpath('//script/text()').extract_first()
u'\n var match = \'yes\';\n var country = \'uk\';\n var tmData = {\n "googleExperimentVariation": "1",\n "pageTitle": "Child Care",\n "page_type": "claimed",\n "company_state": "wyostate",\n "company_city": "mycity"\n };\n'
>>> jscode = selector.xpath('//script/text()').extract_first()
>>> jstree = js2xml.parse(jscode)
>>> print(js2xml.pretty_print(jstree))
<program>
  <var name="match">
    <string>yes</string>
  </var>
  <var name="country">
    <string>uk</string>
  </var>
  <var name="tmData">
    <object>
      <property name="googleExperimentVariation">
        <string>1</string>
      </property>
      <property name="pageTitle">
        <string>Child Care</string>
      </property>
      <property name="page_type">
        <string>claimed</string>
      </property>
      <property name="company_state">
        <string>wyostate</string>
      </property>
      <property name="company_city">
        <string>mycity</string>
      </property>
    </object>
  </var>
</program>
>>> jstree.xpath('//var[@name="tmData"]/object')[0]
<Element object at 0x7f0b0018f050>
>>> from pprint import pprint
>>> data = js2xml.jsonlike.make_dict(jstree.xpath('//var[@name="tmData"]/object')[0])
>>> pprint(data)
{'company_city': 'mycity',
 'company_state': 'wyostate',
 'googleExperimentVariation': '1',
 'pageTitle': 'Child Care',
 'page_type': 'claimed'}
>>> data['page_type']
'claimed'
>>>
The problem is in your regex pattern:
import re

# The text you are looking for is: "page_type": "claimed",
# html_body is assumed to hold the page's HTML as a string
re.findall(r'page_type": "(.+)"', html_body)
# ['claimed']
Or, in your case, in the context of Scrapy selectors:
response.xpath('//script[@rel="bmc-data"]').re('page_type": "(.+)"')
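To see why the pattern from the question returns nothing, it can help to run both patterns against the script text outside of Scrapy. The following is only a sketch using the standard `re` module: `script_text` is a stand-in for the body of the script element, and the `[^"]+` capture is a slightly stricter variant of the pattern above (an assumption on my part, not something from the original answer) that stops at the first closing quote.

```python
import re

# Stand-in for the text of the <script rel="bmc-data"> element.
script_text = '''
var match = 'yes';
var country = 'uk';
var tmData = {
    "googleExperimentVariation": "1",
    "pageTitle": "Child Care",
    "page_type": "claimed",
    "company_state": "wyostate",
    "company_city": "mycity"
};
'''

# Pattern from the question: it expects `page_type = "..."`, but the
# source has a quoted key followed by a colon, so nothing matches.
original = re.findall(r'page_type = "(\w+)",', script_text)

# Adjusted pattern: quoted key, colon, then capture everything up to
# the next double quote.
fixed = re.findall(r'"page_type"\s*:\s*"([^"]+)"', script_text)

# Decide whether the page should be processed.
is_claimed = bool(fixed) and fixed[0] == "claimed"
print(original, fixed, is_claimed)
```

In a spider callback, the same adjusted pattern could be passed to the selector's `.re()` call shown above, with the `is_claimed` check deciding whether to process the page.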
If you need to parse several variables like this, I'd recommend the approach from Paul's answer, since regex isn't always as reliable as XML parsing.
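If pulling in js2xml is not an option, one possible middle ground for reading several keys at once is to cut out the object literal and hand it to `json.loads`. This is a different technique from either answer and only a sketch: it assumes the literal assigned to the variable happens to be valid JSON (double-quoted keys and strings, no trailing commas, no nested `};` sequences), which is true for the `tmData` example but not for JavaScript in general. The helper name `extract_js_object` is hypothetical.

```python
import json
import re


def extract_js_object(script_text, var_name):
    """Return the object literal assigned to `var_name` as a dict, or None.

    Assumes the literal is valid JSON and ends at the first `};`.
    """
    pattern = r'var\s+' + re.escape(var_name) + r'\s*=\s*(\{.*?\})\s*;'
    found = re.search(pattern, script_text, re.DOTALL)
    if found is None:
        return None
    try:
        return json.loads(found.group(1))
    except ValueError:
        # The literal was not valid JSON (e.g. single quotes, comments).
        return None


# Stand-in for the text of the <script rel="bmc-data"> element.
script_text = '''
var match = 'yes';
var country = 'uk';
var tmData = {
    "googleExperimentVariation": "1",
    "pageTitle": "Child Care",
    "page_type": "claimed",
    "company_state": "wyostate",
    "company_city": "mycity"
};
'''

tm_data = extract_js_object(script_text, "tmData")
if tm_data is not None and tm_data.get("page_type") == "claimed":
    print("claimed page in", tm_data["company_city"])
```

When the JavaScript is less regular than this, a real parser such as js2xml remains the safer choice.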