msh*_*rir 11 python regex screen-scraping
I want to get the value of a hidden input field in HTML.
<input type="hidden" name="fooId" value="12-3456789-1111111111" />
I want to write a regular expression in Python that will return the value of fooId, given that I know the line in the HTML follows the format:
<input type="hidden" name="fooId" value="**[id is here]**" />
Can someone provide an example in Python that parses the HTML for the value?
Vin*_*vic 27
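For this particular case, BeautifulSoup is harder to write than a regex, but it is much more robust. I'm just contributing the BeautifulSoup example, given that you already know which regexp to use :-)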
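from bs4 import BeautifulSoup  # BeautifulSoup 4 (version 3 used `from BeautifulSoup import BeautifulSoup`)
# Or retrieve it from the web, etc.
with open('/yourwebsite/page.html', 'r') as f:
    html_data = f.read()
# Create the soup object from the HTML data
soup = BeautifulSoup(html_data, 'html.parser')
# Find the proper tag. The HTML attribute `name` has to go in attrs,
# because find()'s first parameter (the tag name) is itself called `name`.
fooId = soup.find('input', attrs={'name': 'fooId', 'type': 'hidden'})
value = list(fooId.attrs.values())[2]  # The value of the third attribute of the desired tag
# or index it directly via fooId['value']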
小智 18
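I agree with Vinko that BeautifulSoup is the way to go. However, I suggest using fooId['value'] to get the attribute rather than relying on value being the third attribute.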
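from bs4 import BeautifulSoup  # BeautifulSoup 4 (version 3 used `from BeautifulSoup import BeautifulSoup`)
# Or retrieve it from the web, etc.
with open('/yourwebsite/page.html', 'r') as f:
    html_data = f.read()
# Create the soup object from the HTML data
soup = BeautifulSoup(html_data, 'html.parser')
# Find the proper tag. The HTML attribute `name` has to go in attrs,
# because find()'s first parameter (the tag name) is itself called `name`.
fooId = soup.find('input', attrs={'name': 'fooId', 'type': 'hidden'})
value = fooId['value']  # The value attribute, looked up by name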
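For comparison, the same lookup by attribute name can be sketched with only the standard library's html.parser module, which may be handy where BeautifulSoup is not installed. This swaps in a different parser from the one used above; the class name HiddenInputFinder and the function name find_hidden_value are made up for this illustration.

```python
from html.parser import HTMLParser


class HiddenInputFinder(HTMLParser):
    """Collects the value of the first hidden <input> with a given name (hypothetical helper)."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; a dict allows lookup by name
        attributes = dict(attrs)
        if (
            self.value is None
            and tag == "input"
            and attributes.get("type") == "hidden"
            and attributes.get("name") == self.field_name
        ):
            self.value = attributes.get("value")


def find_hidden_value(html, field_name):
    """Return the field's value, or None if no matching tag exists."""
    finder = HiddenInputFinder(field_name)
    finder.feed(html)
    return finder.value


# Sample input taken from the question
html_data = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'
print(find_hidden_value(html_data, "fooId"))
```

Because the attributes are read into a dict, the order in which they appear inside the tag does not matter, and a missing tag yields None instead of an exception.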
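import re
# Sample input from the question; in practice this is the page source as a string
inputHTML = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'
# The capture group goes around the value, since that is the part we want back
reg = re.compile(r'<input type="hidden" name="fooId" value="([^"]*)" />')
value = reg.search(inputHTML).group(1)
print('Value is', value)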
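The pattern above only matches when the tag is written exactly in that form. If the attribute order or spacing might vary, one possible sketch is to match the tag first and then pull the value out of it, returning None when nothing is found. The names TAG_RE, VALUE_RE and extract_foo_id are made up for this illustration, and this is still not a real HTML parser: it assumes double-quoted attributes and no ">" inside attribute values.

```python
import re

# Any <input ...> tag that carries name="fooId", whatever the attribute order
TAG_RE = re.compile(r'<input\b[^>]*\sname="fooId"[^>]*>', re.IGNORECASE)
# The value="..." attribute inside a tag that has already been matched
VALUE_RE = re.compile(r'\svalue="([^"]*)"', re.IGNORECASE)


def extract_foo_id(html):
    """Return the fooId value, or None when the tag or its value attribute is missing."""
    tag = TAG_RE.search(html)
    if tag is None:
        return None
    value = VALUE_RE.search(tag.group(0))
    return value.group(1) if value else None


# Sample input taken from the question
print(extract_foo_id('<input type="hidden" name="fooId" value="12-3456789-1111111111" />'))
```

Checking for None before calling group(1) avoids the AttributeError that a bare reg.search(...).group(1) raises when the page does not contain the field.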