I want to extract a URL from an XPath expression using the requests package in Python. I can get the text, but none of my attempts returns the URL. Can anyone help?
ipdb> webpage.xpath(xpath_url + '/text()')
['Text of the URL']
ipdb> webpage.xpath(xpath_url + '/a()')
*** lxml.etree.XPathEvalError: Invalid expression
ipdb> webpage.xpath(xpath_url + '/href()')
*** lxml.etree.XPathEvalError: Invalid expression
ipdb> webpage.xpath(xpath_url + '/url()')
*** lxml.etree.XPathEvalError: Invalid expression
I started out with this tutorial: http://docs.python-guide.org/en/latest/scenarios/scrape/
It seems like it should be easy, but nothing I found while searching has worked.
Thanks.
Have you tried webpage.xpath(xpath_url + '/@href')?
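For example, with a hypothetical xpath_url of '//a' (the question doesn't show the real expression), '/text()' selects the link text while '/@href' selects the attribute value:

xpath_url = '//a'                             # hypothetical value; the original expression isn't shown
texts = webpage.xpath(xpath_url + '/text()')  # link text, as in the original attempt
urls = webpage.xpath(xpath_url + '/@href')    # href attribute values, returned as strings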
Here is the full code:
from lxml import html
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
webpage = html.fromstring(page.content)
webpage.xpath('//a/@href')
The result should be:
[
'http://econpy.pythonanywhere.com/ex/002.html',
'http://econpy.pythonanywhere.com/ex/003.html',
'http://econpy.pythonanywhere.com/ex/004.html',
'http://econpy.pythonanywhere.com/ex/005.html'
]
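If you need the link text and the URL together, a minimal sketch (using the same webpage element tree as above) is to select the <a> elements and read each attribute:

for link in webpage.xpath('//a'):
    print(link.text_content(), link.get('href'))  # anchor text and its href attribute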
Using BeautifulSoup would be better:
from bs4 import BeautifulSoup
import requests

html = requests.get('http://testurl.com').text  # pass the HTML string, not the Response object
soup = BeautifulSoup(html, "lxml")  # lxml is just the parser used to read the html
soup.find_all('a', href=True)  # this finds every <a> tag that has an href attribute
You can print the matches, append them to a list, and so on. To iterate over them, use:
links = soup.find_all('a', href=True)
for link in links:
    print(link.get('href'))
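To collect just the URLs into a list (a small sketch, assuming the same soup object as above):

urls = [a.get('href') for a in soup.find_all('a', href=True)]  # list of href strings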