Extracting href URLs with Python requests

Question

Str*_*man 1 python xpath lxml python-3.x python-requests

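I want to extract a URL from an XPath using the requests package in Python. I can get the link text, but nothing I have tried returns the URL. Can anyone help?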

ipdb> webpage.xpath(xpath_url + '/text()')
['Text of the URL']
ipdb> webpage.xpath(xpath_url + '/a()')
*** lxml.etree.XPathEvalError: Invalid expression
ipdb> webpage.xpath(xpath_url + '/href()')
*** lxml.etree.XPathEvalError: Invalid expression
ipdb> webpage.xpath(xpath_url + '/url()')
*** lxml.etree.XPathEvalError: Invalid expression

I started learning with this tutorial: http://docs.python-guide.org/en/latest/scenarios/scrape/

It seems like this should be easy, but my searches turned up nothing.

Thanks.

Answer 1

jer*_*ija 6

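Have you tried webpage.xpath(xpath_url + '/@href')? In XPath, attributes are selected with @name. text() works because it is a built-in node test, whereas a(), href() and url() are not valid XPath, which is why lxml raises "Invalid expression".

Here is the complete code: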

from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
webpage = html.fromstring(page.content)

webpage.xpath('//a/@href')

The result should be:

[
  'http://econpy.pythonanywhere.com/ex/002.html',
  'http://econpy.pythonanywhere.com/ex/003.html', 
  'http://econpy.pythonanywhere.com/ex/004.html',
  'http://econpy.pythonanywhere.com/ex/005.html'
]
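
To tie this back to the question, here is a minimal sketch that reads both the link text and the href from the same XPath. It parses an inline HTML string instead of fetching a page, so the markup, the example.com URL and the value of xpath_url are all assumptions made for illustration, not the asker's real page.

```python
from lxml import html

# Hypothetical markup standing in for the downloaded page.
SAMPLE = """
<html><body>
  <div id="nav">
    <a href="http://example.com/page2.html">Text of the URL</a>
  </div>
</body></html>
"""

webpage = html.fromstring(SAMPLE)

# Hypothetical XPath that points at the <a> element, like xpath_url in the question.
xpath_url = '//div[@id="nav"]/a'

texts = webpage.xpath(xpath_url + '/text()')  # text nodes inside the link
hrefs = webpage.xpath(xpath_url + '/@href')   # values of the href attribute

# Alternative: select the element itself, then read the attribute from it.
hrefs_via_get = [a.get('href') for a in webpage.xpath(xpath_url)]

print(texts)
print(hrefs)
```

Both approaches return the same URLs here. The element-based form can be handier when you also need other attributes or the text of the same link.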

Answer 2

n1c*_*1c9 5

You would be better off using BeautifulSoup:

import requests
from bs4 import BeautifulSoup

response = requests.get('http://testurl.com')  # placeholder URL; requests needs the scheme
soup = BeautifulSoup(response.text, "lxml")  # lxml is just the parser for reading the HTML
soup.find_all('a', href=True)  # every <a> tag that has an href attribute

You can print the result, add it to a list, and so on. To iterate over it, use:

links = soup.find_all('a', href=True)  # only <a> tags that carry an href
for link in links:
    print(link['href'])  # the URL itself rather than the whole tag
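
For a version that runs without network access, the sketch below applies the same idea to an inline HTML string. The markup and the example.com URLs are made up for illustration, and it uses Python's built-in "html.parser" in place of lxml so that BeautifulSoup is the only dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second anchor deliberately has no href.
SAMPLE = """
<ul>
  <li><a href="http://example.com/a.html">First</a></li>
  <li><a name="bookmark">No link here</a></li>
  <li><a href="http://example.com/b.html">Second</a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# href=True keeps only the <a> tags that actually have an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]

for link in links:
    print(link)
```

Filtering with href=True avoids a KeyError on anchors that have no href, such as the bookmark anchor in the sample.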
