Getting the URL from a BeautifulSoup object

Question

Someone hands my function a BeautifulSoup (BS4) object that they obtained with the typical call:

soup = BeautifulSoup(url)

My code:

def doSomethingUseful(soup):
    url = soup.???

How do I get the original URL from the soup object? I have tried reading the documentation and the BeautifulSoup source code, and I am still not sure.

Answer 1

Bil*_* M. 5

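If the url variable is a string containing an actual URL, then you should forget about BeautifulSoup here and simply use that same url variable. BeautifulSoup is meant for parsing HTML, not bare URLs. In fact, if you try to use it this way, you get a warning: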

>>> from bs4 import BeautifulSoup
>>> url = "https://foo"
>>> soup = BeautifulSoup(url)
C:\Python27\lib\site-packages\bs4\__init__.py:336: UserWarning: "https://foo" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
  ' that document to Beautiful Soup.' % decoded_markup

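Because the URL is just a string, BeautifulSoup does not really know what to do with it when you "soupify" it, other than wrap it in basic HTML (the exact wrapping depends on the parser in use):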

>>> soup
<html><body><p>https://foo</p></body></html>

If you still want to extract the URL from it, you can simply use .text on the object, since the URL is the only thing in there:

>>> print(soup.text)
https://foo
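
If the caller does have the real URL, one way to follow the advice above is to pass it along with the parsed document instead of trying to recover it from the soup. The sketch below is only an illustration: ParsedPage, parse_page and do_something_useful are hypothetical names, and the HTML is assumed to have been downloaded separately (for example with an HTTP client such as requests).

```python
from dataclasses import dataclass

from bs4 import BeautifulSoup


@dataclass
class ParsedPage:
    """Hypothetical container pairing a parsed document with its source URL."""
    url: str
    soup: BeautifulSoup


def parse_page(url: str, html: str) -> ParsedPage:
    # Assumption: html was already fetched from url by the caller.
    return ParsedPage(url=url, soup=BeautifulSoup(html, "html.parser"))


def do_something_useful(page: ParsedPage) -> str:
    # The URL comes from the field the caller filled in, not from the soup.
    return page.url


page = parse_page("https://foo", "<html><body><p>hello</p></body></html>")
print(do_something_useful(page))
print(page.soup.p.text)
```

Keeping the URL next to the soup means the function never has to guess where the markup came from.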

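On the other hand, if url is not really a URL at all but a chunk of HTML (in which case the variable name is very misleading), then how you extract a particular link from it depends on how that link appears in the markup. Doing a find to get the first a tag and then extracting its href value would be one way.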

>>> actual_html = '<html><body><a href="http://moo">My link text</a></body></html>'
>>> newsoup = BeautifulSoup(actual_html, "html.parser")  # naming the parser avoids the "no parser specified" warning
>>> newsoup.find('a')['href']
'http://moo'
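
A slightly more defensive variant of the same idea is sketched below. It assumes the markup may contain a tags without an href, or no links at all; first_link and all_links are hypothetical helper names, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup


def first_link(soup):
    """Return the href of the first <a> tag that has one, or None."""
    tag = soup.find("a", href=True)  # href=True skips anchors without an href
    return tag["href"] if tag is not None else None


def all_links(soup):
    """Return the href of every <a> tag that has one, in document order."""
    return [tag["href"] for tag in soup.find_all("a", href=True)]


sample_html = (
    '<html><body>'
    '<a name="top">no href here</a>'
    '<a href="http://moo">My link text</a>'
    '<a href="/relative">Other link</a>'
    '</body></html>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")
print(first_link(sample_soup))
print(all_links(sample_soup))
```

Indexing with ['href'] directly, as in the session above, raises an error when find returns None or the tag has no href, which is why the sketch filters on href=True first.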

Archived:	6 years, 9 months ago
Views:	7960
Last activity:	6 years, 9 months ago