html python regex beautifulsoup
Given an HTML link like this:
<a href="urltxt" class="someclass" close="true">texttxt</a>
how do I isolate the URL and the text?
Update
I'm using Beautiful Soup and can't figure out how to do this. Here is what I did:
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
links = soup.findAll('a')
for link in links:
    print "link content:", link.content, " and attr:", link.attrs
and I got:
link content: None and attr: [(u'href', u'_redirectGeneric.asp?genericURL=/root /support.asp')] ...
...
Why am I missing the content?
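A likely explanation, assuming BeautifulSoup 3: a tag's children are exposed as .contents (note the s), while link.content is interpreted as a search for a child <content> tag and so comes back None. A quick check:

for link in links:
    print link.contents   # e.g. [u'texttxt'] for the question's example link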
Edit: elaborating on the 'stuck' suggestion :)
Use Beautiful Soup. Doing it yourself is harder than it looks; you'll be better off using a tried and tested module.
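A minimal sketch of that suggestion, applied to the example link from the question (assuming BeautifulSoup 3 and Python 2, as in the snippets below):

from BeautifulSoup import BeautifulSoup

html = '<a href="urltxt" class="someclass" close="true">texttxt</a>'
soup = BeautifulSoup(html)
link = soup.find('a')
print link['href']   # urltxt
print link.string    # texttxt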
Edit:
I think you want:
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
By the way, it's a bad idea to open the URL inline like that; if something goes wrong, it can get ugly.
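For example, one way to guard the fetch (my own sketch of that point, not part of the original answer; in Python 2, urllib.urlopen raises IOError on failure):

import urllib

url = "http://www.example.com/index.html"
try:
    source = urllib.urlopen(url).read()
except IOError, e:
    print "Could not fetch", url, "-", e
    source = ''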
Edit 2:
This should show you all the links in a page:
import urlparse, urllib
from BeautifulSoup import BeautifulSoup

url = "http://www.example.com/index.html"
source = urllib.urlopen(url).read()
soup = BeautifulSoup(source)

for item in soup.findAll('a'):
    try:
        link = urlparse.urlparse(item['href'].lower())
    except KeyError:
        # Not a valid link (no href attribute)
        pass
    else:
        print link
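For the example link from the question, that loop would print something like the following (the exact repr depends on the Python 2 version; 2.6+ returns a ParseResult named tuple):

ParseResult(scheme='', netloc='', path='urltxt', params='', query='', fragment='')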
And here is a code example showing how to get a link's attributes and contents:
import urllib, BeautifulSoup
url = "http://www.example.com/index.html"
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
for link in soup.findAll('a'):
    print link.attrs, link.contents
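Building on that snippet, a sketch (my own, under the same BeautifulSoup 3 assumptions) that pulls out just the URL and the text for each link, which is what the question asked for:

for link in soup.findAll('a'):
    href = dict(link.attrs).get('href')       # attrs is a list of (name, value) pairs
    text = ''.join(link.findAll(text=True))   # all text nodes inside the tag
    print href, text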