How do I extract the text between <h1> and </h1> in Python?

Question

cra*_*rax 3 html python tags extract beautifulsoup

I am stuck trying to extract the text between <h1> and </h1>.

Please help me.

My code is:

import bs4
import re
import urllib2

url2='http://www.flipkart.com/mobiles/pr?sid=tyy,4io&otracker=ch_vn_mobile_filter_Top%20Brands_All#jumpTo=0|20'
htmlf = urllib2.urlopen(url2)
soup = bs4.BeautifulSoup(htmlf)
#res=soup.findAll('div',attrs={'class':'product-unit'})
for res in soup.findAll('a',attrs={'class':'fk-display-block'}):
    suburl='http://www.flipkart.com/'+res.get('href')
    subhtml = urllib2.urlopen(suburl)
    subhtml = subhtml.read()
    subhtml = re.sub(r'\s\s+','',subhtml)
    subsoup=bs4.BeautifulSoup(subhtml)
    res2=subsoup.find('h1',attrs={'itemprop':'name'})
    if res2:
        print res2

Output:

<h1 itemprop="name">Moto G</h1>
<h1 itemprop="name">Moto E</h1>
<h1 itemprop="name">Moto E</h1>

But I want this:

Moto G
Moto E
Moto E

Answer 1

sha*_*aan 5

Calling get_text() on any HTML tag returns the text associated with that tag. So you just need to call get_text() on res2, i.e.:

if res2:
    print res2.get_text()
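
For reference, here is a minimal self-contained sketch of the same idea in Python 3 syntax, assuming the bs4 package is installed. The inline SAMPLE_HTML string and the helper name extract_names are hypothetical stand-ins for the pages fetched in the question; they are not part of the original code.

```python
from bs4 import BeautifulSoup

# Inline sample markup standing in for the downloaded product pages
# (an assumption for illustration; the question fetches real pages).
SAMPLE_HTML = """
<html><body>
  <h1 itemprop="name">Moto G</h1>
  <h1 itemprop="name">Moto E</h1>
  <h1>Unrelated heading</h1>
</body></html>
"""


def extract_names(html):
    """Return the text of every <h1 itemprop="name"> element in html."""
    soup = BeautifulSoup(html, "html.parser")
    headings = soup.find_all("h1", attrs={"itemprop": "name"})
    # get_text() returns only the text inside the tag, without the markup.
    return [heading.get_text() for heading in headings]


for name in extract_names(SAMPLE_HTML):
    print(name)
```

Printing the tag itself gives the full markup, which is what the question observed; get_text() gives just the enclosed string.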

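PS: As a side note, I think the line subhtml = re.sub(r'\s\s+','',subhtml) in your code is an expensive operation, because it runs over the whole page. If all you want is to get rid of the extra whitespace around the extracted text, you can do this instead: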

if res2:
    print res2.get_text().strip()
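
One caveat worth illustrating: strip() only trims whitespace at the two ends of the string, so it is not a full replacement for the regex if the heading also contains runs of whitespace in the middle. The sketch below, in Python 3 syntax and assuming bs4 is installed, compares strip() with a split-and-join approach; the helper name clean_text and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def clean_text(tag):
    """Return the tag's text with every run of whitespace collapsed to one space."""
    # str.split() with no arguments splits on any whitespace run and
    # discards leading/trailing whitespace, so joining with " " normalizes it.
    return " ".join(tag.get_text().split())


# Sample heading with padding and a line break inside (made up for illustration).
html = '<h1 itemprop="name">\n   Moto   G \n (2nd  Gen) </h1>'
tag = BeautifulSoup(html, "html.parser").find("h1")

print(repr(tag.get_text().strip()))  # inner whitespace runs are still present
print(repr(clean_text(tag)))         # inner whitespace runs are collapsed
```

Either way, the cleanup runs on the short extracted string rather than on the whole page, which is the point of the side note above.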

Archived:	11 years, 5 months ago
Views:	3892
Last activity:	11 years, 5 months ago