Mic*_*Lee 0 python regex beautifulsoup
I want to strip the markup from:
[<span class="street-address">
510 E Airline Way
</span>]
I have been using this clean function to remove everything between < and >:
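import re

def clean(val):
    # Coerce non-string input (e.g. a list of tags) to its string form.
    if not isinstance(val, str): val = str(val)
    # Remove anything that looks like a tag.
    val = re.sub(r'<.*?>', '', val)
    # Collapse runs of whitespace into a single space.
    val = re.sub(r'\s+', ' ', val)
    return val.strip()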
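It produces [ 510 E Airline Way ]
I tried adding something inside the clean function to remove the '[' and ']' characters as well; basically I just want to get "510 E Airline Way".
Does anyone have a clue what I can add to the clean function?
Thanks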
Using re:
>>> import re
>>> s='[<span class="street-address">\n 510 E Airline Way\n </span>]'
>>> re.sub(r'\[|\]|\s*<[^>]*>\s*', '', s)
'510 E Airline Way'
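To fold the same idea back into the clean function from the question, one option is an extra substitution that drops the literal brackets. This is only a sketch, assuming Python 3 and that any '[' or ']' in the input is unwanted; the bracket-removal step is the only addition to the original function.

```python
import re

def clean(val):
    """Strip tags and square brackets, then collapse whitespace."""
    if not isinstance(val, str):
        val = str(val)
    val = re.sub(r'<[^>]*>', '', val)   # drop anything that looks like a tag
    val = re.sub(r'[\[\]]', '', val)    # drop literal '[' and ']' (assumed unwanted everywhere)
    val = re.sub(r'\s+', ' ', val)      # collapse runs of whitespace
    return val.strip()

print(clean('[<span class="street-address">\n 510 E Airline Way\n </span>]'))
```

Note that this also removes brackets that are part of the real text, so it only fits when the brackets come from printing a list of tags, as in the question.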
Using BeautifulSoup:
>>> from BeautifulSoup import BeautifulSoup
>>> s='[<span class="street-address">\n 510 E Airline Way\n </span>]'
>>> b = BeautifulSoup(s)
>>> b.find('span').getText()
u'510 E Airline Way'
Using lxml:
>>> from lxml import html
>>> s='[<span class="street-address">\n 510 E Airline Way\n </span>]'
>>> h = html.document_fromstring(s)
>>> h.cssselect('span')[0].text.strip()
'510 E Airline Way'
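If no third-party parser is available, the same parser-based approach can be approximated with the standard library's html.parser. This is a rough sketch assuming Python 3; the SpanText class and span_text helper are hypothetical names introduced here, and only text inside span elements is collected, mirroring the find('span') and cssselect('span') calls above.

```python
from html.parser import HTMLParser

class SpanText(HTMLParser):
    """Collect the text found inside <span> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <span> elements are currently open
        self.chunks = []    # text fragments seen inside a <span>

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'span' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def span_text(markup):
    parser = SpanText()
    parser.feed(markup)
    parser.close()
    # Join the fragments and collapse all whitespace runs to single spaces.
    return ' '.join(''.join(parser.chunks).split())

print(span_text('[<span class="street-address">\n 510 E Airline Way\n </span>]'))
```

Because the brackets sit outside the span, they are skipped without any bracket-specific handling.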