Pet*_*sen 2 python xml parsing
I'm trying to build a parser and save the results as an XML file, but I've run into a problem.
Could someone please take a look at my code?
Traceback: TypeError: expected string or buffer
import urllib2, re
from xml.dom.minidom import Document
from BeautifulSoup import BeautifulSoup as bs

osc = open('OSCTEST.html','r')
oscread = osc.read()
soup=bs(oscread)

doc = Document()
root = doc.createElement('root')
doc.appendChild(root)
countries = doc.createElement('countries')
root.appendChild(countries)

findtags1 = re.compile ('<h1 class="title metadata_title content_perceived_text(.*?)`</h1>', re.DOTALL | re.IGNORECASE).findall(soup)
findtags2 = re.compile ('<span class="content_text">(.*?)</span>', re.DOTALL | re.IGNORECASE).findall(soup)

for header in findtags1:
    title_elem = doc.createElement('title')
    countries.appendChild(title_elem)
    header_elem = doc.createTextNode(header)
    title_elem.appendChild(header_elem)

for item in findtags2:
    art_elem = doc.createElement('artikel')
    countries.appendChild(art_elem)
    s = item.replace('<P>','')
    t = s.replace('</P>','')
    text_elem = doc.createTextNode(t)
    art_elem.appendChild(text_elem)

print doc.toprettyxml()
It's good that you're trying to parse the HTML with BeautifulSoup, but this won't work:
re.compile('<h1 class="title metadata_title content_perceived_text(.*?)`</h1>',
re.DOTALL | re.IGNORECASE).findall(soup)
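You are trying to parse a BeautifulSoup object with a regular expression. A compiled pattern's findall expects a string, not a soup object, which is why you get the TypeError. Instead, you should use the findAll method on the soup, like this: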
regex = re.compile('^title metadata_title content_perceived_text', re.IGNORECASE)
for tag in soup.findAll('h1', attrs = { 'class' : regex }):
    print tag.contents
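To see where the traceback comes from, here is a minimal sketch that reproduces the same kind of failure without BeautifulSoup. `FakeSoup` is a hypothetical stand-in for any parsed-document object that is not a string; it is not part of any library. The single-argument `print(...)` calls are written so they run on both Python 2 and Python 3.

```python
import re


class FakeSoup(object):
    """Hypothetical stand-in for a parsed document object (not a string)."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


pattern = re.compile(r'<span class="content_text">(.*?)</span>', re.DOTALL)
soup_like = FakeSoup('<span class="content_text">hello</span>')

# Passing a non-string object to findall raises TypeError.
try:
    pattern.findall(soup_like)
    error = None
except TypeError as exc:
    error = exc

# Converting the object to a string first gives the regex something to scan.
matches = pattern.findall(str(soup_like))

print(error)
print(matches)
```

Converting to a string makes the error go away, but it also throws away everything the parser did for you, so it is a diagnostic rather than a recommendation.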
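If you really do want to parse the document as text with regular expressions, then don't use BeautifulSoup; just read the document into a string and parse that. But I'd suggest you take the time to learn how BeautifulSoup works, because that is the preferred approach. See the documentation for details.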
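For completeness, the regex-on-a-string route could look roughly like the sketch below. It uses only the standard library. `SAMPLE_HTML` is a made-up stand-in for the contents of OSCTEST.html, `build_xml` is a hypothetical helper name, and the `h1` pattern is my guess at what the question's pattern was meant to match, so adjust both to the real markup.

```python
import re
from xml.dom.minidom import Document

# Made-up sample standing in for the contents of OSCTEST.html.
SAMPLE_HTML = """
<h1 class="title metadata_title content_perceived_text">Denmark</h1>
<span class="content_text"><P>First article.</P></span>
"""

# Assumed patterns; the h1 pattern tolerates extra class names after the prefix.
TITLE_RE = re.compile(
    r'<h1 class="title metadata_title content_perceived_text[^"]*">(.*?)</h1>',
    re.DOTALL | re.IGNORECASE)
TEXT_RE = re.compile(
    r'<span class="content_text">(.*?)</span>',
    re.DOTALL | re.IGNORECASE)


def build_xml(html):
    """Build root/countries with title and artikel children from an HTML string."""
    doc = Document()
    root = doc.createElement('root')
    doc.appendChild(root)
    countries = doc.createElement('countries')
    root.appendChild(countries)

    # findall works here because html is a plain string.
    for header in TITLE_RE.findall(html):
        title_elem = doc.createElement('title')
        title_elem.appendChild(doc.createTextNode(header.strip()))
        countries.appendChild(title_elem)

    for item in TEXT_RE.findall(html):
        art_elem = doc.createElement('artikel')
        # Strip <P> and </P> in either case, as the question's replace calls do.
        text = re.sub(r'</?p>', '', item, flags=re.IGNORECASE).strip()
        art_elem.appendChild(doc.createTextNode(text))
        countries.appendChild(art_elem)

    return doc


xml_doc = build_xml(SAMPLE_HTML)
print(xml_doc.toprettyxml())
```

With a real file you would pass `open('OSCTEST.html').read()` to `build_xml` instead of the sample. Regexes like these break easily when attribute order or whitespace changes, which is the main reason the parser-based approach is preferred.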