RNs*_*ost 5 python xml beautifulsoup
I have some XML:
<article>
<uselesstag></uselesstag>
<topic>oil, gas</topic>
<body>body text</body>
</article>
<article>
<uselesstag></uselesstag>
<topic>food</topic>
<body>body text</body>
</article>
<article>
<uselesstag></uselesstag>
<topic>cars</topic>
<body>body text</body>
</article>
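There are many, many useless tags. I want to use BeautifulSoup to collect all the text in the body tags together with the associated topic text, and use it to create some new XML.
I am new to Python, but I suspect that some form of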
import arff
from xml.etree import ElementTree
import re
from StringIO import StringIO
from BeautifulSoup import BeautifulSoup

totstring = ""
with open('reut2-000.sgm', 'r') as inF:
    for line in inF:
        # Drop every character that is not in the whitelist below
        string = re.sub("[^0-9a-zA-Z<>/\s=!-\"\"]+", "", line)
        totstring += string

soup = BeautifulSoup(totstring)
body = soup.find("body")
for anchor in soup.findAll('body'):
    # Stick body and its topics in an associated array?
    pass
# No explicit close is needed: the with block closes the file
would work.
1) How do I do this? 2) Should I add a root node to the XML? Otherwise it is not valid XML, is it?
Many thanks.
Edit:
What I ultimately want is:
<article>
<topic>oil, gas</topic>
<body>body text</body>
</article>
<article>
<topic>food</topic>
<body>body text</body>
</article>
<article>
<topic>cars</topic>
<body>body text</body>
</article>
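There are many, many useless tags.
OK, here is the solution.
First, make sure you have 'beautifulsoup4' installed: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
Here is my code to get the text of all the body and topic tags: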
from bs4 import BeautifulSoup
html_doc= """
<article>
<topic>oil, gas</topic>
<body>body text</body>
</article>
<article>
<topic>food</topic>
<body>body text</body>
</article>
<article>
<topic>cars</topic>
<body>body text</body>
</article>
"""
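# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_doc, "html.parser")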
bodies = [a.get_text() for a in soup.find_all('body')]
topics = [a.get_text() for a in soup.find_all('topic')]
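The two lists above only stay aligned if every article has both a topic and a body. A possible way to go from there to the new XML is to walk the articles one at a time, keep each topic with its own body, and write the pairs under a single root element, since a well-formed XML document needs exactly one root. The following is only a sketch under assumptions: the helper names `extract_articles` and `build_xml` and the root name `articles` are made up for illustration, and the sample input is shortened from the question.

```python
from xml.etree import ElementTree as ET

from bs4 import BeautifulSoup

# Sample input shortened from the question, including the unwanted tag.
raw = """
<article>
<uselesstag></uselesstag>
<topic>oil, gas</topic>
<body>body text</body>
</article>
<article>
<uselesstag></uselesstag>
<topic>food</topic>
<body>body text</body>
</article>
"""


def extract_articles(markup):
    """Return (topic, body) text pairs, one per <article> element.

    Hypothetical helper: looking inside each article keeps a topic
    attached to its own body even if some article lacks one of the tags.
    """
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for article in soup.find_all("article"):
        topic = article.find("topic")
        body = article.find("body")
        if topic is None or body is None:
            continue  # skip articles that are missing either tag
        pairs.append((topic.get_text(), body.get_text()))
    return pairs


def build_xml(pairs, root_name="articles"):
    """Serialize the pairs under one root element (root name is an assumption)."""
    root = ET.Element(root_name)
    for topic_text, body_text in pairs:
        article = ET.SubElement(root, "article")
        ET.SubElement(article, "topic").text = topic_text
        ET.SubElement(article, "body").text = body_text
    return ET.tostring(root, encoding="unicode")


pairs = extract_articles(raw)
new_xml = build_xml(pairs)
print(new_xml)
```

Everything other than topic and body is dropped simply because it is never copied into the new tree, so the useless tags do not have to be stripped beforehand.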