Jam*_*ams 6 python xml beautifulsoup elementtree
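I am using Python and BeautifulSoup to parse and access elements of an XML document. I modify the values of a few elements and then write the XML back to the file. The problem is that the updated XML file contains a newline at the start and end of every element's text value, so the file looks like this: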
<annotation>
 <folder>
  Definitiva
 </folder>
 <filename>
  armas_229.jpg
 </filename>
 <path>
  /tmp/tmpygedczp5/handgun/images/armas_229.jpg
 </path>
 <size>
  <width>
   1800
  </width>
  <height>
   1426
  </height>
  <depth>
   3
  </depth>
 </size>
 <segmented>
  0
 </segmented>
 <object>
  <name>
   handgun
  </name>
  <pose>
   Unspecified
  </pose>
  <truncated>
   0
  </truncated>
  <difficult>
   0
  </difficult>
  <bndbox>
   <xmin>
    1001
   </xmin>
   <ymin>
    549
   </ymin>
   <xmax>
    1453
   </xmax>
   <ymax>
    1147
   </ymax>
  </bndbox>
 </object>
</annotation>
Instead, I would prefer the output file to look like this:
<annotation>
    <folder>Definitiva</folder>
    <filename>armas_229.jpg</filename>
    <path>/tmp/tmpygedczp5/handgun/images/armas_229.jpg</path>
    <size>
        <width>1800</width>
        <height>1426</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>handgun</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>1001</xmin>
            <ymin>549</ymin>
            <xmax>1453</xmax>
            <ymax>1147</ymax>
        </bndbox>
    </object>
</annotation>
I open the file and parse its contents into a BeautifulSoup object (the "soup").
After modifying a few values in the document, I write it back to the file using the string returned by BeautifulSoup.prettify.
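My assumption is that BeautifulSoup.prettify adds these superfluous newlines by default, and there does not appear to be a good way to change that behavior. Have I missed something in the BeautifulSoup documentation, or is it really impossible to change this behavior, so that I need another way to write the XML to a file? Perhaps I would be better off rewriting this with xml.etree.ElementTree?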
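It turns out that if I use xml.etree.ElementTree instead of BeautifulSoup, I get the output I want directly. For example, the code below reads an XML file, strips the leading and trailing newlines/whitespace from every element's text, and then writes the tree to an XML file.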
import argparse
from xml.etree import ElementTree


# ------------------------------------------------------------------------------
def reformat(
    input_xml: str,
    output_xml: str,
):
    tree = ElementTree.parse(input_xml)

    # remove extraneous newlines and whitespace from text elements
    # (tree.iter() replaces getiterator(), which was removed in Python 3.9)
    for element in tree.iter():
        if element.text:
            element.text = element.text.strip()

    # write the updated XML to the output file
    tree.write(output_xml)


# ------------------------------------------------------------------------------
if __name__ == "__main__":

    # parse the command line arguments
    args_parser = argparse.ArgumentParser()
    args_parser.add_argument(
        "--in",
        required=True,
        type=str,
        help="file path of original XML",
    )
    args_parser.add_argument(
        "--out",
        required=True,
        type=str,
        help="file path of reformatted XML",
    )
    args = vars(args_parser.parse_args())

    reformat(
        args["in"],
        args["out"],
    )
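One caveat about the approach above: stripping the text only removes the padding inside each element, and the indentation between elements still depends on whatever whitespace the input file already had. On Python 3.9 or later, a possible variation is to let ElementTree.indent rebuild the indentation after stripping. The following is only a minimal sketch under that assumption. The function name normalize_xml, the SAMPLE string, and the four-space default are made up for illustration, and it assumes the document has no mixed content (text sitting next to child elements), as in the annotation file from the question.

```python
from xml.etree import ElementTree

# Hypothetical sample in the shape produced by prettify (illustration only).
SAMPLE = """<annotation>
 <folder>
  Definitiva
 </folder>
 <size>
  <width>
   1800
  </width>
 </size>
</annotation>"""


def normalize_xml(xml_text: str, space: str = "    ") -> str:
    """Strip padding around element text, then re-indent the whole tree.

    Assumes no mixed content: stripping would alter text that sits
    next to child elements.
    """
    root = ElementTree.fromstring(xml_text)

    # "\n  Definitiva\n " becomes "Definitiva"; whitespace-only text
    # in container elements becomes an empty string.
    for element in root.iter():
        if element.text:
            element.text = element.text.strip()

    # Requires Python 3.9+. Rewrites the whitespace between elements so
    # that each nesting level is indented by `space`.
    ElementTree.indent(root, space=space)

    return ElementTree.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(normalize_xml(SAMPLE))
```

To work with files instead of strings, the same two steps (strip, then ElementTree.indent) can be applied to the tree returned by ElementTree.parse before calling tree.write.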