想象一下,我有一个带有内容的文件test.html,
<?xml version="1.0" encoding="UTF-8" standalone="no"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>Components of the SDK</title><link rel="stylesheet" href="core.css" type="text/css"/><meta name="generator" content="DocBook XSL Stylesheets V1.74.0"/></head><body></body></html>
Run Code Online (Sandbox Code Playgroud)
并在python提示符中执行此操作,
>>>import lxml.html
>>>t = lxml.html.parse('test.html')
>>>lxml.html.etree.tostring(t)
>>>'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\n<?xml version="1.0" encoding="UTF-8" standalone="no"??><html xmlns="http://www.w3.org/1999/xhtml"><head><title>Components of the SDK</title><link rel="stylesheet" href="core.css" type="text/css"/><meta name="generator" content="DocBook XSL Stylesheets V1.74.0"/></head><body/></html>'
Run Code Online (Sandbox Code Playgroud)
请注意在lxml读入数据然后再通过tostring将其打印出来后,doctype和xml标签是如何反转的?我们如何修复它以便它不会尝试修改文档(假设它已经很好地形成).