Nai*_*aba 5 python whitespace beautifulsoup
BeautifulSoup is removing the whitespace that precedes a newline inside a tag:
print BeautifulSoup("<?xml version='1.0' encoding='UTF-8'?><section>    \n</section>")
The code above prints:
<?xml version="1.0" encoding="utf-8"?>
<section>
</section>
Note that the four spaces after the opening section tag are gone! Interestingly, if I do this instead:
print BeautifulSoup("<?xml version='1.0' encoding='UTF-8'?><section>a    \n</section>")
I get:
<?xml version="1.0" encoding="utf-8"?>
<section>a
</section>
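The four spaces after the "a" are now preserved! How can I get the four spaces to show up in the output of the original print statement?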
As a workaround, you could try replacing every <section>...</section> with <pre>...</pre> before parsing. BeautifulSoup will then preserve the whitespace exactly. For example:
from bs4 import BeautifulSoup
import re
html = "<?xml version='1.0' encoding='UTF-8'?><section>    \n</section>"
html = re.sub(r'(\</?)(section)(\>)', r'\1pre\3', html)
soup = BeautifulSoup(html, "lxml")
print repr(soup.pre.text) # repr used to show where the spaces are
Which gives you:
u'    \n'
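The tag substitution used in this workaround can be checked on its own, without a parser. The following is a minimal sketch using only the standard `re` module; `rename_tag` is a hypothetical helper name introduced for illustration and is not part of BeautifulSoup. Unlike the regex above, it assumes tags may carry attributes, so it matches the tag name followed by whitespace, `>` or `/` rather than requiring `>` immediately after the name.

```python
import re

def rename_tag(markup, old, new):
    """Rename opening and closing `old` tags to `new` (hypothetical helper)."""
    # Match "<old" or "</old" only when followed by whitespace, ">" or "/",
    # so tags with attributes are renamed but e.g. <sectionfoo> is left alone.
    pattern = r'(</?)' + re.escape(old) + r'(?=[\s>/])'
    # A function replacement avoids backslash-escape surprises in `new`.
    return re.sub(pattern, lambda m: m.group(1) + new, markup)

original = "<?xml version='1.0' encoding='UTF-8'?><section>    \n</section>"

# Swap <section> for <pre> before handing the markup to the parser.
swapped = rename_tag(original, "section", "pre")
print(repr(swapped))

# Swap back afterwards if the serialized output needs the original tag name.
restored = rename_tag(swapped, "pre", "section")
print(repr(restored))
```

Note that swapping back renames every `pre` tag, including any that were `pre` tags in the original document, so this round trip is only safe when the input contains no genuine `<pre>` elements.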