Beautiful Soup replaces < with &lt;

Question

use*_*290 3 python beautifulsoup python-3.x

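I have found the text I want to replace, but when I print soup the formatting changes: <div id="content">stuff here</div> becomes &lt;div id="content"&gt;stuff here&lt;/div&gt;. How can I preserve the data? I have already tried print(soup.encode(formatter="none")), but that produces the same incorrect format.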

from bs4 import BeautifulSoup

with open(index_file) as fp:
    soup = BeautifulSoup(fp, "html.parser")

found = soup.find("div", {"id": "content"})
found.replace_with(data)

When I print found, I get the correct format:

>>> print(found)
<div id="content">stuff</div>

The contents of index_file are as follows:

 <!DOCTYPE html>
 <head>
    Apples 
 </head>
 <body>

   <div id="page">
    This is the Id of the page

  <div id="main">

     <div id="content">
       stuff here
     </div>
  </div>
 footer should go here
 </div>
</body>
</html>

Answer 1

Mad*_*ist 7

The found object is not a Python string. It is a Tag that just happens to have a nice string representation. You can verify this by running

type(found)

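A Tag is part of the object hierarchy that Beautiful Soup creates so that you can interact with the HTML. Another such object is NavigableString. A NavigableString is much like a string, but it can only hold what goes into the content portion of the HTML, that is, text rather than markup.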

When you do

found.replace_with('<div id="content">stuff here</div>')

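you are asking for the Tag to be replaced with a NavigableString containing that literal text. The only way HTML can display that string as text is with all of its angle brackets escaped, which is exactly what happens.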
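The escaping described above can be reproduced with a small sketch. The sample markup and the variable names here are assumptions made for illustration, not the asker's actual file:

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical, shortened stand-in for the asker's document
html = '<div id="main"><div id="content">stuff here</div></div>'
soup = BeautifulSoup(html, "html.parser")

found = soup.find("div", {"id": "content"})

# A plain str is inserted as a text node (NavigableString), not as markup
found.replace_with('<div id="content">new stuff</div>')

inserted = soup.div.contents[0]
rendered = str(soup)

print(type(inserted).__name__)
print(rendered)
```

The inserted node is text, so the angle brackets are escaped when the tree is serialized.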

To avoid this mess, you probably want to keep your Tag and replace only its contents:

found.string.replace_with('stuff here')

Note that the correct replacement does not attempt to overwrite the tag itself.
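If the replacement really has to be a whole element rather than just new text, one possible approach (not part of the original answer) is to parse the replacement markup first, so that a Tag is inserted instead of a string. In this sketch the value of data and the name new_tag are assumptions:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the asker's document
html = '<div id="main"><div id="content">stuff here</div></div>'
soup = BeautifulSoup(html, "html.parser")

# Assumed replacement markup, standing in for the question's `data`
data = '<div id="content">new stuff</div>'

# Parse the markup so that we hold a Tag rather than a plain string
new_tag = BeautifulSoup(data, "html.parser").div

found = soup.find("div", {"id": "content"})
found.replace_with(new_tag)  # a Tag is inserted as markup, so nothing is escaped

rendered = str(soup)
print(rendered)
```

This keeps the replacement inside Beautiful Soup's object hierarchy, so serializing the tree produces real tags.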

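When you do found.replace_with(...), the object referred to by the name found is replaced within its parent hierarchy. The name found, however, still points to the same, now outdated, object as before. That is why printing soup shows the update but printing found does not.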
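The stale reference can be seen in a minimal sketch like the following, where the sample markup and the replacement text are assumptions:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the asker's document
html = '<div id="main"><div id="content">stuff here</div></div>'
soup = BeautifulSoup(html, "html.parser")

found = soup.find("div", {"id": "content"})
found.replace_with("replacement text")

# `found` still names the old Tag, which is now detached from the tree
old_markup = str(found)
is_detached = found.parent is None
new_markup = str(soup)

print(old_markup)   # the original element, unchanged
print(is_detached)
print(new_markup)   # the tree with the replacement in place
```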
