BS4 replace_with: result is no longer in the tree

Question

Nat*_*tan 7 python beautifulsoup replacewith

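I need to replace multiple words in an HTML document. At the moment I do this by calling replace_with once per replacement. Calling replace_with twice on the same NavigableString raises a ValueError (see the example below), because the element being replaced is no longer part of the tree.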

Minimal example:

#!/usr/bin/env python3
from bs4 import BeautifulSoup
import re
def test1():
  html = \
  '''
    Identify
  '''
  soup = BeautifulSoup(html,features="html.parser")
  for txt in soup.findAll(text=True):
    if re.search('identify',txt,re.I) and txt.parent.name != 'a':
      newtext = re.sub('identify', '<a href="test.html"> test </a>', txt.lower())
      txt.replace_with(BeautifulSoup(newtext, features="html.parser"))
      txt.replace_with(BeautifulSoup(newtext, features="html.parser"))
      # I called it twice here to make the code as small as possible.
      # Usually it would be a different newtext ..
      # which was created using the replaced txt looking for a different word to replace.        

  return soup
print(test1())

Expected result:

The txt is == newstring

Result:

ValueError: Cannot replace one element with another when the element to be replaced is not
part of the tree.

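A simple solution would be to build the complete new string first and replace only once at the end, but I would like to understand the current behavior.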

Answer 1

And*_*ely 5

The first txt.replace_with(...) removes the NavigableString (here stored in the variable txt) from the document tree (doc). This effectively sets txt.parent to None.

The second txt.replace_with(...) looks at the parent attribute, finds None (because txt has already been removed from the tree), and raises a ValueError.
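
The behavior described above can be observed directly. This is a minimal sketch, assuming a small `<p>` wrapper around the text (the wrapper and the replacement strings "first" and "second" are made up for illustration, not taken from the question):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Identify</p>", features="html.parser")
txt = soup.find(string=True)      # the NavigableString "Identify"
print(txt.parent.name)            # the string is attached to the <p> tag

# First replacement: txt is detached from the tree, so its parent becomes None.
txt.replace_with("first")
print(txt.parent)
print(soup)

# Second replacement on the same, now detached, object raises ValueError.
try:
    txt.replace_with("second")
    second_call_failed = False
except ValueError as exc:
    second_call_failed = True
    print("ValueError:", exc)
```

The variable txt still refers to the old string object; only the node that took its place is part of the tree after the first call.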

As you say at the end of your question, one solution is to call .replace_with() only once:

import re
from bs4 import BeautifulSoup

def test1():
    html = \
    '''
    word1 word2 word3 word4
    '''
    soup = BeautifulSoup(html,features="html.parser")

    for txt in soup.findAll(text=True):
        if re.search('word1', txt, flags=re.I) and txt.parent.name != 'a':
            newtext = re.sub('word1', '<a href="test.html"> test1 </a>', txt.lower())
            
            # ...some computations

            newtext = re.sub('word3', '<a href="test.html"> test2 </a>', newtext)

            # ...some more computations

            # and at the end, replace txt only once:
            txt.replace_with(BeautifulSoup(newtext, features="html.parser"))

    return soup
print(test1())

Prints:

<a href="test.html"> test1 </a> word2 <a href="test.html"> test2 </a> word4
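
If one replacement per word is preferred, another option might be to query the tree again for each word, so that every replace_with call is made on a string that is still attached. The following is only a sketch under that assumption; the helper name link_words, the `<p>` wrapper, and the replacements mapping are hypothetical and not part of the original answer:

```python
import re
from bs4 import BeautifulSoup

def link_words(html, replacements):
    """Replace each word with its link markup, one pass over the tree per word."""
    soup = BeautifulSoup(html, features="html.parser")
    for word, link_html in replacements.items():
        pattern = re.compile(re.escape(word), flags=re.I)
        # Re-query on every pass so each string is still part of the tree.
        # list(...) takes a snapshot, because the loop body modifies the tree.
        for txt in list(soup.find_all(string=True)):
            if txt.parent.name == "a" or not pattern.search(txt):
                continue
            # A function as the replacement keeps link_html literal.
            new_html = pattern.sub(lambda match: link_html, txt)
            txt.replace_with(BeautifulSoup(new_html, features="html.parser"))
    return soup

result = link_words(
    "<p>word1 word2 word3 word4</p>",
    {
        "word1": '<a href="test.html">test1</a>',
        "word3": '<a href="test.html">test2</a>',
    },
)
print(result)
```

As in the code above, the text is re-parsed as HTML, so a string containing a literal < or & would need escaping first if that matters for the document at hand.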

Archived:	5 years, 3 months ago
Viewed:	186 times
Last active:	5 years, 3 months ago