Parsing an HTML file with BeautifulSoup in Python

Question

I am using BeautifulSoup to replace every comma in an HTML file with &sbquo;. Here is my code:

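import re
import sys

# Assumes bs4; with BeautifulSoup 3 use: from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup

# Read the HTML file named on the command line
with open(sys.argv[1], "r") as f:
    data = f.read()

soup = BeautifulSoup(data)

comma = re.compile(',')

# Replace the comma in every text node that contains one
for t in soup.findAll(text=comma):
    t.replaceWith(t.replace(',', '&sbquo;'))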

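This code works, except when the HTML file contains some JavaScript. In that case it also replaces the commas (,) inside the JavaScript code, which is not wanted. I only want to replace commas in the text content of the HTML file.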

Answer 1

Sea*_*ira 5

soup.findAll can take a callable:

tags_to_skip = set(["script", "style"])
# Add to this set as needed

def valid_text(text):
    """Filter text nodes on the basis of their parent tag's name

    If the parent tag's name is found in ``tags_to_skip`` then
    the text node is dropped.  Otherwise, it is kept.
    """
    # text is a NavigableString, so text.parent is the enclosing tag
    return text.parent.name.lower() not in tags_to_skip

# Pass the callable as text= so it filters text nodes rather than tags
for t in soup.findAll(text=valid_text):
    t.replaceWith(t.replace(',', '&sbquo;'))

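For anyone who wants to try the idea end to end, here is a minimal self-contained sketch. It assumes the bs4 package (where `find_all`, `string=` and `replace_with` are the current names for `findAll`, `text=` and `replaceWith`) and the built-in `html.parser`. The names `TAGS_TO_SKIP`, `replace_commas` and `is_replaceable` are made up for illustration. One further assumption: as far as I know, bs4 escapes a literal `&sbquo;` string to `&amp;sbquo;` on output, so this sketch inserts the character U+201A and lets `formatter="html"` write it back out as an entity.

```python
from bs4 import BeautifulSoup

# Tags whose text should be left untouched (illustrative set; extend as needed).
TAGS_TO_SKIP = {"script", "style"}


def replace_commas(html):
    """Return html with commas in text outside TAGS_TO_SKIP replaced by U+201A."""
    soup = BeautifulSoup(html, "html.parser")

    def is_replaceable(text):
        # text is a NavigableString; text.parent is the tag that encloses it.
        return "," in text and text.parent.name not in TAGS_TO_SKIP

    # find_all builds the full result list first, so replacing nodes
    # inside the loop does not disturb the iteration.
    for node in soup.find_all(string=is_replaceable):
        node.replace_with(node.replace(",", "\u201a"))

    # formatter="html" converts characters that have a named HTML entity
    # (U+201A is sbquo) into that entity when serializing.
    return soup.decode(formatter="html")


sample = "<p>a, b</p><script>f(1, 2);</script>"
result = replace_commas(sample)
print(result)
```

The comma inside the `<p>` text is rewritten, while the one inside `<script>` is left alone because its parent tag is in the skip set.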
