Searching for a specific HTML string using Python

Gab*_*abe 1 html python

Which modules would be best for writing a Python program that searches through hundreds of HTML documents and deletes a given string of HTML? For instance, if an HTML document contains <a href="test.html">Test</a>, I want to delete it from every HTML page that has it.

Any help is much appreciated, and I don't need someone to write the program for me, just a helpful point in the right direction.

Ned*_*der 5

If the string you are searching for appears literally in the HTML, then a simple string replacement will do:

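# html_file is the path to one document; my_string is the exact HTML to delete.
with open(html_file) as f:
    old_html = f.read()
new_html = old_html.replace(my_string, "")
# Rewrite the file only if something was actually removed.
if new_html != old_html:
    with open(html_file, "w") as f:
        f.write(new_html)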

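As an example of the string not appearing literally in the HTML, suppose you are looking for the "Test" link, as you said. Do you want it to match these snippets of HTML?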

<a href='test.html'>Test</a>
<A HREF='test.html'>Test</A>
<a href="test.html" class="external">Test</a>
<a href="test.html">Tes&#116;</a>

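And so on: the "same" HTML can be expressed in many different ways. If you know the precise characters used in the HTML, then simple string replacement will work. If you need to match at the HTML semantic level, you will need a more advanced tool such as BeautifulSoup. Be aware, though, that the output HTML may then differ considerably from the input, even in parts unaffected by the deletion, because the whole file is parsed and reconstituted.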
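To make the "semantic level" point concrete, here is a minimal sketch that uses the standard library's html.parser instead of BeautifulSoup, so it runs without extra installs. It only detects matching links; it does not delete them. The names LinkFinder and count_links are made up for this illustration.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Count <a> elements that have a given href and visible text."""

    def __init__(self, href, text):
        # The default convert_charrefs=True decodes entities such as &#116;.
        super().__init__()
        self.href = href
        self.text = text
        self.matches = 0
        self._in_link = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, with quotes stripped.
        if tag == "a" and dict(attrs).get("href") == self.href:
            self._in_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            if "".join(self._buffer).strip() == self.text:
                self.matches += 1
            self._in_link = False


def count_links(html, href, text):
    """Return how many links in `html` point at `href` and read `text`."""
    finder = LinkFinder(href, text)
    finder.feed(html)
    finder.close()
    return finder.matches
```

Called as count_links(page, "test.html", "Test"), this finds one match in each of the four variants listed above, none of which a literal search for the double-quoted, lowercase form would catch in the first, second, or fourth case.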

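To run the code over many files, you will find os.walk useful for finding files in a directory tree (os.path.walk, its Python 2 counterpart, was removed in Python 3), or glob.glob for matching file names against shell-like wildcard patterns.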
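As a rough sketch of how the literal replacement and glob.glob could fit together, assuming the files are UTF-8 and the target string appears character for character: the function name strip_string_from_files and the pattern shown in the usage note are illustrative, not part of any library.

```python
import glob
import os


def strip_string_from_files(pattern, target):
    """Remove `target` from every file matching `pattern`.

    Returns the list of paths that were actually rewritten.
    """
    changed = []
    # recursive=True lets "**" in the pattern match nested directories.
    for path in glob.glob(pattern, recursive=True):
        if not os.path.isfile(path):
            continue
        # Assumption: files are UTF-8. newline="" keeps line endings untouched.
        with open(path, encoding="utf-8", newline="") as f:
            old_html = f.read()
        new_html = old_html.replace(target, "")
        # Rewrite only the files that contained the string.
        if new_html != old_html:
            with open(path, "w", encoding="utf-8", newline="") as f:
                f.write(new_html)
            changed.append(path)
    return changed
```

For example, strip_string_from_files("site/**/*.html", '<a href="test.html">Test</a>') would process every .html file under a hypothetical site directory. It is worth running this on a copy of the files first, since the changes are written in place.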