What modules would be best for writing a Python program that searches through hundreds of HTML documents and deletes a given string of HTML?
For instance, if I have an HTML doc that contains <a href="test.html">Test</a>, I want to delete that string from every HTML page that has it.
Any help is much appreciated, and I don't need someone to write the program for me, just a helpful point in the right direction.
If the string you are searching for appears literally in the HTML, then simple string replacement will do:
with open(html_file) as f:
    old_html = f.read()
new_html = old_html.replace(my_string, "")
if new_html != old_html:
    # Rewrite the file only if something was actually removed.
    with open(html_file, "w") as f:
        f.write(new_html)
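As an example of the string not being literally in the HTML, suppose you are looking for the "Test" link, as you said. Do you want it to match these snippets of HTML?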
<a href='test.html'>Test</a>
<A HREF='test.html'>Test</A>
<a href="test.html" class="external">Test</a>
<a href="test.html">Test</a>
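And so on: the "same" HTML can be expressed in many different ways. If you know the precise characters used in the HTML, then simple string replacement is fine. If you need to match at the HTML semantic level, then you'll need more advanced tools such as BeautifulSoup, but you may then end up with very different HTML output, even in the parts not affected by the deletion, because the whole file will be parsed and reconstituted.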
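To make the difference concrete, here is a rough sketch that compares a literal substring test with a parser-based one on the snippets above. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without installing anything, and it only detects the link, it does not remove it. The names LinkFinder and semantic_match are made up for this illustration.

```python
from html.parser import HTMLParser

TARGET = '<a href="test.html">Test</a>'


class LinkFinder(HTMLParser):
    """Hypothetical helper: notes whether an <a> with a given href was seen."""

    def __init__(self, href):
        super().__init__()
        self.href = href
        self.found = False

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names and strips the
        # quotes around values, so quote style and case no longer matter.
        if tag == "a" and dict(attrs).get("href") == self.href:
            self.found = True


def semantic_match(snippet, href="test.html"):
    """Return True if the snippet contains <a href=...> pointing at href."""
    finder = LinkFinder(href)
    finder.feed(snippet)
    finder.close()
    return finder.found


variants = [
    "<a href='test.html'>Test</a>",
    "<A HREF='test.html'>Test</A>",
    '<a href="test.html" class="external">Test</a>',
    '<a href="test.html">Test</a>',
]

# Literal matching only catches the variant with identical characters.
literal = [TARGET in v for v in variants]
# Parser-based matching treats all four as the same link.
semantic = [semantic_match(v) for v in variants]

print(literal)   # [False, False, False, True]
print(semantic)  # [True, True, True, True]
```

Whether the parser-based behaviour is what you want depends on your files: if every page was generated from the same template, the literal match is likely enough.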
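To run your code over many files, you'll find os.walk useful for finding files in a tree (os.path.walk did this job in Python 2 but was removed in Python 3), or glob.glob for matching filenames against shell-like wildcard patterns.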
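Putting the pieces together, a minimal sketch might look like the following. The function names (strip_string, html_files_in_tree, html_files_in_dir) are made up for this example, and it assumes the files are UTF-8 and end in .html or .htm; adjust both to suit your pages.

```python
import glob
import os


def strip_string(html_file, my_string):
    """Remove my_string from one file; return True if the file changed.

    Assumes UTF-8. newline="" keeps the file's own line endings intact.
    """
    with open(html_file, encoding="utf-8", newline="") as f:
        old_html = f.read()
    new_html = old_html.replace(my_string, "")
    if new_html == old_html:
        return False
    with open(html_file, "w", encoding="utf-8", newline="") as f:
        f.write(new_html)
    return True


def html_files_in_tree(root):
    """Yield every .html/.htm file under root, using os.walk."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith((".html", ".htm")):
                yield os.path.join(dirpath, name)


def html_files_in_dir(directory):
    """List the .html files directly inside one directory, using glob."""
    return glob.glob(os.path.join(directory, "*.html"))
```

You would then loop over html_files_in_tree("some_directory") and call strip_string(path, my_string) on each path. It is worth trying this on a copy of the files first, since the rewrite happens in place.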