use*_*097 11 ruby screen-scraping ruby-on-rails nokogiri ruby-on-rails-3.1
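I have tried using the Sanitize gem to clean a string that contains a website's HTML.
It only removed the <script> tags themselves, not the JavaScript inside them.
What can I use to remove the JavaScript from a page?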
Phr*_*ogz 13
require 'open-uri' # included with Ruby; only needed to load HTML from a URL
require 'nokogiri' # gem install nokogiri; read more at http://nokogiri.org
# URI.open replaces the bare open(url) call, which Ruby 3.0 no longer supports
html = URI.open('http://stackoverflow.com').read # Get the HTML source string
doc = Nokogiri.HTML(html) # Parse the document
doc.css('script').remove # Remove <script>…</script>
puts doc # Source w/o script blocks
doc.xpath("//@*[starts-with(name(),'on')]").remove # Remove on____ attributes
puts doc # Source w/o script blocks or inline event handlers
I am partial to the Loofah gem. Modified from an example in the docs:
1.9.3p0 :005 > Loofah.fragment("<span onclick='foo'>hello</span> <script>alert('OHAI')</script>").scrub!(:prune).to_s
=> "<span>hello</span> "
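You may also be interested in the ActiveRecord extensions that Loofah provides.
It turns out that Sanitize has a built-in option for this (it is just not well documented):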
Sanitize.clean(content, :remove_contents => ['script', 'style'])
This removed all of the script and style tags (and their contents), which is what I wanted.