Remove all JavaScript from an HTML page

use*_*097 11 ruby screen-scraping ruby-on-rails nokogiri ruby-on-rails-3.1

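I have tried using the Sanitize gem to clean a string that contains a website's HTML.

It only removed the <script> tags, not the JavaScript inside the script tags.

What can I use to remove the JavaScript from a page?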

Phr*_*ogz 13

require 'open-uri'      # included with Ruby; only needed to load HTML from a URL
require 'nokogiri'      # gem install nokogiri   read more at http://nokogiri.org

html = URI.open('http://stackoverflow.com')          # Fetch the HTML (plain open no longer takes URLs in Ruby 3.0+)
doc = Nokogiri.HTML(html)                            # Parse the document

doc.css('script').remove                             # Remove <script>…</script>
puts doc                                             # Source w/o script blocks

doc.xpath("//@*[starts-with(name(),'on')]").remove   # Remove on____ attributes
puts doc                                             # Source w/o script blocks or on____ handlers


the*_*Man 6

I'm partial to the Loofah gem. Modified from an example in the docs:

1.9.3p0 :005 > Loofah.fragment("<span onclick='foo'>hello</span> <script>alert('OHAI')</script>").scrub!(:prune).to_s
 => "<span>hello</span> " 

You may be interested in the ActiveRecord extensions that Loofah provides.


use*_*097 6

It turns out Sanitize has a built-in option for this (it is just not well documented)...

Sanitize.clean(content, :remove_contents => ['script', 'style'])

This removed all the script and style tags (and their contents), which is what I wanted.