Convert HTML to plain text (keeping <br> line breaks)

not*_*ere 10 ruby nokogiri

Is it possible to convert HTML to plain text with Nokogiri? I also want the <br /> tags to be reflected in the output.

For example, given this HTML:

<p>ala ma kota</p> <br /> <span>i kot to idiota </span>

I want this output:

ala ma kota
i kot to idiota

When I just call Nokogiri::HTML(my_html).text, the <br /> tag is dropped and no line break appears:

ala ma kota i kot to idiota

not*_*ere 17

Instead of writing a complex regular expression, I used Nokogiri.

Working solution (KISS!):

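require 'nokogiri'

# Returns the text content of an HTML string, turning <br> into newlines.
def strip_html(str)
  document = Nokogiri::HTML.parse(str)
  # Swap every <br> element for a newline before extracting the text
  document.css("br").each { |node| node.replace("\n") }
  document.text
end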
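Text extracted this way can still carry stray spaces around the inserted newlines, for instance when the source has whitespace between tags. Below is a minimal sketch of a cleanup step, assuming it is acceptable to trim every line and drop the empty ones. `tidy_lines` is a hypothetical helper name, not part of Nokogiri, and the sample string is only an illustration of such input.

```ruby
# Hypothetical helper (not part of Nokogiri): trims each line of the
# extracted text and drops lines that end up empty.
def tidy_lines(text)
  text.each_line           # iterate line by line (terminators included)
      .map(&:strip)        # remove leading/trailing whitespace, incl. "\n"
      .reject(&:empty?)    # discard blank lines
      .join("\n")          # stitch the remaining lines back together
end

# Illustrative input with spaces around the line break
puts tidy_lines("ala ma kota \n i kot to idiota ")
```

Keeping this as a separate step leaves the Nokogiri part unchanged, so it can be skipped when the original spacing matters.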


Phr*_*ogz 8

Nothing like this exists by default, but you can easily put something together that comes close to the desired output:

require 'nokogiri'
def render_to_ascii(node)
  blocks = %w[p div address]                      # els to put newlines after
  swaps  = { "br"=>"\n", "hr"=>"\n#{'-'*70}\n" }  # content to swap out
  dup = node.dup                                  # don't munge the original

  # Get rid of superfluous whitespace in the source
  dup.xpath('.//text()').each{ |t| t.content=t.text.gsub(/\s+/,' ') }

  # Swap out the swaps
  dup.css(swaps.keys.join(',')).each{ |n| n.replace( swaps[n.name] ) }

  # Slap a couple newlines after each block level element
  dup.css(blocks.join(',')).each{ |n| n.after("\n\n") }

  # Return the modified text content
  dup.text
end

frag = Nokogiri::HTML.fragment "<p>It is the end of the world
  as         we
  know it<br>and <i>I</i> <strong>feel</strong>
  <a href='blah'>fine</a>.</p><div>Capische<hr>Buddy?</div>"

puts render_to_ascii(frag)
#=> It is the end of the world as we know it
#=> and I feel fine.
#=> 
#=> Capische
#=> ----------------------------------------------------------------------
#=> Buddy?
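Because a pair of newlines is appended after every block-level element, nested blocks or a trailing block can leave longer runs of blank lines in the result. Below is a minimal sketch of a post-processing step, assuming one blank line between paragraphs is the desired maximum. `squeeze_blank_lines` is a hypothetical name and works on any string, independent of Nokogiri.

```ruby
# Hypothetical helper: collapses runs of three or more newlines into a
# single blank line and removes trailing whitespace at the end.
def squeeze_blank_lines(text)
  text.gsub(/\n{3,}/, "\n\n").rstrip
end

# Illustrative input with an over-long gap and trailing newlines
puts squeeze_blank_lines("Capische\n\n\n\nBuddy?\n\n")
```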