Ton*_*shi 5 ruby ruby-on-rails nokogiri
I ran into this HTML:
<div class='featured'>
<h1>
How to extract this?
<span>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</span>
<span class="moredetail ">
<a href="/hello" title="hello">hello</a>
</span>
<div class="clear"></div>
</h1>
</div>
I want to extract the <h1> text "How to extract this?". How can I do that?
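I tried the following code, but the text of the other elements gets appended as well. I'm not sure how to exclude them so that I get only the text of the <h1> itself.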
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open(url))
records = doc.css(".featured h1")
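#css returns a collection; use #at_css to get the first matching node. All of a node's contents, including its text, are children, and in this case the text is its first child. If you want all the children that are not elements, you could also do something like children.reject(&:element?).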
data = '
<div class="featured">
<h1>
How to extract this?
<span>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</span>
<span class="moredetail ">
<a href="/hello" title="hello">hello</a>
</span>
<div class="clear"></div>
</h1>
</div>
'
require 'nokogiri'
text = Nokogiri::HTML(data).at_css('.featured h1').children.first.text
text # => "\n How to extract this?\n "
Alternatively, you can use XPath:
Nokogiri::HTML(data).at_xpath('//*[@class="featured"]/h1/text()').text