chu*_*ley 5 ruby xpath nokogiri
I have HTML like this:
<h1> Header is here</h1>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
<h2> Next Header 2</h2>
<p>not interested</p>
<p>not interested</p>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
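I have a basic Nokogiri CSS node search that returns the <p> content, but I can't find an example of how to target all the text between the Nth closing H2 and the next opening H2. I'm creating a CSV from the output, so I'd also like to read a list of files and put the URL as the first result.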
require 'rubygems'
require 'nokogiri'
h = '<h1> Header is here</h1>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
<h2> Next Header 2</h2>
<p>not interested</p>
<p>not interested</p>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
'
doc = Nokogiri::HTML(h)
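# Specify the ranges between delimiter tags that you want to extract.
# A counter is incremented at every delimiter tag, so 2...3 selects the
# content that follows the 2nd delimiter (here, the first <h2>).
# The triple dot excludes the end point: 1...2 includes 1 but not 2.
EXTRACT_RANGES = [
  2...3,
  4...5
]
# Tags which count as delimiters, not to be extracted
DELIMITER_TAGS = [
  "h1",
  "h2"
]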
extracted_text = []
i = 0
# Change /"html"/"body" to the correct path of the tag which contains this list
(doc/"html"/"body").children.each do |el|
  if DELIMITER_TAGS.include?(el.name)
    # Each delimiter tag advances the section counter
    i += 1
  elsif EXTRACT_RANGES.any? { |range| range.include?(i) }
    # Keep the non-empty text of elements inside a wanted section
    s = el.inner_text.strip
    extracted_text << s unless s.empty?
  end
end
# Print out extracted text (each element's inner text is separated by newlines)
puts extracted_text.join("\n")
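
For the CSV part of the question, one possible approach is to build one row per source with the URL in the first column, followed by the extracted strings. This is a minimal sketch using Ruby's csv library; it assumes the extracted text has already been collected per source (for example by running the loop above once per file), and `rows_to_csv` and the example URLs are hypothetical.

```ruby
require 'csv'

# Hypothetical helper: takes a hash of source URL => array of extracted
# strings and returns CSV text with the URL as the first field of each row.
def rows_to_csv(extracted_by_source)
  CSV.generate do |csv|
    extracted_by_source.each do |source, texts|
      csv << [source, *texts]
    end
  end
end

# Made-up sample data standing in for the real extraction results.
sample = {
  'http://example.com/a.html' => ['Extract me!', 'Extract me too!'],
  'http://example.com/b.html' => ['Another paragraph']
}

puts rows_to_csv(sample)
```

How a file name maps to its URL is not stated in the question, so that lookup is left to the caller.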