Extracting text between HTML tags with Nokogiri

Question

I have HTML like this:

<h1> Header is here</h1>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>
  <h2> Next Header 2</h2>
     <p>not interested</p>
     <p>not interested</p>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>


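I have a basic Nokogiri CSS node search that returns the <p> content, but I can't find an example of how to target all the text between the Nth closing H2 and the next opening H2. I'm using the output to build a CSV, so I'd also like to read a list of files and put each URL in the first column of its row.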

Answer 1

Dan*_*aly 3

require 'nokogiri'

h = '<h1> Header is here</h1>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>
  <h2> Next Header 2</h2>
     <p>not interested</p>
     <p>not interested</p>
  <h2>Header 2 is here</h2>
     <p> Extract me!</p>
     <p> Extract me too!</p>
'

doc = Nokogiri::HTML(h)

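# Ranges of the running delimiter count (i) whose content should be extracted.
# i is incremented at each delimiter tag, so 2...3 selects the nodes that follow
# the 2nd delimiter (the first <h2>) up to the 3rd.
# The triple dot excludes the end point: 1...2 covers 1 but not 2.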
EXTRACT_RANGES = [
  2...3,
  4...5
]

# Tags which count as delimiters, not to be extracted
DELIMITER_TAGS = [
  "h1",
  "h2"
]

extracted_text = []

i = 0
# Change /"html"/"body" to the correct path of the tag which contains this list
(doc/"html"/"body").children.each do |el|

  if (DELIMITER_TAGS.include? el.name)
    i += 1
  else
    extract = EXTRACT_RANGES.any? { |cur_range| cur_range.include?(i) }

    if extract
      s = el.inner_text.strip
      extracted_text << s unless s.empty?
    end
  end

end

# Print out extracted text (each element's inner text is separated by newlines)
puts extracted_text.join("\n")

