How do I parse this Craigslist page in this particular way?

Question

This is the page in question: http://phoenix.craigslist.org/cpg/

What I want to do is create an array that looks like this:

Date (captured from the h4 tags on that page) => in cell [0][0][0],
Link text => in cell [0][1][0],
Link href => in cell [0][1][1]

That is, each row stores these three items for one posting.
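
To make the target layout concrete, here is a minimal sketch of that nested array in plain Ruby. The helper name `build_row` and the variable `rows` are hypothetical, chosen only for illustration, and the sample values are copied from the example rows in this question rather than fetched from the site.

```ruby
# Hypothetical helper: builds one row in the layout described above.
# row[0][0] is the date, row[1][0] the link text, row[1][1] the link href.
def build_row(date, link_text, href)
  [[date], [link_text, href]]
end

rows = [
  build_row("Mon May 28", "Leads need follow up",
            "http://phoenix.craigslist.org/wvl/cpg/3043296202.html"),
  build_row("Mon May 28", ".Net/Java Developers",
            "http://phoenix.craigslist.org/cph/cpg/3043067349.html")
]

puts rows[0][0][0]  # date of the first row
puts rows[0][1][0]  # link text of the first row
puts rows[0][1][1]  # link href of the first row
```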

All I have done so far is pull in all the h4 tags and store them in a hash, like this:

contents2[link[:date]] = content_page.css("h4").text

The problem is that a single cell ends up holding the text of every h4 tag on the whole page, whereas I want one date per cell.

For example:

0 => Mon May 28 - Leads need follow up - (Phoenix) - http://phoenix.craigslist.org/wvl/cpg/3043296202.html
1 => Mon May 28 - .Net/Java Developers - (phoenix) - http://phoenix.craigslist.org/cph/cpg/3043067349.html

Any ideas on how to approach this would be greatly appreciated.

Answer 1

Cas*_*per 3

How about this?

require 'open-uri'
require 'nokogiri'

# URI.open is required on Ruby 3.0+, where Kernel#open no longer opens URLs
doc = Nokogiri::HTML(URI.open("http://phoenix.craigslist.org/cpg/"))

# Postings start inside the second blockquote on the page
bq = doc.xpath('//blockquote')[1]

date  = nil         # Temp store of date of postings
posts = Array.new   # Store array of all postings here

# Loop through all blockquote children collecting data as we go along...
bq.children.each { |nod|
  # The date is stored in the h4 nodes. Grab it from there.
  date = nod.text if nod.name == "h4"

  # Skip nodes until we have a date
  next if !date

  # Skip nodes that are not p blocks. The p blocks contain the postings.
  next if nod.name != "p"

  # We have a p block. Extract posting data.
  anchor = nod.at_css('a')
  next if anchor.nil?   # Skip p blocks that contain no link
  link = anchor['href']
  text = nod.text.strip

  # Add new posting to array
  posts << [date, text, link]
}

# Output everything we just collected
posts.each { |p| puts p.join(" - ") }
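
The loop above yields flat `[date, text, link]` triples. If the nested layout from the question is wanted instead, one possible reshaping step is sketched below. The helper name `to_nested` is hypothetical, and the sample data is copied from the question's example rather than fetched from the site.

```ruby
# Hypothetical helper: turns flat [date, text, link] triples into
# [[date], [text, link]] rows, so that row[0][0] is the date,
# row[1][0] the link text and row[1][1] the link href.
def to_nested(posts)
  posts.map { |date, text, link| [[date], [text, link]] }
end

sample_posts = [
  ["Mon May 28", "Leads need follow up - (Phoenix)",
   "http://phoenix.craigslist.org/wvl/cpg/3043296202.html"],
  ["Mon May 28", ".Net/Java Developers - (phoenix)",
   "http://phoenix.craigslist.org/cph/cpg/3043067349.html"]
]

nested = to_nested(sample_posts)
puts nested[0][0][0]  # date
puts nested[0][1][0]  # link text
puts nested[0][1][1]  # link href
```

Because every triple already carries its own date, the reshaping is a plain `map` and needs no further state.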
