解析图像网址nokogiri

Question

解析图像网址nokogiri

我需要解析HTML中的图像URL,如下所示:

<p><a href="http://blog.website.com/wp-content/uploads/2012/02/image_name.jpg" ><img class="aligncenter size-full wp-image-12313" alt="Example image Name" src="http://blog.website.com/wp-content/uploads/2012/02/image_name.jpg" width="630" height="119" /></a></p>

Run Code Online (Sandbox Code Playgroud)

到目前为止,我使用Nokogiri解析<h2>标签:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

page = Nokogiri::HTML(open("http://blog.website.com/"))
headers = page.css('h2')

puts headers.text

Run Code Online (Sandbox Code Playgroud)

我有两个问题:

我该如何解析图片网址？
理想情况下,我会以这种格式打印到控制台:

 1. 
Header 1
image_url 1
image_url 2 (if any)
 2. 
Header 2
2image_url 1
2image_url 2 (if any)

到目前为止,我还没能用这种漂亮的格式打印标题.我怎么能这样做？

<h2><a href="http://blog.website.com/2013/02/15/images/" rel="bookmark" title="Permanent Link to Blog Post">Blog Post</a></h2>
          <p class="post_author"><em>by</em> author</p>
          <div class="format_text">
    <p style="text-align: left;">Blog Content </p>
<p style="text-align: left;"> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p>
<p style="text-align: center;"><a href="http://blog.website.com/wp-content/uploads/2012/02/image21.jpg" ><img class="alignnone size-full wp-image-23382" alt="image2" src="http://blog.website.com/wp-content/uploads/2012/02/image21.jpg" width="630" height="210" /></a></p>
<p style="text-align: left;">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Items: <a href="http://www.website.com/threads?src=login#/show/thread/A_abvaf812e3"  target="_blank">Items for Spring</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">More Items: <a href="http://www.website.com/threads#/show/thread/A_abv2a6822e2"  target="_blank">Lorem Ipsum</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Still more items: <a href="http://www.website.com/threads#/show/thread/A_abv7af882e3"  target="_blank">Items:</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Lorem ipsum: <a href="http://www.website.com/threads?src=login#/show/thread/A_abvea6832e8"  target="_blank">Items</a></b></p>
<p style="text-align: center;">Lorem Ipusm</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">
        </div>  
          <p class="to_comments"><span class="date">February 15, 2013</span> &nbsp; <span class="num_comments"><a href="http://blog.website.com/2013/02/15/Blog-post/#respond" title="Comment on Blog Post">No Comments</a></span></p>

Run Code Online (Sandbox Code Playgroud)

Answer 1

pgu*_*rio 6

我认为首先按h2分组更有意义:

doc.search('h2').each_with_index do |h2, i|
  puts "#{i+1}."
  puts h2.text
  h2.search('+ p + div > p[3] img').each do |img|
    puts img['src']
  end
end

Run Code Online (Sandbox Code Playgroud)

Answer 2

Mar*_*mas 5

要获取图像,只需查找img带有src属性的标记即可.

如果您想要h2与每个图像关联,您可以这样做:

doc.xpath('//img').each do |img|
  puts "Header: #{img.xpath('preceding::h2[1]').text}"
  puts "  Image: #{img['src']}"
end

Run Code Online (Sandbox Code Playgroud)

请注意,切换到XPath是为了preceding::轴.

编辑

要按标题分组,您可以将它们放在哈希中:

headers = Hash.new{|h,k| h[k] = []}
doc.xpath('//img').each do |img|
  header = img.xpath('preceding::h2[1]').text
  image = img['src']
  headers[header] << image
end

Run Code Online (Sandbox Code Playgroud)

为了得到你规定的输出:

headers.each do |h,urls|
  puts "#{h} #{urls.join(' ')}"
end

Run Code Online (Sandbox Code Playgroud)

归档时间：	12 年，9 月前
查看次数：	6715 次
最近记录：	12 年前