gha*_*hal 0 ruby screen-scraping watir nokogiri page-object-gem
So, I have a table with multiple rows and columns.
<table>
<tr>
<th>Employee Name</th>
<th>Reg Hours</th>
<th>OT Hours</th>
</tr>
<tr>
<td>Employee 1</td>
<td>10</td>
<td>20</td>
</tr>
<tr>
<td>Employee 2</td>
<td>5</td>
<td>10</td>
</tr>
</table>
And there is another table:
<table>
<tr>
<th>Employee Name</th>
<th>Revenue</th>
</tr>
<tr>
<td>Employee 2</td>
<td>$10</td>
</tr>
<tr>
<td>Employee 1</td>
<td>$50</td>
</tr>
</table>
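Note that the employees may appear in a different order in each table.
How can I use nokogiri to create a JSON file with one object per employee, containing their total hours and revenue?
Currently, I can only grab individual table cells with some xpath. For example: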
puts page.xpath(".//*[@id='UC255_tblSummary']/tbody/tr[2]/td[1]/text()").inner_text
Edit:
Using the page-object gem and the link from @Dave_McNulla, I tried this code just to see what I would get:
class MyPage
include PageObject
table(:report, :id => 'UC255_tblSummary')
def get_some_information
report_element[1][2].text
end
end
puts get_some_information
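However, nothing was returned.
Data: https://gist.github.com/anonymous/d8cc0524160d7d03d37b
There is a duplicate of the hours table. The first one is fine. The other table I need is the accessory revenue table. (I will also need the activations table, but I will try to adapt the code that merges the hours and accessory revenue tables.)
I think the general approach is:
Create a hash for each table, keyed by employee
You can do this part in either Watir or Nokogiri. Using Nokogiri only makes sense if Watir performs poorly because the tables are large.
Watir: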
#I assume you would have a better way to identify the tables than by index
hours_table = browser.table(:index, 0)
wage_table = browser.table(:index, 1)
#Turn the tables into a hash
employee_hours = {}
hours_table.trs.drop(1).each do |tr|
tds = tr.tds
employee_hours[ tds[0].text ] = {"Reg Hours" => tds[1].text, "OT Hours" => tds[2].text}
end
#=> {"Employee 1"=>{"Reg Hours"=>"10", "OT Hours"=>"20"}, "Employee 2"=>{"Reg Hours"=>"5", "OT Hours"=>"10"}}
employee_wage = {}
wage_table.trs.drop(1).each do |tr|
tds = tr.tds
employee_wage[ tds[0].text ] = {"Revenue" => tds[1].text}
end
#=> {"Employee 2"=>{"Revenue"=>"$10"}, "Employee 1"=>{"Revenue"=>"$50"}}
Nokogiri:
require 'nokogiri'

# Parse the HTML of the page that Watir currently has loaded
page = Nokogiri::HTML.parse(browser.html)
hours_table = page.search('table')[0]
wage_table = page.search('table')[1]
employee_hours = {}
hours_table.search('tr').drop(1).each do |tr|
tds = tr.search('td')
employee_hours[ tds[0].text ] = {"Reg Hours" => tds[1].text, "OT Hours" => tds[2].text}
end
#=> {"Employee 1"=>{"Reg Hours"=>"10", "OT Hours"=>"20"}, "Employee 2"=>{"Reg Hours"=>"5", "OT Hours"=>"10"}}
employee_wage = {}
wage_table.search('tr').drop(1).each do |tr|
tds = tr.search('td')
employee_wage[ tds[0].text ] = {"Revenue" => tds[1].text}
end
#=> {"Employee 2"=>{"Revenue"=>"$10"}, "Employee 1"=>{"Revenue"=>"$50"}}
Merge the results of the two tables together
You want to merge the two hashes so that, for a given employee, the resulting hash includes both their hours and their revenue.
employee = employee_hours.merge(employee_wage){ |key, old, new| new.merge(old) }
#=> {"Employee 1"=>{"Revenue"=>"$50", "Reg Hours"=>"10", "OT Hours"=>"20"}, "Employee 2"=>{"Revenue"=>"$10", "Reg Hours"=>"5", "OT Hours"=>"10"}}
Convert to JSON
Based on this previous question, you can then convert the hash to JSON.
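The question also asks for total hours per employee. A possible sketch, starting from literal copies of the two hashes built earlier: it merges them by employee name and then adds a `"Total Hours"` field. That key name is an assumption, as is the premise that the hour cells contain plain numbers.

```ruby
employee_hours = {
  "Employee 1" => { "Reg Hours" => "10", "OT Hours" => "20" },
  "Employee 2" => { "Reg Hours" => "5",  "OT Hours" => "10" }
}
employee_wage = {
  "Employee 2" => { "Revenue" => "$10" },
  "Employee 1" => { "Revenue" => "$50" }
}

# Combine the per-table hashes by employee name. The block only runs for
# names present in both; a name found in one table keeps its fields as-is.
merged = employee_hours.merge(employee_wage) { |_name, hours, wage| hours.merge(wage) }

# Add a numeric total. A missing hours field is treated as zero.
totals = merged.transform_values do |fields|
  total = fields.fetch("Reg Hours", "0").to_f + fields.fetch("OT Hours", "0").to_f
  fields.merge("Total Hours" => total)
end

totals["Employee 1"]["Total Hours"] #=> 30.0
```

The order in which employees appear in each table does not matter here, since both hashes are looked up by name.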
require 'json'
employee.to_json
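Since the goal is a JSON file with one object per employee, a short sketch of that last step follows. The file name `employees.json` and the `"Employee Name"` key are assumptions chosen for illustration, and the `employee` hash is a literal copy of the merged result shown above.

```ruby
require 'json'

employee = {
  "Employee 1" => { "Revenue" => "$50", "Reg Hours" => "10", "OT Hours" => "20" },
  "Employee 2" => { "Revenue" => "$10", "Reg Hours" => "5", "OT Hours" => "10" }
}

# Turn the name-keyed hash into an array of objects, one per employee.
records = employee.map { |name, fields| { "Employee Name" => name }.merge(fields) }

# employees.json is a hypothetical output path.
File.write("employees.json", JSON.pretty_generate(records))
```

If an object keyed by employee name is preferred over an array, `JSON.pretty_generate(employee)` can be written to the file directly.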