How do I use Ruby's scan method to parse an HTML table?

0 html ruby regex

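I'm trying to take an HTML table and build an array of arrays, where each inner array is a row and each element of it is a cell. Assuming I can already split the whole table into rows, I want to split each row on its <td> tags. I have the following: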

def get_cells(one_row)
  cells = one_row.scan(/<td>.+?<\/td>/)
  for c in cells
    puts c
  end
end

Here is the HTML I'm working with, held in a string named one_row:

<tr>
<td>1990</td>
<td>1991</td>
<td><a href="/wiki/Gulf_War">Gulf War</a></td>
<td><span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/aa/Flag_of_Kuwait.svg/22px-Flag_of_Kuwait.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/Kuwait">Kuwait</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/a4/Flag_of_the_United_States.svg/22px-Flag_of_the_United_States.svg.png" width="22" height="12" class="thumbborder" />&#160;</span><a href="/wiki/United_States">United States</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Flag_of_Saudi_Arabia.svg/22px-Flag_of_Saudi_Arabia.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Saudi_Arabia">Saudi Arabia</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/Flag_of_the_United_Kingdom.svg/22px-Flag_of_the_United_Kingdom.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/United_Kingdom">United Kingdom</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Flag_of_Egypt.svg/22px-Flag_of_Egypt.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Egypt">Egypt</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Flag_of_France.svg/22px-Flag_of_France.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/France">France</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/5/53/Flag_of_Syria.svg/22px-Flag_of_Syria.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Syria">Syria</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/2/2c/Flag_of_Morocco.svg/22px-Flag_of_Morocco.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Morocco">Morocco</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Flag_of_Oman.svg/22px-Flag_of_Oman.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/Oman">Oman</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/3/32/Flag_of_Pakistan.svg/22px-Flag_of_Pakistan.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Pakistan">Pakistan</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Flag_of_Canada.svg/22px-Flag_of_Canada.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/Canada">Canada</a><br />
<a href="/wiki/Coalition_of_Gulf_War" title="Coalition of Gulf War" class="mw-redirect">Other Coalition Forces</a></td>
<td><span class="flagicon"><a href="/wiki/Iraq" title="Iraq"><img alt="Iraq" src="http://upload.wikimedia.org/wikipedia/commons/thumb/0/04/Flag_of_Iraq_%281963-1991%29.svg/22px-Flag_of_Iraq_%281963-1991%29.svg.png" width="22" height="15" class="thumbborder" /></a></span> <a href="/wiki/Baathist_Iraq" title="Baathist Iraq">Iraq</a></td>
</tr>

However, when I call get_cells on this, it does not return an array with five elements. It returns an array with four elements:

<td>1990</td>
<td>1991</td>
<td><a href="/wiki/Gulf_War">Gulf War</a></td>
<td><span class="flagicon"><a href="/wiki/Iraq" title="Iraq"><img alt="Iraq" src="http://upload.wikimedia.org/wikipedia/commons/thumb/0/04/Flag_of_Iraq_%281963-1991%29.svg/22px-Flag_of_Iraq_%281963-1991%29.svg.png" width="22" height="15" class="thumbborder" /></a></span> <a href="/wiki/Baathist_Iraq" title="Baathist Iraq">Iraq</a></td>

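It seems to be skipping what should be the fourth cell. That cell contains many elements, all separated by newlines. Could that be what's messing this up? Any suggestions on how to handle this?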

Chu*_*uck 5

HTML is beyond what regular expressions can reliably parse, and even in simple cases it's rarely worth the time. If you need to parse HTML, just use an HTML parser such as Hpricot or Nokogiri. For example, Nokogiri(text).css('td').count gives 5, and Nokogiri(text).css('td').map(&:text) gives ["1990", "1991", "Gulf War", " Kuwait  United States  Saudi Arabia  United Kingdom  Egypt  France  Syria  Morocco  Oman  Pakistan  Canada Other Coalition Forces", " Iraq"].
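As for why the fourth cell goes missing: in Ruby, `.` does not match a newline unless the regex has the /m flag, so a cell whose contents span several lines never matches /<td>.+?<\/td>/. Below is a minimal sketch of the difference, using only the standard library. ONE_ROW is a shortened stand-in for the row in the question, and get_cells_multiline is a hypothetical name used here for illustration. Even with /m this stays fragile (a <td> with attributes or a nested table would break it), so treat it as an explanation of the symptom rather than a recommended approach.

```ruby
# Shortened stand-in for the row in the question; the second cell spans two lines.
ONE_ROW = <<~HTML
  <tr>
  <td>1990</td>
  <td><a href="/wiki/Kuwait">Kuwait</a><br />
  <a href="/wiki/Egypt">Egypt</a></td>
  <td>Iraq</td>
  </tr>
HTML

# Pattern from the question: `.` stops at "\n", so the multi-line cell is skipped.
def get_cells(one_row)
  one_row.scan(/<td>.+?<\/td>/)
end

# Same pattern with /m, which lets `.` match newlines as well.
def get_cells_multiline(one_row)
  one_row.scan(/<td>.+?<\/td>/m)
end

puts get_cells(ONE_ROW).length            # prints 2
puts get_cells_multiline(ONE_ROW).length  # prints 3
```

The lazy `.+?` still matters with /m: it keeps each match from running past the first closing </td> into the next cell.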