JRB*_*JRB 2 c# screen-scraping html-parsing html-agility-pack
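As a novice programmer, I have read a lot of example code and tried mashing things together to learn what works. I am using the Html Agility Pack to try to scrape a news web page.
Problem: one of the nodes I test for does not use a static value; it uses the time the page was viewed. How can I apply this to the switch {case} approach? If I am off base with the whole approach, I am open to any suggestions as well.
Also note: I do not need to capture this node. If there is a way to skip it, that would work for me.
I decided to go with an example that uses a switch: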
var rows = doc.DocumentNode.SelectNodes(".//*[@id='weekdays']/tr");
foreach (var row in rows)
{
    var cells = row.SelectNodes("./td");
    string title = cells[0].InnerText;
    var valueRow = cells[2];
    switch (title)
    {
        case "Date":
            HtmlNode date = valueRow.SelectSingleNode("//*[starts-with(@id, 'detail_row_seek')]/td");
            Console.WriteLine("UPC=A:\t" + date.InnerText);
            break;
        case "": // problem spot: this header shows the current time, not a constant
            string Time = valueRow.InnerText;
            Console.WriteLine("Time:\t" + Time);
            break;
        case "News":
            string News = valueRow.InnerText;
            Console.WriteLine("News:\t" + News);
            break;
    }
}
HTML excerpt:
<table id="weekdays" cellpadding="6" cellspacing="0" border="0" width="100%">
<tr>
<td class="thead" style="border-bottom: 1px solid #d1d1e1;font-weight:normal; text-align: center; width:8%; padding-left: 6px;">Date</td>
<td class="thead" style="border-bottom: 1px solid #d1d1e1;font-weight:normal; width:8%; text-align: center; white-space:nowrap"><a href="guestcp.php?do=customoptions" title="Time & Date Options"><img style="position:relative; vertical-align: bottom;" src="images/misc/clock_small.gif" title="Time & Date Options" alt="Time & Date Options" border="0" /></a><a href="guestcp.php?do=customoptions" title="Time & Date Options"><span id="ff_nowtime_clock">3:20pm</span></a></td>
<td class="thead" style="border-bottom: 1px solid #d1d1e1;font-weight:normal; text-align: center; width:8%;">News</td>
.........
<tr id="detail_row_seek_37876">
<td id="toprow_9" class="alt1 espace" rowspan="3" style="vertical-align: top; text-align: center;" nowrap="nowrap">
<span class="smallfont">
<div>Sat</div>
Apr 9
</span>
</td>
<td class="alt1 espace" style="text-align: center;" nowrap="nowrap">
<span class="smallfont">Day 3</span>
</td>
<td class="alt1 espace" style="text-align: center;"><span class="smallfont">EUR</span></td>
<td class="alt1 espace" style="padding-top: 2px" align="center">
<a name="chart=37876" style="position:absolute; margin-top: -10px;"></a><a name="details=37876" style="position:absolute; margin-top: -10px;"></a>
<div class="cal_imp_medium" title="Medium Impact Expected"></div></td>
<td class="alt1 espace">
<div class="smallfont" id="title_37876" style="padding-left: 11px;">ECOFIN Meetings</div>
</td>
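The question is: the so-called Time column is not static; it actually shows a time value. Is there any way to use a wildcard in the case, or do a "contains", to get around this? Apologies for the very long-winded question.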
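Each case label in a switch statement has to be a constant value (at least before the pattern matching added in C# 7.0).
The only way I can think of to do what you are after is the default: case. Inside the default case you can test for the value you are looking for using Contains, Parse, or a Regex inside an if test.
I couldn't entirely follow your HTML sample (sorry!), but the modified C# might look like: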
switch (title)
{
    case "Date":
        HtmlNode date = valueRow.SelectSingleNode("//*[starts-with(@id, 'detail_row_seek')]/td");
        Console.WriteLine("UPC=A:\t" + date.InnerText);
        break;
    case "News":
        string News = valueRow.InnerText;
        Console.WriteLine("News:\t" + News);
        break;
    default:
        // regexTime is a System.Text.RegularExpressions.Regex assumed to be defined elsewhere.
        // Regex.Match returns a Match object, so use IsMatch for a bool test.
        if (regexTime.IsMatch(title))
        {
            string Time = valueRow.InnerText;
            Console.WriteLine("Time:\t" + Time);
        }
        break;
}
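For completeness, here is a minimal self-contained sketch of the default-plus-Regex idea that leaves Html Agility Pack out so it can be compiled on its own. The class name HeaderClassifier, the method Classify, and the time pattern are all assumptions made for illustration; the pattern is only guessed from the "3:20pm" text in the HTML excerpt and may need adjusting for the real page.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Html Agility Pack.
public static class HeaderClassifier
{
    // Assumed format: clock text such as "3:20pm" or "11:05 AM".
    private static readonly Regex TimePattern =
        new Regex(@"^\s*\d{1,2}:\d{2}\s*(am|pm)\s*$", RegexOptions.IgnoreCase);

    // Maps the text of a header cell to a column label.
    public static string Classify(string title)
    {
        switch (title)
        {
            case "Date":
                return "Date";
            case "News":
                return "News";
            default:
                // Anything that is not a known constant ends up here,
                // so the non-constant clock text is tested with the regex.
                bool looksLikeTime = title != null && TimePattern.IsMatch(title);
                return looksLikeTime ? "Time" : "Unknown";
        }
    }
}
```

With this in place, HeaderClassifier.Classify("3:20pm") gives "Time", and a caller that does not need the time column can simply skip any cell classified as "Time" or "Unknown".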
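If the project targets C# 7.0 or later, a case label can carry a when guard, which keeps the time check inside the switch itself. The sketch below assumes that compiler version; the class name GuardedHeaderClassifier and the time pattern are made up for illustration and are not taken from the question or from Html Agility Pack.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper showing a C# 7.0+ case guard.
public static class GuardedHeaderClassifier
{
    // Assumed format: clock text such as "3:20pm".
    private static readonly Regex TimePattern =
        new Regex(@"^\s*\d{1,2}:\d{2}\s*(am|pm)\s*$", RegexOptions.IgnoreCase);

    public static string Classify(string title)
    {
        switch (title)
        {
            case "Date":
                return "Date";
            case "News":
                return "News";
            // Type pattern plus guard: matches any non-null string
            // for which the regex test succeeds at run time.
            case string s when TimePattern.IsMatch(s):
                return "Time";
            default:
                return "Unknown";
        }
    }
}
```

The guard is evaluated at run time, so it behaves like the if test in a default branch; which form reads better is mostly a matter of taste.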