Parsing RSS feeds has recently started throwing a Document Type Definition (DTD) error

Rez*_*ian 5 c# xml rss parsing syndication-feed

This is an error that recently started plaguing my RSS feed parser. This morning four of my RSS feeds started throwing this exception:

For security reasons DTD is prohibited in this XML document. To enable DTD processing set the DtdProcessing property on XmlReaderSettings to Parse and pass the settings into XmlReader.Create method.

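The code used to work fine, but I believe these four particular RSS feeds have changed in a way that causes the problem. Perhaps they now use a DTD where they did not before, or there was some kind of schema change that my SyndicationFeed cannot parse.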

So I changed the code to

string url = RssFeed.AbsoluteUri;
XmlReaderSettings st = new XmlReaderSettings();

st.DtdProcessing = DtdProcessing.Parse;
st.ValidationType = ValidationType.DTD;

XmlReader reader = XmlReader.Create(url,st);

SyndicationFeed feed = SyndicationFeed.Load(reader);

reader.Close();

Then I started getting this error:

The 'html' element is not declared.
   at System.Xml.XmlValidatingReaderImpl.ValidationEventHandling.System.Xml.IValidationEventHandling.SendEvent(Exception exception, XmlSeverityType severity)
   at System.Xml.Schema.BaseValidator.SendValidationEvent(String code, String arg)
   at System.Xml.Schema.DtdValidator.ProcessElement()
   at System.Xml.Schema.DtdValidator.ValidateElement()
   at System.Xml.Schema.DtdValidator.Validate()
   at System.Xml.XmlValidatingReaderImpl.ProcessCoreReaderEvent()
   at System.Xml.XmlValidatingReaderImpl.Read()
   at System.Xml.XmlReader.MoveToContent()
   at System.Xml.XmlReader.IsStartElement(String localname, String ns)
   at System.ServiceModel.Syndication.Atom10FeedFormatter.CanRead(XmlReader reader)
   at System.ServiceModel.Syndication.SyndicationFeed.Load[TSyndicationFeed](XmlReader reader)
   at System.ServiceModel.Syndication.SyndicationFeed.Load(XmlReader reader)

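I have no idea where this 'html' element comes from, because neither the feed (http://jobs.huskyenergy.com/RSS) nor any visible DTD definition mentions it. I also tried setting DtdProcessing to DtdProcessing.Ignore, but that leads to the following error: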

The element with name 'html' and namespace '' is not an allowed feed format.

This is even more confusing, because the namespace is blank and I have no idea where this godforsaken html element is coming from.

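I am very close to writing my own XML reader and scrapping SyndicationFeed, but I want to make sure I have exhausted every possible solution before going down that road.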

In case it helps, here is one of the RSS feeds: http://jobs.huskyenergy.com/RSS

pas*_*sty 3

Here is a solution that delivers a new, populated SyndicationFeed object from the given RSS URL:

using System;
using System.IO;
using System.Linq;
using System.Net;
using System.ServiceModel.Syndication;
using System.Xml.Linq;

var feedUrl = @"http://jobs.huskyenergy.com/RSS";
try
{
    using (var webClient = new WebClient())
    {
        // pose as a browser ;-)
        webClient.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
        // fetch the feed as a string
        string rssFeedAsString;
        using (var content = webClient.OpenRead(feedUrl))
        using (var contentReader = new StreamReader(content))
        {
            rssFeedAsString = contentReader.ReadToEnd();
        }
        // parse the string with LINQ to XML and hand the resulting XmlReader to SyndicationFeed
        var feed = SyndicationFeed.Load(XDocument.Parse(rssFeedAsString).CreateReader());
        // take the info from the first feed entry; FirstOrDefault returns null for an empty sequence
        var firstFeedItem = feed.Items.FirstOrDefault();
        if (firstFeedItem != null)
        {
            Console.WriteLine(firstFeedItem.Title.Text);
            var firstLink = firstFeedItem.Links.FirstOrDefault();
            if (firstLink != null)
            {
                Console.WriteLine(firstLink.Uri.AbsoluteUri);
            }
        }
    }
}
catch (Exception exception)
{
    Console.WriteLine(exception.Message);
}

The site apparently only serves calls from "browsers", so the code disguises the call as one. The result is:

Summer Student UEO Regulatory & Environment Strategy - (Calgary, AB)
http://jobs.huskyenergy.com/ca/alberta/student/jobid4444904-summer-student-ueo-regulatory--environment-strategy-jobs

WebClient also supports asynchronous operations via events and tasks, so making the reader non-blocking is no problem.
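A task-based variant of the download could look roughly like the sketch below. The class and method names (AsyncFeedLoader, DownloadFeedTextAsync, ParseFeed, LoadFeedAsync) are made up for illustration and are not part of any library; WebClient.DownloadStringTaskAsync is the real task-based call. On .NET Core and later, SyndicationFeed lives in the separate System.ServiceModel.Syndication package, so treat this as a sketch under those assumptions rather than a drop-in replacement.

```csharp
using System;
using System.Net;
using System.ServiceModel.Syndication;
using System.Threading.Tasks;
using System.Xml.Linq;

// Hypothetical helper class, named for illustration only.
public static class AsyncFeedLoader
{
    // Downloads the raw feed text without blocking the calling thread.
    public static async Task<string> DownloadFeedTextAsync(string feedUrl)
    {
        using (var webClient = new WebClient())
        {
            // Same browser-like user agent as in the synchronous version.
            webClient.Headers.Add("user-agent",
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
            return await webClient.DownloadStringTaskAsync(feedUrl);
        }
    }

    // Parses already-downloaded text; kept separate so it can be used without a network.
    public static SyndicationFeed ParseFeed(string feedText)
    {
        return SyndicationFeed.Load(XDocument.Parse(feedText).CreateReader());
    }

    // Combines both steps: download asynchronously, then parse.
    public static async Task<SyndicationFeed> LoadFeedAsync(string feedUrl)
    {
        var feedText = await DownloadFeedTextAsync(feedUrl);
        return ParseFeed(feedText);
    }
}
```

From an async method it would be called as `var feed = await AsyncFeedLoader.LoadFeedAsync("http://jobs.huskyenergy.com/RSS");`. Splitting download and parsing is just a design choice that makes the parsing half easy to exercise with a local string.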


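The explanation of the html problem is as follows: the site changed something and/or they somehow no longer allow automated feed access. The html element comes from a service interruption message. I tried to access the service (using LINQ to XML and LINQPad, so don't be puzzled by the Dump function):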

var feedUrl = @"http://jobs.huskyenergy.com/RSS";
var feedContent = XDocument.Load(feedUrl);
feedContent.Dump();
//var feed = SyndicationFeed.Load(feedContent.CreateReader());
//feed.Dump();

and got this response:

<!DOCTYPE html []>
<!--[if IE 7]><html lang="en" prefix="og: http://ogp.me/ns#" class="non-js lt-ie9 lt-ie8"><![endif]-->
<!--[if IE 8]><html lang="en" prefix="og: http://ogp.me/ns#" class="non-js lt-ie9"><![endif]-->
<!--[if gt IE 8]><!-->
<html lang="en" prefix="og: http://ogp.me/ns#" class="non-js">
  <!--<![endif]-->
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width" />
    <title>
    Service Interruption
</title>
    <link rel="stylesheet" href="http://seostatic.tmp.com/SiteOutage/style.css" />
  </head>
  <body>
    <p id="outageMessage">This system is currently experiencing a service interruption. <br />We apologize for any inconvenience.</p>
  </body>
</html>

So that is where the html element comes from. :-) The site looks fine when opened in a browser, which suggests that XmlReader and LINQ to XML themselves are working correctly.
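To get a clearer error than "'html' is not an allowed feed format" when a server answers with an HTML page instead of a feed, one option is to look at the root element before calling SyndicationFeed.Load. The sketch below is only an illustration: FeedSniffer and LooksLikeFeed are hypothetical names, and the check assumes that only RSS 2.0 (root element rss) and Atom 1.0 (root element feed) are of interest, since those are the formats SyndicationFeed reads.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, named for illustration only.
public static class FeedSniffer
{
    // Returns true when the content parses as XML and its root element is
    // <rss> (RSS 2.0) or <feed> (Atom 1.0). rootName receives the local name
    // of the root element, or null when the content is not well-formed XML.
    public static bool LooksLikeFeed(string content, out string rootName)
    {
        rootName = null;
        try
        {
            // Parse throws XmlException for malformed input, so Root is non-null here.
            rootName = XDocument.Parse(content).Root.Name.LocalName;
        }
        catch (XmlException)
        {
            return false;
        }
        return string.Equals(rootName, "rss", StringComparison.Ordinal)
            || string.Equals(rootName, "feed", StringComparison.Ordinal);
    }
}
```

With the downloaded text in rssFeedAsString, a call such as `FeedSniffer.LooksLikeFeed(rssFeedAsString, out var root)` would return false with root set to "html" for the service interruption page shown above, which can then be logged instead of being passed on to SyndicationFeed.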