p.c*_*ell 8 c# regex linq parsing linq-to-xml
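I'm trying to parse an HTML document and extract some elements (any links to text files).
The current strategy is to load the HTML document into a string, then find all instances of links to text files. It could be any file type, but for this question it is a text file.
The end goal is to have an IEnumerable list of string objects. That part is easy; parsing the data is the problem.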
<html>
<head><title>Blah</title>
</head>
<body>
<br/>
<div>Here is your first text file: <a href="http://myServer.com/blah.txt"></div>
<span>Here is your second text file: <a href="http://myServer.com/blarg2.txt"></span>
<div>Here is your third text file: <a href="http://myServer.com/bat.txt"></div>
<div>Here is your fourth text file: <a href="http://myServer.com/somefile.txt"></div>
<div>Thanks for visiting!</div>
</body>
</html>
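The initial approaches were:
Look for strings that start with href= and end with .txt. The question is:
Here is a C# console application that uses the regex Jeff suggested. It reads the string fine and does not include any href that does not end in .txt. For the given sample, it correctly excludes the .txt.snarg file from the results (as supplied in the BuildHtmlString function).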
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;

namespace ParsePageLinks
{
    class Program
    {
        static void Main(string[] args)
        {
            GetAllLinksFromStringByRegex();
        }

        static List<string> GetAllLinksFromStringByRegex()
        {
            string myHtmlString = BuildHtmlString();
            // Matches href="..." only when the value ends in .txt right before the closing quote.
            string txtFileExp = "href=\"([^\\\"]*\\.txt)\"";
            List<string> foundTextFiles = new List<string>();
            MatchCollection textFileLinkMatches = Regex.Matches(myHtmlString, txtFileExp, RegexOptions.IgnoreCase);
            foreach (Match m in textFileLinkMatches)
            {
                foundTextFiles.Add(m.Groups[1].ToString()); // this is your captured group
            }
            return foundTextFiles;
        }

        static string BuildHtmlString()
        {
            return new StringReader(@"<html><head><title>Blah</title></head><body><br/>
<div>Here is your first text file: <a href=""http://myServer.com/blah.txt""></div>
<span>Here is your second text file: <a href=""http://myServer.com/blarg2.txt""></span>
<div>Here is your third text file: <a href=""http://myServer.com/bat.txt.snarg""></div>
<div>Here is your fourth text file: <a href=""http://myServer.com/somefile.txt""></div>
<div>Thanks for visiting!</div></body></html>").ReadToEnd();
        }
    }
}
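Since the stated end goal is an IEnumerable of strings, the same regex can also feed a LINQ projection instead of a hand-built list. This is only a minimal sketch: the class name TxtLinkFinder and the helper FindTxtLinks are hypothetical names made up for illustration, and the sample input is a shortened version of the HTML above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TxtLinkFinder
{
    // Same pattern as in the console application: capture the href value
    // only when it ends in .txt immediately before the closing quote.
    private static readonly Regex TxtHref =
        new Regex("href=\"([^\"]*\\.txt)\"", RegexOptions.IgnoreCase);

    // Hypothetical helper (not from the original post): projects each match
    // to the text of capture group 1.
    public static IEnumerable<string> FindTxtLinks(string html)
    {
        return TxtHref.Matches(html)
                      .Cast<Match>()
                      .Select(m => m.Groups[1].Value);
    }

    public static void Main()
    {
        string html =
            "<div><a href=\"http://myServer.com/blah.txt\"></div>" +
            "<div><a href=\"http://myServer.com/bat.txt.snarg\"></div>";

        // Only the blah.txt link is printed; the .txt.snarg link is skipped.
        foreach (string link in FindTxtLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

Cast<Match>() is used because MatchCollection only implements the non-generic IEnumerable on older framework versions.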
Mat*_*hen 13
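Neither. Load it into an (X/HT)MLDocument and use XPath, which is a standard and very powerful way of manipulating XML. The functions to look at are SelectNodes and SelectSingleNode.
Since you are apparently using HTML (not XHTML), you should use HTML Agility Pack. Most of its methods and properties match the corresponding XML classes.
Sample implementation using XPath: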
HtmlDocument doc = new HtmlDocument();
doc.Load(new StringReader(@"<html>
<head><title>Blah</title>
</head>
<body>
<br/>
<div>Here is your first text file: <a href=""http://myServer.com/blah.txt""></div>
<span>Here is your second text file: <a href=""http://myServer.com/blarg2.txt""></span>
<div>Here is your third text file: <a href=""http://myServer.com/bat.txt""></div>
<div>Here is your fourth text file: <a href=""http://myServer.com/somefile.txt""></div>
<div>Thanks for visiting!</div>
</body>
</html>"));
HtmlNode root = doc.DocumentNode;
// 3 = ".txt".Length - 1. See http://stackoverflow.com/questions/402211/how-to-use-xpath-function-in-a-xpathexpression-instance-programatically
HtmlNodeCollection links = root.SelectNodes("//a[@href['.txt' = substring(., string-length(.)- 3)]]");
IList<string> fileStrings;
if (links != null)
{
    fileStrings = new List<string>(links.Count);
    foreach (HtmlNode link in links)
    {
        fileStrings.Add(link.GetAttributeValue("href", null));
    }
}
else
{
    fileStrings = new List<string>(0);
}
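The "ends with .txt" test in the XPath expression can be tried without any third-party library, using System.Xml.XmlDocument and its SelectNodes method. The following is a rough sketch under stated assumptions: the input must be well-formed XML with no default namespace (so the anchors are self-closed here, unlike in the question's HTML), and the class name XPathTxtLinks and helper FindTxtLinks are hypothetical. It writes the same check in an equivalent form that applies substring to @href directly.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XPathTxtLinks
{
    // Hypothetical helper: returns href values ending in ".txt".
    // Assumes well-formed XML without a default namespace.
    public static List<string> FindTxtLinks(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // XPath string positions start at 1, so substring(s, string-length(s) - 3)
        // yields the last four characters of s.
        XmlNodeList links = doc.SelectNodes(
            "//a[substring(@href, string-length(@href) - 3) = '.txt']");

        List<string> result = new List<string>();
        foreach (XmlNode link in links)
        {
            result.Add(link.Attributes["href"].Value);
        }
        return result;
    }

    public static void Main()
    {
        string xhtml =
            "<html><body>" +
            "<div><a href=\"http://myServer.com/blah.txt\" /></div>" +
            "<div><a href=\"http://myServer.com/bat.txt.snarg\" /></div>" +
            "</body></html>";

        // Only the blah.txt link is printed.
        foreach (string href in FindTxtLinks(xhtml))
        {
            Console.WriteLine(href);
        }
    }
}
```

Real XHTML usually declares a default namespace, in which case the query would need an XmlNamespaceManager and a prefixed element name; for ordinary, non-well-formed HTML the HTML Agility Pack approach above remains the better fit.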
I would recommend regex. Why?
Regexes are not hard to read as long as you know how to write them.
Use this as the regex:
href="([^"]*\.txt)"
Explanation:
As an escaped C# string, it looks like this:
string txtExp = "href=\"([^\\\"]*\\.txt)\"";
Then you can iterate over your matches:
MatchCollection txtMatches = Regex.Matches(input, txtExp, RegexOptions.IgnoreCase);
foreach (Match m in txtMatches) {
    string filename = m.Groups[1].Value; // this is your captured group
}
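For readers less used to regex syntax, the same pattern can be written with inline comments by combining a verbatim string with RegexOptions.IgnorePatternWhitespace. This is an illustrative sketch only: the class name AnnotatedTxtRegex and the sample input are assumptions, and the annotations are one reading of the pattern rather than the answer's original explanation.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnnotatedTxtRegex
{
    // In a verbatim string, "" is one literal quote. With IgnorePatternWhitespace,
    // unescaped whitespace is ignored and # starts a comment that runs to end of line.
    public const string Pattern = @"
        href=""         # the attribute name, an equals sign and the opening quote
        ( [^""]*        # group 1: any run of characters that are not a quote
          \.txt )       # followed by a literal .txt
        ""              # the closing quote, so .txt must end the attribute value
    ";

    public static readonly Regex TxtHref = new Regex(
        Pattern, RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    public static void Main()
    {
        string input = "<a href=\"http://myServer.com/somefile.txt\">";
        foreach (Match m in TxtHref.Matches(input))
        {
            // Group 1 holds the URL without the surrounding quotes.
            Console.WriteLine(m.Groups[1].Value);
        }
    }
}
```

Requiring the closing quote right after .txt is what keeps a value such as bat.txt.snarg from matching.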