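I am developing a web crawler. Currently I scrape the entire page, then use regular expressions to remove the <meta>, <script>, <style>, and other tags and keep only the content of the body.
However, I am trying to optimize performance, and I was wondering whether there is a way to scrape only the <body> of the page?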
using System.IO;
using System.Net;

namespace WebScrapper
{
    public static class KrioScraper
    {
        // Downloads the page and returns its text with the markup stripped.
        public static string scrapeIt(string siteToScrape)
        {
            string HTML = getHTML(siteToScrape);
            string text = stripCode(HTML);
            return text;
        }

        // Fetches the raw HTML of the given URL as a string.
        public static string getHTML(string siteToScrape)
        {
            string response = "";
            HttpWebRequest objRequest =
                (HttpWebRequest) WebRequest.Create(siteToScrape);
            objRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; " +
                "Windows NT 5.1; .NET CLR 1.0.3705)";
            // Dispose the response as well as the reader so the connection is released.
            using (HttpWebResponse objResponse =
                (HttpWebResponse) objRequest.GetResponse())
            using (StreamReader sr =
                new StreamReader(objResponse.GetResponseStream()))
            {
                // The using block closes the reader, so no explicit Close() is needed.
                response = sr.ReadToEnd();
            } …
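
The stripCode helper called in scrapeIt is cut off in the excerpt. As a rough illustration of the regex-based stripping described at the top of the question, a minimal sketch might look like the following. StripSketch and StripToBodyText are hypothetical names chosen for this example, not the original code, and regex-based HTML handling is only an approximation that can fail on malformed or unusual markup.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripSketch
{
    // Hypothetical stand-in for the stripCode helper, which is not shown in the excerpt.
    public static string StripToBodyText(string html)
    {
        // Keep only what is inside <body>...</body>, if such a pair exists.
        Match body = Regex.Match(html, @"<body\b[^>]*>(.*?)</body>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        string text = body.Success ? body.Groups[1].Value : html;

        // Drop <script> and <style> elements together with their contents.
        text = Regex.Replace(text, @"<(script|style)\b[^>]*>.*?</\1>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Remove every remaining tag, then collapse runs of whitespace.
        text = Regex.Replace(text, @"<[^>]+>", " ");
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string sample =
            "<html><head><meta charset=\"utf-8\"><style>p{}</style></head>" +
            "<body><script>var x = 1;</script><p>Hello <b>world</b></p></body></html>";
        Console.WriteLine(StripToBodyText(sample)); // prints: Hello world
    }
}
```

Note that this still operates on the full downloaded string, so it only illustrates the current approach; it does not reduce how much of the page is fetched.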