Download an entire website with C#

Kar*_*hik 7 c# screen-scraping screen download web

Please forgive my ignorance on this subject.

I am using

    string p = "http://" + Textbox2.Text;
    string r = textBox3.Text;
    System.Net.WebClient webclient = new System.Net.WebClient();
    webclient.DownloadFile(p, r);

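to download a web page. Can you help me enhance the code so that it downloads an entire website? I tried HTML screen scraping, but it only returns the href links from the index.html file. How should I proceed?

Thanks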

Wil*_*ill 10

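Scraping a website is actually a lot of work, with a lot of corner cases.

Invoke wget instead. The manual explains how to use the "recursive retrieval" options.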
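
As a rough illustration of calling wget from C#, here is a minimal sketch. It assumes a GNU wget executable is installed and on the PATH. The class name `WgetMirror`, the method names, and the default depth of 5 are hypothetical choices for this example, not part of the original answer. The flags themselves are standard wget options for recursive retrieval.

```csharp
using System;
using System.Diagnostics;

public static class WgetMirror
{
    // Builds the wget argument string for a recursive download.
    // The flags are standard GNU wget options; the default depth is an assumption.
    public static string BuildArguments(string url, string outputDir, int depth = 5)
    {
        return "--recursive --level=" + depth +
               " --no-parent --page-requisites --convert-links" +
               " --directory-prefix=\"" + outputDir + "\" \"" + url + "\"";
    }

    // Runs wget and returns its exit code (0 means success).
    // Assumes a wget executable is installed and on the PATH.
    public static int Run(string url, string outputDir, int depth = 5)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "wget",
            Arguments = BuildArguments(url, outputDir, depth),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

With the question's variables, this could be called as `WgetMirror.Run(p, r)`, where `r` would then name an output directory rather than a single file. `--page-requisites` also fetches images and stylesheets, and `--convert-links` rewrites links so the local copy can be browsed offline.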


Jas*_*son 9

    // Requires: using System.IO; using System.Net;
    protected string GetWebString(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

        // The using blocks dispose the response and reader even if reading throws.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

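This reads the web page at the supplied URL into a string.

You can then use a regex to parse the string.

This little snippet grabs specific links from craigslist and adds them to an ArrayList... modify it for your purposes.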

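    // Requires: using System.Collections; using System.Text.RegularExpressions;
    // Note: the pages parameter is unused here; only the first listing page is fetched.
    protected ArrayList GetListings(int pages)
    {
        ArrayList list = new ArrayList();
        string page = GetWebString("http://albany.craigslist.org/bik/");

        // LINK captures the relative URL of a listing; TITLE captures its link text.
        MatchCollection listingMatches = Regex.Matches(page, "(<p><a href=\")(?<LINK>/.+/.+[.]html)(\">)(?<TITLE>.*)(-</a>)");
        foreach (Match m in listingMatches)
        {
            // Value is already a string, so no ToString() call is needed.
            list.Add("http://albany.craigslist.org" + m.Groups["LINK"].Value);
        }
        return list;
    }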
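
To go from one page to a whole site, the same fetch-then-regex idea can be repeated for every link that stays on the starting host. The following is only a minimal sketch under several assumptions. The class `SiteCrawler`, its method names, the `page0.html`, `page1.html` file naming, and the `maxPages` cap are all hypothetical. The href pattern is deliberately naive and only handles quoted attribute values. Everything is downloaded as text, so images and other binary files are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

public static class SiteCrawler
{
    // Naive href matcher; it only handles quoted values and drops #fragments.
    private static readonly Regex HrefPattern =
        new Regex("href\\s*=\\s*[\"'](?<LINK>[^\"'#]+)", RegexOptions.IgnoreCase);

    // Returns the absolute http(s) links in the HTML that stay on the page's host.
    public static List<Uri> ExtractLinks(string html, Uri pageUri)
    {
        var links = new List<Uri>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            Uri absolute;
            // Resolve relative links against the page they appeared on.
            if (Uri.TryCreate(pageUri, m.Groups["LINK"].Value, out absolute)
                && (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps)
                && absolute.Host == pageUri.Host
                && !links.Contains(absolute))
            {
                links.Add(absolute);
            }
        }
        return links;
    }

    // Breadth-first crawl that saves each fetched page as page0.html, page1.html, ...
    // Returns the number of pages saved.
    public static int Crawl(Uri start, string outputDir, int maxPages)
    {
        Directory.CreateDirectory(outputDir);
        var seen = new HashSet<Uri> { start };
        var queue = new Queue<Uri>();
        queue.Enqueue(start);
        int saved = 0;

        using (var client = new WebClient())
        {
            while (queue.Count > 0 && saved < maxPages)
            {
                Uri current = queue.Dequeue();
                string html;
                try { html = client.DownloadString(current); }
                catch (WebException) { continue; } // skip pages that fail to load

                File.WriteAllText(Path.Combine(outputDir, "page" + saved + ".html"), html);
                saved++;

                foreach (Uri link in ExtractLinks(html, current))
                {
                    // HashSet.Add returns false for URLs that were already queued.
                    if (seen.Add(link)) queue.Enqueue(link);
                }
            }
        }
        return saved;
    }
}
```

A call such as `SiteCrawler.Crawl(new Uri("http://albany.craigslist.org/bik/"), "site", 50)` would save at most 50 pages into a `site` folder. The `seen` set is what keeps the crawl from looping when pages link back to each other, and the `maxPages` cap bounds the run on large sites.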