Simple web crawler in C#

Kha*_*med 10 c# web-crawler

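I have created a simple web crawler, but I want to add a recursive function so that every page that is opened has its own URLs extracted as well, and I don't know how to do that. I would also like to use threads to make it faster. Here is my code: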

using System;
using System.IO;
using System.Net;
using System.Windows.Forms;
using mshtml; // needs a COM reference to the Microsoft HTML Object Library

namespace Crawler
{
    public partial class Form1 : Form
    {
        String Rstring;

        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            String URL = textBox1.Text;

            WebRequest myWebRequest = WebRequest.Create(URL);

            // The using blocks close the response, stream and reader even if an exception is thrown
            using (WebResponse myWebResponse = myWebRequest.GetResponse())     // response from the Internet resource
            using (Stream streamResponse = myWebResponse.GetResponseStream()) // data stream of the response body
            using (StreamReader sreader = new StreamReader(streamResponse))   // reads the data stream
            {
                Rstring = sreader.ReadToEnd(); // reads it to the end
            }

            String Links = GetContent(Rstring); // gets the links only

            textBox2.Text = Rstring;
            textBox3.Text = Links;
        }

        private String GetContent(String Rstring)
        {
            String sString = "";
            HTMLDocument d = new HTMLDocument();
            IHTMLDocument2 doc = (IHTMLDocument2)d;
            doc.write(Rstring);

            IHTMLElementCollection L = doc.links;

            foreach (IHTMLElement links in L)
            {
                sString += links.getAttribute("href", 0);
                sString += Environment.NewLine; // was "/n", which is not a newline escape
            }
            return sString;
        }
    }
}

Dar*_*kas 8

I fixed your GetContent method as follows, so that it returns the new links found in a crawled page:

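// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public ISet<string> GetNewLinks(string content)
{
    // Matches the href value only when href is the first attribute of an <a> tag
    Regex regexLink = new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");

    ISet<string> newLinks = new HashSet<string>();
    foreach (Match match in regexLink.Matches(content))
    {
        // HashSet.Add ignores duplicates, so no Contains check is needed
        newLinks.Add(match.Value);
    }

    return newLinks;
}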

Update

Fixed: the regex variable should be regexLink. Thanks to @shashlearner for pointing this out (my mistake).
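
The question also asks how to make the crawl recursive. One way, sketched here under assumptions rather than as a definitive implementation, is to wrap the link extraction in a method that calls itself for every link it finds, with a set of visited pages to stop cycles and a depth limit to keep the crawl bounded. `CrawlSketch`, `Crawl`, `Visit`, `maxDepth` and the `fetch` delegate are hypothetical names that are not in the original code; the regex is the one from GetNewLinks above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CrawlSketch
{
    // Same pattern as in GetNewLinks.
    private static readonly Regex LinkRegex =
        new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");

    // Depth-first crawl starting at 'start'. 'fetch' is a hypothetical delegate
    // that returns the HTML of a page, or null when it cannot be downloaded.
    public static ISet<Uri> Crawl(Uri start, int maxDepth, Func<Uri, string> fetch)
    {
        var visited = new HashSet<Uri>();
        Visit(start, maxDepth, fetch, visited);
        return visited;
    }

    private static void Visit(Uri page, int depthLeft, Func<Uri, string> fetch, ISet<Uri> visited)
    {
        // Add returns false when the page was already seen, which stops cycles.
        if (!visited.Add(page)) return;

        // Pages reached at the depth limit are recorded but not expanded.
        if (depthLeft == 0) return;

        string html = fetch(page);
        if (html == null) return;

        foreach (Match match in LinkRegex.Matches(html))
        {
            Uri next;
            // Resolve relative hrefs such as "page2.html" against the current page
            // and skip anything that is not http or https (mailto:, javascript:, ...).
            if (Uri.TryCreate(page, match.Value, out next)
                && (next.Scheme == Uri.UriSchemeHttp || next.Scheme == Uri.UriSchemeHttps))
            {
                Visit(next, depthLeft - 1, fetch, visited);
            }
        }
    }
}
```

Passing the download step in as a delegate keeps the sketch independent of WebRequest, so the code from the question could be plugged in as `fetch`. Because the traversal is depth-first, a page first reached at the depth limit is not expanded again if a shorter path to it turns up later; a queue-based breadth-first loop would avoid that.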


Mis*_*hex 8

I created something similar using Reactive Extensions.

https://github.com/Misterhex/WebCrawler

I hope it helps you.

Crawler crawler = new Crawler();

IObservable<Uri> observable = crawler.Crawl(new Uri("http://www.codinghorror.com/"));

observable.Subscribe(onNext: Console.WriteLine,
    onCompleted: () => Console.WriteLine("Crawling completed"));

  • Wow! That is some really simple syntax. Is it multithreaded? Either way, it is very easy to digest and looks a lot like JavaScript. (2 upvotes)
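
On the threading part of the question: independent of the library above, one common way to speed up a crawler is to download several pages concurrently with async/await and a cap on the number of requests in flight, instead of managing threads by hand. The following is a minimal sketch under that assumption. `ParallelFetchSketch`, `FetchAllAsync`, `maxParallel` and the `fetchAsync` delegate are hypothetical names that are not in any of the posts.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelFetchSketch
{
    // Downloads every distinct url with at most maxParallel downloads in flight.
    // fetchAsync is a hypothetical delegate; with an HttpClient instance it
    // could be: url => client.GetStringAsync(url)
    public static async Task<IDictionary<Uri, string>> FetchAllAsync(
        IEnumerable<Uri> urls, int maxParallel, Func<Uri, Task<string>> fetchAsync)
    {
        using (var gate = new SemaphoreSlim(maxParallel))
        {
            var tasks = urls.Distinct().Select(async url =>
            {
                await gate.WaitAsync(); // wait for a free slot
                try
                {
                    return new KeyValuePair<Uri, string>(url, await fetchAsync(url));
                }
                catch (Exception)
                {
                    // A failed download is recorded as null instead of aborting the batch.
                    return new KeyValuePair<Uri, string>(url, null);
                }
                finally
                {
                    gate.Release(); // free the slot for the next download
                }
            }).ToList();

            var results = await Task.WhenAll(tasks);
            return results.ToDictionary(r => r.Key, r => r.Value);
        }
    }
}
```

A crawler could call this once per depth level: fetch all links found at the current level, extract the links from the returned pages, drop the ones already visited, and repeat. The semaphore keeps the crawler from opening an unbounded number of connections to the same site.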