A C# web crawler with page details

It’s been a while since we got back to basics here. While reading up on meta tags, I came across a post by R. Reid about the robots.txt convention, along with a simple class to parse the file.

Normally I would just take a peek at the source and move on, but since this place has been bland lately, I thought I’d write up a quick crawler. I didn’t just want a class that would pull a list of links off a page and move on to crawl each one; I wanted something that would put the grabbed data into a useful form, maybe an object that can be stored later. So here’s the start of a basic console application to do just that.

I was almost tempted to call this “theCrawler” in keeping with my previous project “theForum”, but I thought I might as well get fancy. So here we have the Entry class for “TauCephei” (yes, I just put that name together in a few seconds).

Update:
Um… If someone suddenly found a spike in traffic from a no-name bot a little while ago… Well, now you know why.

Warning: I finished this in just under a couple of hours, so you’ll want to do a LOT of bug testing, scrubbing and polishing before using this in an actual project.

THE SOFTWARE IS PROVIDED “AS IS” AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace TauCephei.Crawler
{
    public class Entry
    {
        public string Id { get; set; }
        public string Url { get; set; }

        // Meta and description information
        public string Title { get; set; }
        public string Description { get; set; }
        public string Abstract { get; set; }

        public string Author { get; set; }
        public string Copyright { get; set; }

        public Dictionary<string, string> Meta { get; set; }
        public string Encoding { get; set; }

        public List<string> Keywords { get; set; }


        // Internal statistics
        public DateTime LastCrawl { get; set; }
        public float Rank { get; set; }

        // Link data
        public List<string> LocalLinks { get; set; }
        public List<string> ExternalLinks { get; set; }
        

        // Raw HTML and stripped plain-text versions of the page body
        public string BodyHtml { get; set; }
        public string BodyText { get; set; }

        // Constructor
        public Entry()
        {
            this.Meta = new Dictionary<string, string>();

            this.Keywords = new List<string>();
            this.LocalLinks = new List<string>();
            this.ExternalLinks = new List<string>();
        }
    }
}

The Entry class is the object equivalent of a web document, holding just the details needed for sorting and retrieval. One page = one entry.
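
For instance, a quick hypothetical snippet populating one by hand (the real crawler will fill these fields from a downloaded page instead) might look like this:

Entry entry = new Entry
{
    Id = Guid.NewGuid().ToString(),
    Url = "http://example.com/",
    Title = "Example Domain",
    LastCrawl = DateTime.UtcNow
};

// The lists and meta dictionary are already initialized by the constructor
entry.Keywords.Add("example");
entry.LocalLinks.Add("http://example.com/about");
entry.Meta.Add("description", "An example page");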

But of course, we’ll need to extract the content of a webpage in a meaningful way. Traditionally, this was done with a whole heap of regular expressions in your classes, but I opted to use the ready-made HtmlAgilityPack instead. Just download the source and copy the needed code files into your project. I created a folder called /Lib in my project and a subfolder called /Helpers; the HtmlAgilityPack files went in their own folder inside.
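
If you haven’t used HtmlAgilityPack before, the short version is that it parses HTML into a node tree you can query with LINQ or XPath. Here’s a rough taste (example.com is just a stand-in URL, and this snippet assumes it sits inside a method):

// Assumes: using System.Net; and using HtmlAgilityPack;
using (WebClient client = new WebClient())
{
    // Download the raw HTML and parse it into a queryable document tree
    string html = client.DownloadString("http://example.com/");

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    // Query nodes instead of running regular expressions over raw markup
    HtmlNode titleNode = doc.DocumentNode.SelectSingleNode("//title");
    string title = (titleNode != null) ? titleNode.InnerText : "";
}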

And while we’re in the /Helpers folder, we’ll create a couple of helper classes to process the data. Here’s CrawlUtil.cs:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;

using HtmlAgilityPack;


namespace TauCephei.Helpers
{
    public static class CrawlUtil
    {
        /// <summary>
        /// Converts a given URL to a qualified absolute URI
        /// </summary>
        public static Uri ConvertToUri(Uri root, string url)
        {
            Uri uri = new Uri(url, UriKind.RelativeOrAbsolute);

            // Make it absolute if it's relative
            if (!uri.IsAbsoluteUri)
                uri = new Uri(root, uri);

            return uri;
        }

        /// <summary>
        /// Gets the value for the given key from the meta dictionary, falling back to a default
        /// </summary>
        public static string GetMetaKey(Dictionary<string, string> meta, 
            string k, string v, int l = 255)
        {
            if (meta.ContainsKey(k))
                return Util.DefaultFlatString(meta[k], v, l);

            // Default value
            return v;
        }

        /// <summary>
        /// Gets the encoding part of a content-type string,
        /// e.g. "text/html; charset=utf-8" yields "utf-8"
        /// </summary>
        public static string GetEncoding(string enc)
        {
            enc = Util.DefaultFlatString(enc, "utf-8");
            string[] e = enc.Split(';');
            foreach (string c in e)
            {
                if (c.IndexOf("charset") >= 0)
                {
                    enc = c.Split('=')[1].ToLower().Trim();
                }
            }

            return enc;
        }

        /// <summary>
        /// Gets the value of a given HTML tag
        /// </summary>
        public static string GetTagValue(HtmlDocument doc, string tagName)
        {
            string value = "";

            var tag = (from t in doc.DocumentNode.Descendants()
                       where t.Name == tagName &&
                       t.InnerText != null
                       select t).FirstOrDefault();

            // We found a tag and it isn't empty
            if (tag != null)
            {
                if (!String.IsNullOrEmpty(tag.InnerText))
                    value = tag.InnerText;
            }

            return value;
        }


        /// <summary>
        /// Gets the meta tags of a document as a dictionary 
        /// </summary>
        public static Dictionary<string, string> GetMetaTags(HtmlDocument doc)
        {
            Dictionary<string, string> tags = new Dictionary<string, string>();

            // All meta tags
            var meta = (from m in doc.DocumentNode.Descendants()
                        where m.Name == "meta"
                        select m);

            foreach (var tag in meta)
            {
                // Sticking to "name" rather than "http-equiv"
                if (tag.Attributes["name"] != null && tag.Attributes["content"] != null)
                {
                    string cn = tag.Attributes["name"].Value;
                    string cv = tag.Attributes["content"].Value;

                    // No need to add empty meta tags
                    if (!String.IsNullOrEmpty(cn) && !String.IsNullOrEmpty(cv))
                    {
                        // Duplicate tags are ignored
                        if (!tags.ContainsKey(cn))
                            tags.Add(cn, cv);
                    }
                }
            }

            return tags;
        }


        /// <summary>
        /// Gets a friendlier exception message on errors
        /// </summary>
        public static string GetException(WebException ex)
        {
            string exception = "";
            switch (ex.Status)
            {
                case WebExceptionStatus.NameResolutionFailure:
                    exception = "Error : Invalid domain name";
                    break;

                case WebExceptionStatus.Timeout:
                    exception = "Error : Server failed to respond in time";
                    break;

                case WebExceptionStatus.ServerProtocolViolation:
                    exception = "Error : Server violated HTTP protocol";
                    break;

                case WebExceptionStatus.ProtocolError:
                    exception = "Error : " + 
                        CrawlUtil.GetHttpError((HttpWebResponse)ex.Response);
                    break;

                case WebExceptionStatus.UnknownError:
                default:
                    exception = "Error : An unknown error has occured";
                    break;
            }

            return exception;
        }

        /// <summary>
        /// Gets an http error status
        /// </summary>
        public static string GetHttpError(HttpWebResponse resp)
        {
            string status = "";
            switch (resp.StatusCode)
            {
                case HttpStatusCode.NotFound:
                    status = "404 File Not Found";
                    break;
                case HttpStatusCode.Forbidden:
                    status = "403 Access Denied";
                    break;
                case HttpStatusCode.InternalServerError:
                    status = "500 Server Error";
                    break;
                case HttpStatusCode.Gone:
                    status = "410 Page Removed";
                    break;

                // This really shouldn't happen...
                case HttpStatusCode.BadRequest:
                    status = "400 Bad Request";
                    break;

                // If it doesn't fit any of the above
                default:
                    status = resp.StatusDescription;
                    break;
            }

            return status;
        }

    }
}
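
Just to show where this is headed, here’s a rough, hypothetical sketch (not the actual crawler loop, which comes later) of how these helpers could fill an Entry from a downloaded page:

// Illustrative only: assumes usings for System.Net, HtmlAgilityPack,
// TauCephei.Crawler and TauCephei.Helpers, and that this sits inside a method
string url = "http://example.com/";
string html;
using (WebClient client = new WebClient())
{
    html = client.DownloadString(url);
}

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

Entry entry = new Entry();
entry.Url = url;
entry.LastCrawl = DateTime.UtcNow;

entry.Title = CrawlUtil.GetTagValue(doc, "title");
entry.Meta = CrawlUtil.GetMetaTags(doc);
entry.Description = CrawlUtil.GetMetaKey(entry.Meta, "description", "");
entry.Author = CrawlUtil.GetMetaKey(entry.Meta, "author", "");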

Eagle-eyed readers will notice the Util class being referenced in CrawlUtil.cs. For our purposes, we’ll only need a few functions from it.
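
Util.cs is covered next, but for context, DefaultFlatString is assumed here to fall back to a default value when the input is empty and to flatten and trim it to a maximum length. A stand-in could look roughly like this (not the actual implementation):

// Hypothetical stand-in for Util.DefaultFlatString (see Util.cs for the real one);
// it would live in a static Util class
public static string DefaultFlatString(string s, string defaultValue, int maxLength = 255)
{
    if (String.IsNullOrEmpty(s))
        return defaultValue;

    // Collapse runs of whitespace, trim, and cap the length
    s = System.Text.RegularExpressions.Regex.Replace(s, @"\s+", " ").Trim();
    return (s.Length > maxLength) ? s.Substring(0, maxLength) : s;
}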

Onward to Util.cs →

Comments


  1. Nice solution, but after publishing it on a server we hit an issue: after a certain number of calls, Google shows its unusual-traffic CAPTCHA page, and only after entering the CAPTCHA shown in the image can you view your results.
    Is there a way to bypass that CAPTCHA and view the results directly?

    Please provide me with a solution for this problem.
    Raj Mouli (mouli.raji@gmail.com)

    • Hi Raj,

      Unfortunately, there’s no direct or simple way to do that, as CAPTCHAs are specifically designed to make sure you’re not a crawler or bot ;)

      There are ways around that, but they get into optical character recognition and a whole heap of programming that goes beyond a basic crawler. Maybe I’ll do a post about it in the future, but in the meantime, you can look into “CAPTCHA auto fill” or something similar.
