Bank of America Bot Cares About You

I’ll let the horror of automation speak for itself. Not that you need further reassurance that there are no “souls” aboard this ship.

WARNING: Some naughty language (this is completely uncensored).

  • We’d be happy to review your account with you to discuss any concerns. Please let us know if you need assistance.
  • We are here to help, listen, and learn from our customers and are glad to assist with any account related inquiries.
  • Thank you for following up. Have a great weekend!
  • I work for Bank of America, is there anything I can do to help?
  • Hi [name], what happened? Anything I can do to help?
  • Thank you for the feedback, we will share it with our leadership team.

I took a bunch of screenshots and stitched them together in Photoshop. The screenshots themselves were completely unaltered.

A C# web crawler with page details

It's been a while since we got back to basics here. While reading up on meta tags, I came across a post by R. Reid on the robots.txt convention, along with a simple class to parse the file.
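
For the uninitiated: robots.txt is just a plain text file at the site root listing paths that crawlers are asked to skip. Here's a deliberately naive sketch of the idea (my own quick take, not R. Reid's class; it ignores Allow rules, wildcards and per-bot sections):

using System;
using System.Collections.Generic;
using System.Net;

public static class RobotsUtil
{
    /// <summary>
    /// Gets the Disallow paths listed under "User-agent: *"
    /// </summary>
    public static List<string> GetDisallowed(Uri root)
    {
        List<string> disallowed = new List<string>();
        bool inStarSection = false;
        string robots;

        using (WebClient client = new WebClient())
        {
            try
            {
                robots = client.DownloadString(new Uri(root, "/robots.txt"));
            }
            catch (WebException)
            {
                // No robots.txt means nothing is off limits
                return disallowed;
            }
        }

        foreach (string raw in robots.Split('\n'))
        {
            string line = raw.Trim();

            if (line.StartsWith("User-agent:", StringComparison.OrdinalIgnoreCase))
                inStarSection = line.EndsWith("*");
            else if (inStarSection &&
                line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
                disallowed.Add(line.Substring("Disallow:".Length).Trim());
        }

        return disallowed;
    }
}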

Normally I would just take a peek at the source and move on, but since this place has been bland lately, I thought I'd write up a quick crawler. But I didn't just want a class that would pull up a list of links on a page and move on to crawl each one. I wanted something that would put the grabbed data into some useful form, maybe an object that can be stored later. So here's the start of a basic console application to do just that.

I was almost tempted to call this “theCrawler” in keeping with my previous project “theForum”, but I thought I might as well get fancy. So here we have the Entry class for “TauCephei” (yes, I just put that name together in a few seconds).

Update:
Um… If someone suddenly found a spike in traffic from a no-name bot a little while ago… Well, now you know why.

Warning: I finished this in just under a couple of hours, so you'll want to do a LOT of bug testing, scrubbing and polishing before using this in an actual project.

THE SOFTWARE IS PROVIDED “AS IS” AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace TauCephei.Crawler
{
    public class Entry
    {
        public string Id { get; set; }
        public string Url { get; set; }

        // Meta and description information
        public string Title { get; set; }
        public string Description { get; set; }
        public string Abstract { get; set; }

        public string Author { get; set; }
        public string Copyright { get; set; }

        public Dictionary<string, string> Meta { get; set; }
        public string Encoding { get; set; }

        public List<string> Keywords { get; set; }


        // Internal statistics
        public DateTime LastCrawl { get; set; }
        public float Rank { get; set; }

        // Link data
        public List<string> LocalLinks { get; set; }
        public List<string> ExternalLinks { get; set; }
        

        public string BodyHtml { get; set; }
        public string BodyText { get; set; }

        // Constructor
        public Entry()
        {
            this.Meta = new Dictionary<string, string>();

            this.Keywords = new List<string>();
            this.LocalLinks = new List<string>();
            this.ExternalLinks = new List<string>();
        }
    }
}

The Entry class is the object equivalent of a web document, with just the relevant details necessary for sorting and retrieval. One page = one entry.
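
To make that concrete, here's roughly how one gets filled in during a crawl (the values here are made up, just to show the shape of the data):

Entry entry = new Entry();
entry.Id = Guid.NewGuid().ToString();
entry.Url = "http://www.example.com/";
entry.Title = "Example Domain";
entry.Encoding = "utf-8";
entry.LastCrawl = DateTime.UtcNow;

// The collections are already initialized by the constructor
entry.Keywords.Add("example");
entry.Meta.Add("description", "A placeholder page");
entry.LocalLinks.Add("http://www.example.com/about");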

But of course, we'll need to extract the content of a webpage in a meaningful way. Traditionally, this was done with a whole heap of Regular Expressions in your classes, but I opted to use the ready-made HtmlAgilityPack instead. Just download the sources and copy the needed code files into your project. I created a folder called /Lib in my project and a subfolder called /Helpers; the HtmlAgilityPack files went in their own folder inside.
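
If you haven't used HtmlAgilityPack before, the basic pattern looks like this (HtmlWeb and HtmlDocument are its classes; the URL is just a placeholder):

using System.Linq;
using HtmlAgilityPack;

// Fetch and parse a page in one step
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.example.com/");

// The DOM is now queryable without a single Regular Expression
string title = doc.DocumentNode.Descendants("title")
    .Select(t => t.InnerText)
    .FirstOrDefault();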

And while we're in the /Helpers folder, we'll create a couple of helper classes to process the data. Here's CrawlUtil.cs:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;

using HtmlAgilityPack;


namespace TauCephei.Helpers
{
    public static class CrawlUtil
    {
        /// <summary>
        /// Converts a given URL to a qualified absolute URI
        /// </summary>
        public static Uri ConvertToUri(Uri root, string url)
        {
            Uri uri = new Uri(url, UriKind.RelativeOrAbsolute);

            // Make it absolute if it's relative
            if (!uri.IsAbsoluteUri)
                uri = new Uri(root, uri);

            return uri;
        }

        /// <summary>
        /// Gets the value for the given key from the meta dictionary,
        /// or the default value v if the key is missing
        /// </summary>
        public static string GetMetaKey(Dictionary<string, string> meta, 
            string k, string v, int l = 255)
        {
            if (meta.ContainsKey(k))
                return Util.DefaultFlatString(meta[k], v, l);

            // Default value
            return v;
        }

        /// <summary>
        /// Gets the encoding part of a given content type string,
        /// e.g. "text/html; charset=utf-8" (the last part)
        /// </summary>
        public static string GetEncoding(string enc)
        {
            enc = Util.DefaultFlatString(enc, "utf-8");
            string[] e = enc.Split(';');
            foreach (string c in e)
            {
                if (c.IndexOf("charset") >= 0)
                {
                    // Guard against a malformed value with no '='
                    string[] kv = c.Split('=');
                    if (kv.Length > 1)
                        enc = kv[1].ToLower().Trim();
                }
            }

            return enc;
        }

        /// <summary>
        /// Gets the value of a given HTML tag
        /// </summary>
        public static string GetTagValue(HtmlDocument doc, string tagName)
        {
            string value = "";

            var tag = (from t in doc.DocumentNode.Descendants()
                       where t.Name == tagName &&
                       t.InnerText != null
                       select t).FirstOrDefault();

            // We found a tag and it isn't empty
            if (tag != null)
            {
                if (!String.IsNullOrEmpty(tag.InnerText))
                    value = tag.InnerText;
            }

            return value;
        }


        /// <summary>
        /// Gets the meta tags of a document as a dictionary 
        /// </summary>
        public static Dictionary<string, string> GetMetaTags(HtmlDocument doc)
        {
            Dictionary<string, string> tags = new Dictionary<string, string>();

            // All meta tags
            var meta = (from m in doc.DocumentNode.Descendants()
                        where m.Name == "meta"
                        select m);

            foreach (var tag in meta)
            {
                // Sticking to "name" rather than "http-equiv"
                if (tag.Attributes["name"] != null && tag.Attributes["content"] != null)
                {
                    string cn = tag.Attributes["name"].Value;
                    string cv = tag.Attributes["content"].Value;

                    // No need to add empty meta tags
                    if (!String.IsNullOrEmpty(cn) && !String.IsNullOrEmpty(cv))
                    {
                        // Duplicate tags are ignored
                        if (!tags.ContainsKey(cn))
                            tags.Add(cn, cv);
                    }
                }
            }

            return tags;
        }


        /// <summary>
        /// Gets a friendlier exception message on errors
        /// </summary>
        public static string GetException(WebException ex)
        {
            string exception = "";
            switch (ex.Status)
            {
                case WebExceptionStatus.NameResolutionFailure:
                    exception = "Error : Invalid domain name";
                    break;

                case WebExceptionStatus.Timeout:
                    exception = "Error : Server failed to respond in time";
                    break;

                case WebExceptionStatus.ServerProtocolViolation:
                    exception = "Error : Server violated HTTP protocol";
                    break;

                case WebExceptionStatus.ProtocolError:
                    exception = "Error : " + 
                        CrawlUtil.GetHttpError((HttpWebResponse)ex.Response);
                    break;

                case WebExceptionStatus.UnknownError:
                default:
                    exception = "Error : An unknown error has occured";
                    break;
            }

            return exception;
        }

        /// <summary>
        /// Gets an http error status
        /// </summary>
        public static string GetHttpError(HttpWebResponse resp)
        {
            string status = "";
            switch (resp.StatusCode)
            {
                case HttpStatusCode.NotFound:
                    status = "404 File Not Found";
                    break;
                case HttpStatusCode.Forbidden:
                    status = "403 Access Denied";
                    break;
                case HttpStatusCode.InternalServerError:
                    status = "500 Server Error";
                    break;
                case HttpStatusCode.Gone:
                    status = "410 Page Removed";
                    break;

                // This really shouldn't happen...
                case HttpStatusCode.BadRequest:
                    status = "400 Bad Request";
                    break;

                // If it doesn't fit any of the above
                default:
                    status = resp.StatusDescription;
                    break;
            }

            return status;
        }

    }
}
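
As a quick sanity check, here's how those helpers chain together on a live page (a rough usage sketch; it leans on the Util class we haven't written yet, and skips error handling entirely):

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.example.com/");

Entry entry = new Entry();
entry.Url = "http://www.example.com/";
entry.Title = CrawlUtil.GetTagValue(doc, "title");
entry.Meta = CrawlUtil.GetMetaTags(doc);
entry.Description = CrawlUtil.GetMetaKey(entry.Meta, "description", "");
entry.LastCrawl = DateTime.UtcNow;

// Resolve a relative link against the page root
Uri abs = CrawlUtil.ConvertToUri(new Uri(entry.Url), "/about.html");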

Eagle-eyed readers will have noticed the Util class being referenced here. For our purposes, we'll just need a few functions from it.

Onward to Util.cs -->