Sometimes you need to get some specific information out of the tag soup of the Internet. You have two options here – either use an existing solution or simply write your own. Generally the first option seems more reasonable (why reinvent the wheel), but your requirements might be very specific or… it just seems like a lot of fun! Since I’ve chosen the second path, I’m going to share some insights useful particularly for .NET developers.
Important: if you need a scalable, large, Google-like spider, then you should really consider looking at existing solutions. My thoughts are relevant for a scanner able to process a few hundred pages per minute on a single machine, given a list of root nodes to scan.
When I mentioned at a Software Craftsman meeting that I’m now writing a web scanner, the guy sitting next to me looked at me with disbelief and said “Dude, 1995 is over”. There is always the question of whether to create something from scratch that perfectly suits your needs, or to use a ready-made solution. There is a golden sentence that every consultant overuses: “it depends”. In the past I was part of a team that in two years successfully crafted and implemented an ERP solution, after a previous implementation of ready-made software had failed miserably. OK, enough philosophy, let’s get down to business.
Getting a web page.
Is as easy as this:
var webRequest = (HttpWebRequest)WebRequest.Create(url);
webRequest.Method = "GET";
webRequest.UserAgent = "Mozilla/5.0 (compatible; mycrawler/1.0)";
var response = webRequest.GetResponse();
Then you just need to call GetResponseStream on your response object, handle all exceptions, and check the content type (we prefer something that contains text/html) before parsing the text.
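A minimal sketch of that step, continuing from the snippet above (exception handling is omitted, and relying on StreamReader’s default encoding is a simplification – real pages declare their own charsets):
// needs: using System.IO;
string html = null;
// we only want to parse HTML - not images, PDFs, zip archives and the like
if (response.ContentType != null && response.ContentType.Contains("text/html"))
{
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        html = reader.ReadToEnd();
    }
}
response.Close();
And then we can parse our text. But…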
Regular expressions are not the best solution.
It’s usually the first thought you get when approaching this problem. Looking for href seems like a piece of cake. But it’s not – I’m not going to dig into this topic – I think the best answer on StackOverflow is going to convince you; if not, check Jeff Atwood’s post.
So what is our quick and lazy alternative? Mine was HtmlAgilityPack. What is it?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
Getting all anchor tags from a webpage is as easy as this (a sketch, assuming html holds the page text we downloaded earlier):
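// needs: using System; using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(html);
// every <a> element that actually has an href attribute
var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
foreach (var anchor in anchors)
{
    Console.WriteLine(anchor.Attributes["href"].Value);
}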
Looking for links.
Not only <A /> tags are links: <AREA /> tags (inside <MAP />) are too. You will also have to follow frames and pages with instant redirection through the META refresh tag. In all these cases HtmlAgilityPack is very helpful:
var hrefs = doc.DocumentNode.SelectNodes("//a[@href] | //area[@href]")
    .Select(x => x.Attributes["href"]);
var srcs = doc.DocumentNode.SelectNodes("//frame[@src] | //iframe[@src]")
    .Select(x => x.Attributes["src"]);
* the code above is missing null checks, etc.
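META refresh redirects take a little more work – a rough sketch (the content attribute typically looks like "0; url=http://example.com/", so the target still has to be pulled out of the string):
// needs: using System; using System.Linq;
// http-equiv may be written as "Refresh", hence the case folding in the XPath
var metaRefresh = doc.DocumentNode.SelectSingleNode(
    "//meta[translate(@http-equiv,'REFSH','refsh')='refresh']");
if (metaRefresh != null)
{
    var content = metaRefresh.GetAttributeValue("content", string.Empty);
    var urlPart = content.Split(';')
        .Select(p => p.Trim())
        .FirstOrDefault(p => p.StartsWith("url=", StringComparison.OrdinalIgnoreCase));
    if (urlPart != null)
    {
        var target = urlPart.Substring("url=".Length);
        // treat target like any other link found on the page
    }
}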
Pages contain both absolute and relative URLs, and we really don’t want to hand-roll string parsing to deal with that kind of stuff. System.Uri comes to the rescue. Some useful code snippets utilizing this class:
// Is our text an absolute URL?
if (Uri.TryCreate(urlCandidate, UriKind.Absolute, out url)) return url;
// If not - is it a relative URL?
// Try to create one using the URL of the current page as the base
if (Uri.TryCreate(baseUrl, urlCandidate, out url)) return url;
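Wrapped into a small helper it could look roughly like this (TryResolveUrl is just my own name for it):
// needs: using System;
static Uri TryResolveUrl(Uri baseUrl, string urlCandidate)
{
    Uri url;
    // Is our text an absolute URL?
    if (Uri.TryCreate(urlCandidate, UriKind.Absolute, out url)) return url;
    // If not, try to build one using the URL of the current page as the base
    if (Uri.TryCreate(baseUrl, urlCandidate, out url)) return url;
    // neither worked - nothing we can follow
    return null;
}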
Breadth-first crawling?
For a simple scanner you might prefer breadth-first crawling. You will probably need two collections – one containing pages still to visit, and the other containing pages already visited (we don’t want to end up in a loop). Even for simple solutions, though, we need some degree of parallelism – downloading pages one by one is a very slow business.
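A minimal, single-threaded skeleton of that idea (DownloadHtml and ExtractLinks are hypothetical helpers wrapping the earlier snippets, and rootUrls is whatever list of start pages you have; a real crawler would fetch several pages concurrently):
// needs: using System; using System.Collections.Generic;
var toVisit = new Queue<Uri>(rootUrls);   // pages we still have to fetch
var visited = new HashSet<Uri>();         // pages we have already seen

while (toVisit.Count > 0)
{
    var current = toVisit.Dequeue();
    if (!visited.Add(current)) continue;  // already been here

    var html = DownloadHtml(current);     // hypothetical helper, see above
    if (html == null) continue;           // not HTML, download failed, ...

    foreach (var link in ExtractLinks(current, html))   // hypothetical helper
    {
        if (!visited.Contains(link)) toVisit.Enqueue(link);
    }
}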
When to stop?
It depends on your needs – but if you haven’t set out to create a “better Google”, then you probably want to scan, for example, one specific domain. Here again the Uri class, and especially the Uri.IsBaseOf() method, is very useful: we can check every link before adding it to the “to visit” list. But we should definitely add some other stop criteria too – a visit counter for specific URLs (calendar controls are sometimes a trap for a crawler), time spent inside a specific domain, etc.
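For example, the domain check could look roughly like this, reusing the toVisit and visited collections from the previous sketch (the root URL is of course whatever site you started from):
var rootUrl = new Uri("http://example.com/");
// only follow links that stay inside the site we started from
if (rootUrl.IsBaseOf(link) && !visited.Contains(link))
{
    toVisit.Enqueue(link);
}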
Getting information out of HTML soup.
Usually you are building your scanner to get something specific: looking for a specific phrase, a specific link or element, downloading images, etc. And again HtmlAgilityPack is a blessing here (seriously, why would you do this with regular expressions?). For example:
// getting the alt attributes of all images
var alts = doc.DocumentNode.SelectNodes("//img[@alt]").Select(x => x.Attributes["alt"].Value);
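Or, to pick another made-up example, finding every text node that contains a given phrase:
// find all text nodes containing a specific phrase ("some phrase" is just a placeholder)
var hits = doc.DocumentNode.SelectNodes("//text()[contains(., 'some phrase')]");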
Should I do it?
Well, again, it depends (I’m good material for a consultant :-). If you want to crawl and parse tens of thousands of pages a minute, then probably not. Parallel computing is not an easy piece of cake; it has been thought through and developed by many very smart people – and you can get the results of their work as open source (not in C#, though). But if you want to scan a reasonable number of pages, you want to get specific information, and the robots.txt on your destination site does not forbid it, then go ahead. And good luck.