Using the HtmlAgilityPack

Thursday, February 09, 2006 12:00:00 AM (Central Standard Time, UTC-06:00)
Parsing HTML with regular expressions can be an excruciating experience.  The cryptic syntax can take hours to get right.  What if you could treat malformed HTML like XML and use XPath to select the tags you were looking for?
 
The HtmlAgilityPack allows you to do just that.  It parses the HTML document and makes it navigable via XPath (it implements the IXPathNavigable interface).  It can also download the HTML from a URL.  After the document has been parsed, you can update it.  In addition, there are examples of how to:
 
  • Harvest links from an HTML document.
  • Convert the HTML document to text.
  • Convert the HTML document to XML.
  • Create an RSS feed from existing HTML content.
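 
To make the XPath idea concrete, here is a minimal sketch of the first item, harvesting links.  It is not taken from the samples that ship with the library; the class name LinkHarvester and the method GetLinks are hypothetical names chosen for illustration, and the code assumes the HtmlAgilityPack assembly is referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the HtmlAgilityPack assembly is referenced

// Hypothetical helper for illustration; not part of the library itself.
public static class LinkHarvester
{
    // Returns the href value of every <a> element that carries one.
    public static List<string> GetLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of unclosed or badly nested tags

        var links = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (HtmlNode anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```

Calling LinkHarvester.GetLinks with the text of a page returns the link targets in document order.  To start from a URL instead of a string, the library's HtmlWeb class can load the document directly, for example new HtmlWeb().Load(url).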
 
I recently used it to create an implementation of RFC 2557, MIME Encapsulation of Aggregate Documents, better known as MHTML or Web Archive (.mht) files.  These are the same files you create in Internet Explorer by selecting File --> Save As and then changing the file type to Web Archive. 
 
The implementation I wrote is based on the article "Convert any URL to a MHTML archive using native .NET code", which I found on CodeProject a while back.  The CodeProject implementation was written in VB.NET and relied heavily on regular expressions.  In addition, the first version of that implementation was file based and required all of the files to be written to disk before they were embedded into a single .mht file.  The later version offers an in-memory option, so the files no longer need to be persisted to disk. 
 
The implementation I wrote is written in C# and relies on the HtmlAgilityPack to do the heavy lifting of parsing the HTML for external resources.  I did have to use one regular expression to find the @import directives within a style tag. 
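 
As a rough illustration of that last step, the sketch below pulls the targets of @import directives out of a block of CSS text.  The pattern is an assumption written for this example, not the expression used in the actual implementation, and CssImportFinder and FindImports are hypothetical names.  It covers the common url(...) and quoted forms but does not attempt to handle URLs that contain spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the pattern is an assumption,
// not the regular expression from the original implementation.
public static class CssImportFinder
{
    // Matches: @import url("a.css");  @import url(a.css);  @import 'a.css' print;
    private static readonly Regex ImportPattern = new Regex(
        @"@import\s*(?:url\(\s*)?[""']?(?<target>[^""')\s;]+)",
        RegexOptions.IgnoreCase);

    // Returns the imported stylesheet targets in the order they appear.
    public static List<string> FindImports(string css)
    {
        var targets = new List<string>();
        foreach (Match match in ImportPattern.Matches(css))
        {
            targets.Add(match.Groups["target"].Value);
        }
        return targets;
    }
}
```

The CSS text would typically come from the inner text of each style node selected with XPath, so the regular expression only ever sees stylesheet content rather than the whole page.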
 
Here is the code to create an .mht file:
 
// Page to capture
string htmlLocation = "http://blogs.technet.com/justinbraun/archive/2005/11/07/413859.aspx";
 
WebResourceDownloader downloader = new WebResourceDownloader();
MhtDocument mht = new MhtDocument(htmlLocation, downloader);

// Build the archive as a string, then write it to disk
string result = mht.CreateMht();
System.IO.File.WriteAllText("justinbraun.mht", result);
 
Below is an example capture of a web page using the implementation I wrote.  It's a snapshot of my brother's blog.