Posted by Joe | Posted in C# | Posted on 10-07-2011
The other day I was experimenting with full-text search in MongoDB (which I’ll probably write about in the future) and wanted to take some HTML content, strip out the tags, extract the actual text, and split the words into an array.
I wrote the following extension method, which seems to do the trick. I haven’t tested it much, so there are likely some scenarios where it won’t work, but it has worked fine on everything I’ve tried.
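using System;
using System.Text.RegularExpressions;
using System.Web;

// Extension methods must be declared in a static class
public static class StringExtensions
{
    public static string[] Tokenize(this string value)
    {
        //Replace Html tags with a space so adjacent elements don't fuse into one word;
        //Singleline lets a tag span a line break
        value = Regex.Replace(value, @"<.*?>", " ", RegexOptions.Singleline);
        //Decode Html entities such as &amp;
        value = HttpUtility.HtmlDecode(value);
        //Remove everything but word characters (letters, digits, underscore) and whitespace
        value = Regex.Replace(value, @"[^\w\s]", string.Empty);
        //Collapse runs of whitespace into a single space
        value = Regex.Replace(value, @"\s+", " ");
        //Trim, set to lower case and split to array, skipping empty entries
        return value.Trim().ToLower().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}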
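To see what comes out the other end, here is a rough, self-contained sketch you can paste into a console app. It follows the same five steps, with a few choices that are assumptions of this sketch rather than anything essential. It uses WebUtility.HtmlDecode from System.Net in place of HttpUtility.HtmlDecode so that no System.Web reference is needed. It replaces tags with a space and drops empty entries so that adjacent block elements don't fuse into one word. The class names TokenizeSketch and Demo are made up for the example.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TokenizeSketch
{
    // Same pipeline as in the post: strip tags, decode entities,
    // drop punctuation, collapse whitespace, lower-case, split.
    public static string[] Tokenize(this string value)
    {
        // Assumption of this sketch: tags become a space, not an empty string
        value = Regex.Replace(value, @"<.*?>", " ", RegexOptions.Singleline);
        // Assumption of this sketch: WebUtility stands in for HttpUtility
        value = WebUtility.HtmlDecode(value);
        value = Regex.Replace(value, @"[^\w\s]", string.Empty);
        value = Regex.Replace(value, @"\s+", " ");
        return value.Trim().ToLower()
                    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}

public static class Demo
{
    public static void Main()
    {
        string html = "<p>Hello, <b>World</b> &amp; friends!</p>";
        // prints: hello|world|friends
        Console.WriteLine(string.Join("|", html.Tokenize()));

        // Adjacent block elements stay separate words
        // prints: one|two
        Console.WriteLine(string.Join("|", "<p>one</p><p>two</p>".Tokenize()));
    }
}
```

One trade-off of replacing tags with a space: markup in the middle of a word, such as <b>W</b>orld, ends up as two tokens ("w" and "orld"). For search indexing that seemed the lesser evil compared with gluing whole paragraphs together, but it's worth knowing about.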



In which library is HttpUtility?
Hero
It’s in the System.Web namespace, so you’ll need a reference to the System.Web assembly:
http://msdn.microsoft.com/en-us/library/system.web.httputility.aspx
Joe