Extract content from HTML and split words into an array using C#


Posted by Joe | Posted in C# | Posted on 10-07-2011

The other day I was experimenting with full-text search in MongoDB (which I’ll probably write about in the future) and wanted to take some HTML content, strip out all the tags, and split the remaining text into an array of words.

I wrote the following extension method, which seems to do the trick. I haven’t tested it extensively, so there are likely some scenarios where it won’t work, but it has worked fine for the inputs I’ve tried.

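using System;
using System.Text.RegularExpressions;
using System.Web; // HttpUtility lives in the System.Web namespace (System.Web.dll on .NET Framework)

public static class StringExtensions
{
    public static string[] Tokenize(this string value)
    {
        //Nothing to tokenize
        if (string.IsNullOrEmpty(value))
            return new string[0];

        //Replace Html tags with a space so words in adjacent elements stay separate
        //(Singleline lets a tag span several lines)
        value = Regex.Replace(value, @"<.*?>", " ", RegexOptions.Singleline);

        //Decode Html entities such as &amp;
        value = HttpUtility.HtmlDecode(value);

        //Remove everything but word characters (letters, digits, underscore) and whitespace
        value = Regex.Replace(value, @"[^\w\s]", string.Empty);

        //Collapse runs of whitespace into a single space
        value = Regex.Replace(value, @"\s+", " ");

        //Trim, set to lower case and split to array, dropping empty entries
        return value.Trim().ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}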
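
To try it out quickly, here is a rough, self-contained console sketch of the same pipeline. A few things in it are my assumptions rather than part of the method above: the class name TokenizeDemo and the sample HTML are made up, tags are replaced with a space, and WebUtility.HtmlDecode (from System.Net) is swapped in for HttpUtility.HtmlDecode so the sample compiles without a reference to System.Web.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical demo class; the name is only for illustration.
public static class TokenizeDemo
{
    // Compact variant of the tokenizer. Assumption: WebUtility.HtmlDecode
    // stands in for HttpUtility.HtmlDecode to avoid the System.Web reference.
    public static string[] Tokenize(this string value)
    {
        // Replace tags with a space so neighbouring elements do not run together
        value = Regex.Replace(value, @"<.*?>", " ", RegexOptions.Singleline);

        // Decode entities such as &amp;
        value = WebUtility.HtmlDecode(value);

        // Drop everything that is not a word character or whitespace
        value = Regex.Replace(value, @"[^\w\s]", string.Empty);

        // Collapse runs of whitespace into a single space
        value = Regex.Replace(value, @"\s+", " ");

        // Trim, lower-case and split, dropping empty entries
        return value.Trim().ToLowerInvariant()
                    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // Made-up sample input
        string html = "<p>Hello, <b>World</b>! Tea &amp; biscuits.</p>";

        string[] tokens = html.Tokenize();

        // Prints: hello, world, tea, biscuits
        Console.WriteLine(string.Join(", ", tokens));
    }
}
```

Note that punctuation is removed rather than replaced, so a word like "don't" comes out as the single token "dont".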


Comments (2)

In which library is HttpUtility?
