Extract content from HTML and split words into an array using C#

The other day I was experimenting with full-text search in MongoDB (which I’ll probably write about in the future) and wanted to take some HTML content, strip out all the HTML tags, decode the remaining text, and split the words into an array.

I wrote the following extension method, which seems to do the trick. I haven’t tested it extensively, so there are likely some inputs it won’t handle, but it has worked fine for everything I’ve tried.

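using System;
using System.Text.RegularExpressions;
using System.Web; // HttpUtility lives in the System.Web namespace

// Extension methods must be declared in a static class (the class name is arbitrary)
public static class StringExtensions
{
    public static string[] Tokenize(this string value)
    {
        //Remove Html tags
        value = Regex.Replace(value, @"<.*?>", string.Empty);

        //Decode Html entities (e.g. &amp; becomes &)
        value = HttpUtility.HtmlDecode(value);

        //Remove everything but word characters (letters, digits, underscore) and whitespace
        value = Regex.Replace(value, @"[^\w\s]", string.Empty);

        //Collapse runs of whitespace into a single space
        value = Regex.Replace(value, @"\s+", " ");

        //Set to lower case and split to array, skipping empty entries
        return value.ToLower().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}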
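
As a quick sanity check, here is a minimal, self-contained sketch that runs a condensed version of the same steps on a small sample. The class name TokenizeDemo and the sample HTML are made up for illustration, and WebUtility.HtmlDecode (from System.Net) is swapped in for HttpUtility.HtmlDecode purely so the sketch builds without a reference to System.Web.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical demo class; the name is not from the original post.
public static class TokenizeDemo
{
    // Condensed version of the steps above. WebUtility.HtmlDecode is an
    // assumed stand-in for HttpUtility.HtmlDecode.
    public static string[] Tokenize(this string value)
    {
        value = Regex.Replace(value, @"<.*?>", string.Empty);   // strip tags
        value = WebUtility.HtmlDecode(value);                   // decode entities
        value = Regex.Replace(value, @"[^\w\s]", string.Empty); // drop punctuation
        value = Regex.Replace(value, @"\s+", " ");              // collapse whitespace
        return value.Trim().ToLower().Split(' ');
    }

    public static void Main()
    {
        // Made-up sample input
        string html = "<p>Hello, <b>World</b>! Fish &amp; chips.</p>";

        string[] words = html.Tokenize();

        // Prints: hello|world|fish|chips
        Console.WriteLine(string.Join("|", words));
    }
}
```

Note that the decoded "&" is removed by the punctuation step, and the double space it leaves behind is collapsed before the split.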
Posted by Joe in C#

3 Responses to Extract content from HTML and split words into an array using C#

  1. Hero

    In which library is HttpUtility?

  2. matbaa

    was a very useful application code
