Find unique words in a string using Regular Expressions

Regular expressions provide a flexible and efficient way to process text. Their extensive pattern-matching notation lets you quickly parse large amounts of text to find specific character patterns and to extract, edit, replace, or delete substrings. Here I show how to locate unique words in a text document. If you are unfamiliar with regular expressions, see my regular expressions page.

One way to locate all unique words in a string or document is to use a hashtable.

    Imports System.Text.RegularExpressions

    Dim text As String = "one two three two four five three tone"
    Dim re As New Regex("\w+")  ' \w+ matches a run of word characters
    Dim words As New Hashtable

    ' Add each word the first time it is seen; repeats are skipped.
    For Each m As Match In re.Matches(text)
        If Not words.Contains(m.Value) Then
            words.Add(m.Value, Nothing)
        End If
    Next

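In the end, the hashtable will contain the keys "one", "two", "three", "four", "five" and "tone" (a Hashtable does not guarantee the order in which they are enumerated). You can find the same set of words using the following regular expression: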

    Imports System.Text.RegularExpressions

    Dim text As String = "one two three two four five three tone"
    Dim pattern As String = "(?<word>\b\w+\b)(?!.+\b\k<word>\b)"
    Dim re As New Regex(pattern)

    For Each m As Match In re.Matches(text)
        Console.Write(m.Value & " ")
    Next

The "(?<word>\b\w+\b)" part of the pattern matches a sequence of word characters "(\w+)" between word boundaries "(\b)" and assigns the sequence the name word. The negative lookahead "(?!...)" means that the word just found must not be followed by another occurrence of itself "(\k<word>)", no matter which characters "(.+)" lie in between. Because "." does not match a newline by default, pass RegexOptions.Singleline if the text spans several lines.

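The regular expression reports each distinct word once, at its last occurrence, so the words may come out in a different order than they first appear. The "\b" pattern prevents partial matches ("one" will not match the end of "tone").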
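To try the pattern end to end, here is a minimal console sketch. The module name UniqueWordsDemo is my own placeholder and the sample string is the one used earlier; treat it as an illustration rather than a finished utility.

    Imports System
    Imports System.Text.RegularExpressions

    Module UniqueWordsDemo   ' hypothetical name, for illustration only
        Sub Main()
            Dim text As String = "one two three two four five three tone"
            ' Keep a word only if no later whole-word copy of it follows.
            Dim re As New Regex("(?<word>\b\w+\b)(?!.+\b\k<word>\b)")

            For Each m As Match In re.Matches(text)
                Console.Write(m.Value & " ")
            Next
            Console.WriteLine()
            ' Prints: one two four five three tone
        End Sub
    End Module

The first "two" and the first "three" are skipped because a later copy exists, which is why "three" appears after "five" in the output.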

You can also display all unique dates of the form mm-dd-yy in a string using the pattern: "(?<date>\d\d-\d\d-\d\d)(?!.+\k<date>)".
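As a quick check of the date pattern, the sketch below runs it over a made-up string of dates. The module name and the sample dates are assumptions of mine, chosen only to show one repeated date.

    Imports System
    Imports System.Text.RegularExpressions

    Module UniqueDatesDemo   ' hypothetical name, for illustration only
        Sub Main()
            ' Sample data invented for this example; 01-15-09 appears twice.
            Dim text As String = "01-15-09 02-20-09 01-15-09 03-05-09"
            ' Keep a date only if the same date does not appear again later.
            Dim re As New Regex("(?<date>\d\d-\d\d-\d\d)(?!.+\k<date>)")

            For Each m As Match In re.Matches(text)
                Console.Write(m.Value & " ")
            Next
            Console.WriteLine()
            ' Prints: 02-20-09 01-15-09 03-05-09
        End Sub
    End Module

The pattern only checks the digit layout, so it does not verify that the month and day are valid calendar values.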

With a change to the regular expression you can find all the duplicate words in a document using: "(?<word>\b\w+\b)(?=.+\b\k<word>\b)". The positive lookahead "(?=...)" means the matched word must be followed by another instance of itself. Every occurrence that has a later copy is reported, so a word that occurs three times produces two matches.
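The following sketch shows the duplicate pattern on a short string where one word occurs three times. The module name and the sample string are my own assumptions, picked to make the match count easy to follow.

    Imports System
    Imports System.Text.RegularExpressions

    Module DuplicateWordsDemo   ' hypothetical name, for illustration only
        Sub Main()
            ' "a" occurs three times and "b" twice; "c" occurs once.
            Dim text As String = "a b a c a b"
            ' Match a word only if a later whole-word copy of it follows.
            Dim re As New Regex("(?<word>\b\w+\b)(?=.+\b\k<word>\b)")
            Dim found As MatchCollection = re.Matches(text)

            For Each m As Match In found
                Console.Write(m.Value & " ")
            Next
            Console.WriteLine()
            ' Prints: a b a
            Console.WriteLine(found.Count)
            ' Prints: 3
        End Sub
    End Module

The last occurrence of each word is never reported, because nothing follows it. If you want each duplicated word listed only once, you could collect the matches in a hashtable as in the first example.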

If you use this code, please mention "www.TheScarms.com"
