Find unique words in a string using Regular Expressions
Regular expressions provide a flexible and efficient way to process text. Their
extensive pattern-matching notation lets you quickly parse large amounts of
text to find specific character patterns and to extract, edit, replace, or
delete text substrings. Here I show how to locate unique words in a text
document. If you are unfamiliar with regular expressions, see my
regular expressions page.
One way to locate all unique words in a string or document is to use a
hashtable.
Imports System.Collections
Imports System.Text.RegularExpressions
Dim text As String = "one two three two four five three tone"
Dim re As New Regex("\w+") ' \w+ matches a run of word characters
Dim words As New Hashtable
For Each m As Match In re.Matches(text)
    ' Add each word only the first time it is seen
    If Not words.Contains(m.Value) Then
        words.Add(m.Value, Nothing)
    End If
Next
In the end, the hashtable will contain the keys "one two three four five tone"
(a Hashtable does not guarantee the order in which its keys are enumerated).
You can obtain the same set of words using the following regular expression:
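Imports System.Text.RegularExpressions
Dim text As String = "one two three two four five three tone"
' Match a word only if the same word does not occur again later in the text
Dim pattern As String = "(?<word>\b\w+\b)(?!.+\b\k<word>\b)"
Dim re As New Regex(pattern)
For Each m As Match In re.Matches(text)
    Console.Write(m.Value & " ") ' prints: one two four five three tone
Next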
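The "(?<word>\b\w+\b)" part of the pattern matches a sequence of word
characters "(\w+)", that is letters, digits, and underscores, delimited by word
boundaries "(\b)" and assigns the sequence the name word. The negative lookahead
"(?!...)" means that the word just found must not be followed by another
occurrence of itself "(\k<word>)", even if there are other characters "(.+)" in
between. Only the last occurrence of each word therefore matches, so the words
are returned in the order of their last occurrence.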
The regular expression finds all unique words. The "\b" pattern prevents partial
matches ("one" will not match the end of "tone").
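To see what the word boundaries contribute, the following minimal sketch runs the pattern with and without "\b" on the short string "one tone". The module name WordBoundaryDemo and the helper MatchValues are hypothetical names chosen for this illustration, and the boundary-free pattern is a variant written only for comparison.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module WordBoundaryDemo
    ' Hypothetical helper: collects the value of every match of pattern in source.
    Function MatchValues(source As String, pattern As String) As List(Of String)
        Dim values As New List(Of String)
        For Each m As Match In Regex.Matches(source, pattern)
            values.Add(m.Value)
        Next
        Return values
    End Function

    Sub Main()
        Dim source As String = "one tone"
        ' Pattern from the article: whole words only.
        Dim withBoundaries As String = "(?<word>\b\w+\b)(?!.+\b\k<word>\b)"
        ' Comparison variant without \b: partial matches are allowed.
        Dim withoutBoundaries As String = "(?<word>\w+)(?!.+\k<word>)"

        ' prints: one tone
        Console.WriteLine(String.Join(" ", MatchValues(source, withBoundaries)))
        ' prints: tone ("one" is rejected because it reappears inside "tone")
        Console.WriteLine(String.Join(" ", MatchValues(source, withoutBoundaries)))
    End Sub
End Module
```

Note that "." does not match a newline by default, so for text that spans several lines the lookahead only sees the rest of the current line unless RegexOptions.Singleline is passed to the Regex.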
You can also display all unique dates in the form mm-dd-yy in a string
using a pattern such as: "(?<date>\d\d-\d\d-\d\d)(?!.+\k<date>)".
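As a rough illustration of the date pattern, the sketch below applies it to a made-up sentence. The group name date, the sample text, and the names UniqueDatesDemo and FindUniqueDates are assumptions introduced for this example only.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module UniqueDatesDemo
    ' Hypothetical helper: returns each mm-dd-yy date once (its last occurrence).
    Function FindUniqueDates(source As String) As List(Of String)
        ' A date matches only if the same date does not appear again later.
        Dim pattern As String = "(?<date>\d\d-\d\d-\d\d)(?!.+\k<date>)"
        Dim dates As New List(Of String)
        For Each m As Match In Regex.Matches(source, pattern)
            dates.Add(m.Value)
        Next
        Return dates
    End Function

    Sub Main()
        ' Sample text invented for the example; 01-15-09 appears twice.
        Dim source As String = "Dates: 01-15-09, 02-20-09, 01-15-09 and 03-01-10"
        ' prints: 02-20-09 01-15-09 03-01-10
        Console.WriteLine(String.Join(" ", FindUniqueDates(source)))
    End Sub
End Module
```

The pattern checks only the shape of the text (two digits, a hyphen, and so on); it does not verify that the digits form a valid calendar date.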
With a change to the regular expression you can find all the duplicate words in
a document using: "(?<word>\b\w+\b)(?=.+\b\k<word>\b)". The positive lookahead
"(?=...)" means the word matched must be followed by another instance of itself.
This matches every occurrence of a word except its last, so a word that occurs
three times is reported twice.
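The sketch below shows this behavior on a made-up string in which "red" occurs three times, and then collapses the matches so each duplicated word is listed once. The names DuplicateWordsDemo, FindDuplicateMatches, and DistinctInOrder are hypothetical and exist only for this illustration.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module DuplicateWordsDemo
    ' Hypothetical helper: returns every word occurrence that is repeated later.
    Function FindDuplicateMatches(source As String) As List(Of String)
        Dim pattern As String = "(?<word>\b\w+\b)(?=.+\b\k<word>\b)"
        Dim found As New List(Of String)
        For Each m As Match In Regex.Matches(source, pattern)
            found.Add(m.Value)
        Next
        Return found
    End Function

    ' Hypothetical helper: removes repeats while keeping first-seen order.
    Function DistinctInOrder(items As List(Of String)) As List(Of String)
        Dim result As New List(Of String)
        For Each item As String In items
            If Not result.Contains(item) Then result.Add(item)
        Next
        Return result
    End Function

    Sub Main()
        Dim source As String = "red blue red green red blue"
        Dim matches As List(Of String) = FindDuplicateMatches(source)
        ' "red" occurs three times, so it is matched twice.
        ' prints: red blue red
        Console.WriteLine(String.Join(" ", matches))
        ' prints: red blue
        Console.WriteLine(String.Join(" ", DistinctInOrder(matches)))
    End Sub
End Module
```

Collapsing the matches afterwards is one simple option when only the list of duplicated words is needed rather than every repeated occurrence.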