A text document is a simple sequence of characters where each character is represented on a computer by an integer value. In a document, we can find patterns that are sequences of characters that we want to extract. For example, we might want to extract email addresses, phone numbers, or URLs from a document. Regular expressions are a powerful tool for extracting information from text documents.
Definition
Regular expressions are patterns that let us search text documents for specific sequences of characters. They provide a powerful language for writing rules that extract content from text documents.
Regular expressions are simple to use, and carefully written rules can be precise enough to keep false positives low. However, they do have limitations: rules must be written by hand, imperfect rules still produce false positives and false negatives, and it is hard to incorporate contextual knowledge when extracting information from a document.
In our case, we will utilize regular expressions to identify patterns within a document and extract relevant information whenever those patterns are found.
Examples of Regular Expressions
The simplest pattern is an exact match: the regular expression abc will match the sequence aaabcdddd but not the sequence aaabdddd because the exact pattern doesn’t appear in it. Another simple pattern is a choice between two sequences. For example, the regular expression (abc|bdd) will match both the sequence aaabcdddd and the sequence aaabdddd.
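The exact-match and choice patterns above can be checked with a short sketch. It assumes Python's standard `re` module and uses `re.search`, which looks for the pattern anywhere in the string; the variable names are illustrative only.

```python
import re

# re.search scans the whole string and returns a match object, or None.
exact = re.compile(r"abc")
choice = re.compile(r"(abc|bdd)")

exact_hit = exact.search("aaabcdddd")      # "abc" occurs in the text
exact_miss = exact.search("aaabdddd")      # "abc" never occurs
choice_first = choice.search("aaabcdddd")  # matched via the "abc" branch
choice_second = choice.search("aaabdddd")  # matched via the "bdd" branch

print(exact_hit.group())      # abc
print(exact_miss)             # None
print(choice_first.group())   # abc
print(choice_second.group())  # bdd
```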
An important pattern involves the wildcard symbol . that matches any single character except the newline character. For example, the regular expression with two consecutive dots a..d will match the sequence aaabcdddd but not the sequence aaabbcddd. Another common pattern involves square brackets [] that indicate a choice for a single character. For example, [abc] is equivalent to (a|b|c) and matches any one of the characters within the brackets; [a-z] is equivalent to (a|b|…|z) and matches any one character in the range a, b, …, z; and [^abc] matches any single character except a, b, or c.
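As a small illustration of the wildcard and the bracket patterns, the following sketch again assumes Python's `re` module. The string `cab-42` is a made-up sample chosen so that each character class selects a different subset of its characters.

```python
import re

# The dot matches any single character except a newline.
dot_hit = re.search(r"a..d", "aaabcdddd")    # finds "abcd"
dot_miss = re.search(r"a..d", "aaabbcddd")   # no "a", two characters, "d"

# A character class matches exactly one character from a set.
sample = "cab-42"
in_set = re.findall(r"[abc]", sample)        # only a, b, or c
in_range = re.findall(r"[a-z]", sample)      # any lowercase ASCII letter
not_in_set = re.findall(r"[^abc]", sample)   # anything except a, b, or c

print(dot_hit.group())  # abcd
print(dot_miss)         # None
print(in_set)           # ['c', 'a', 'b']
print(in_range)         # ['c', 'a', 'b']
print(not_in_set)       # ['-', '4', '2']
```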
We can use several special characters in regular expressions that are prefixed with the backslash character \:
| Character | Description | 
|---|---|
| \n | Matches the newline character |
| \t | Matches the tab character |
| \s | Matches any whitespace character |
| \S | Matches any non-whitespace character |
| \d | Matches any digit [0-9] |
| \w | Matches any ‘word’ character [a-zA-Z0-9_] |
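To see these escapes in action, here is a minimal sketch assuming Python's `re` module; the sample string is invented for the illustration. Note that in Python `\w` also matches the underscore, and that with ordinary `str` patterns `\d` and `\w` additionally match non-ASCII digits and letters unless the `re.ASCII` flag is passed.

```python
import re

# Sample text containing a tab, a newline, and a space.
text = "id_7\tok\nrow 2"

digits = re.findall(r"\d", text)       # single digits
words = re.findall(r"\w+", text)       # runs of word characters
whitespace = re.findall(r"\s", text)   # tab, newline, space
non_space = re.findall(r"\S+", text)   # runs of non-whitespace

print(digits)      # ['7', '2']
print(words)       # ['id_7', 'ok', 'row', '2']
print(whitespace)  # ['\t', '\n', ' ']
print(non_space)   # ['id_7', 'ok', 'row', '2']
```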
The real power of a regular expression comes from repetition. The following patterns tell us how many times the previous character or pattern must be repeated:
| Character | Description | 
|---|---|
| * | Matches zero or more times |
| + | Matches one or more times |
| ? | Matches zero or one times |
| {n} | Matches exactly n times |
| {n,m} | Matches at least n and up to m times |
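The repetition operators can be compared side by side with a short sketch, assuming Python's `re` module. The helper `matching` and the sample strings are hypothetical; `re.fullmatch` is used so that a pattern must account for the entire string, which makes the effect of each quantifier on the letter b easy to see.

```python
import re

samples = ["ac", "abc", "abbc", "abbbc"]

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

zero_or_more = matching(r"ab*c")       # any number of b's, including none
one_or_more = matching(r"ab+c")        # at least one b
zero_or_one = matching(r"ab?c")        # at most one b
exactly_two = matching(r"ab{2}c")      # exactly two b's
two_to_three = matching(r"ab{2,3}c")   # two or three b's

print(zero_or_more)   # ['ac', 'abc', 'abbc', 'abbbc']
print(one_or_more)    # ['abc', 'abbc', 'abbbc']
print(zero_or_one)    # ['ac', 'abc']
print(exactly_two)    # ['abbc']
print(two_to_three)   # ['abbc', 'abbbc']
```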
Example
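Consider the regular expression
[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}. Which text sequences would it match? It matches sequences shaped like email addresses: a username, the @ character, and a domain name (e.g., username@example.com). It does not match a bare handle such as a Twitter @Name, or a bare domain such as @example.com, because the pattern requires at least one character before the @ and a dot followed by at least two letters at the end.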
- [a-zA-Z0-9._-]+ will match any character (alphanumeric, in both uppercase and lowercase, and the characters . _ -) one or more times.
- @ will match the @ character.
- [a-zA-Z0-9.-]+ will match any character (alphanumeric, in both uppercase and lowercase, and the characters . -) one or more times.
- \. will match the . character.
- [a-zA-Z]{2,} will match any letter (in both uppercase and lowercase) at least two times (like com, it…).
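A minimal sketch of how this pattern could be used for extraction, assuming Python's `re` module. The name `EMAIL`, the sample text, and the addresses in it are made up for illustration. `findall` returns every non-overlapping match in the text.

```python
import re

# The email-like pattern discussed above, compiled once for reuse.
EMAIL = re.compile(r"[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Invented sample text: two addresses, one handle, one bare domain.
document = (
    "Contact alice.smith@example.com or bob_99@mail.example.org. "
    "Follow us at @Name, or visit @example.com."
)

emails = EMAIL.findall(document)
print(emails)  # ['alice.smith@example.com', 'bob_99@mail.example.org']
```

The handle @Name and the bare @example.com are skipped because nothing precedes the @. The period that ends the first sentence is also left out of the second address, since the pattern must end with a dot followed by at least two letters.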
Regular expressions offer a powerful language for crafting rules to extract content from text documents. The key benefits of using regular expressions for text extraction include their simplicity and the ability to create precise rules that minimize false positives. However, there are certain limitations to consider, such as the manual process of rule creation, the potential for false positives (due to limited syntactic structure recognition), false negatives (missed extractions), and the challenge of incorporating contextual knowledge when extracting entities.