how to EXTRACT EMAIL addresses
Sometimes you have exported data from your website or business software
containing order information or customer details.
You may have only needed the email address and order date.
One way is to import all the data into Excel, delete the unwanted columns
and export the remaining ones.
This may not work well if the email field also contains the email address description,
for example: “Dave Martin <davemartin@bogusemail.com>”.
It can be cumbersome if you have to repeat the task multiple times
or if you have to explain all the steps to someone else.
Extract the desired data using “regex”
A regular expression (shortened as “regex” or “regexp”),
is a sequence of characters that specifies a matching pattern in text.
A very simple case is to locate a word spelled two different ways in a text editor,
the regular expression seriali[sz]e
matches both “serialise” and “serialize”.
A more complex situation is the syntax for identifying in the text
-
an email addresses:
[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+
source: stackoverflow - regex extract email from strings -
a date:
\d{4}-\d{2}-\d{1,2}
source: stackoverflow - regex for extracting date from string
Regular Expressions (Regex) Tutorial
Recommended YouTube video
“38 mins well spent, totally worth it” :
How to Match Any Pattern of Text
(from minute 25 the syntax for extracting email addresses is explained)
Cheat sheet for using regular expressions
RegExr online tool
Regular expressions are generally accepted
within advanced text editors like Notepad++ or Atom.
Free online tools are also available, one of them is:
https://regexr.com - an online service to learn, build & test Regular Expressions.
Web interface explanation:
“Expression” is the field that contains the regex syntax.
“Text” is the content you want to analyze.
“Tools > List” will show the results of the extraction.
Example 1: to extract only the email address
Expression:
[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+
Text:
Dave Martin
615-555-7164
173 Main St., Springfield RI 55924
davemartin@bogusemail.com
Charles Harris
800-555-5669
969 High St., Atlantis VA 34075
charlesharris@bogusemail.com
Eric Williams
560-555-5153
806 1st St., Faketown AK 86847
laurawilliams@bogusemail.com
Tools > List:
$&\n
Result:
davemartin@bogusemail.com
charlesharris@bogusemail.com
laurawilliams@bogusemail.com
Example 2: to extract the email address and the date
Expression:
","(.*?)([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)(.*?)",".*",(\d{2}\.\d{2}\.\d{4})
Text:
"lorem ipsum dolor sit amet","Robert Farrell <rmfarrell@bogusemail.com>","",02.01.2024, ,5379,
"consectetur adipiscing elit","""Mesa, Rene <rmesa@bogusemail.com>""","",04.01.2024, ,20826,
"sed do eiusmod tempor incididunt","Antonio Bugan <antonio@bogusemail.com>","",04.01.2024, ,2856,
"ut labore et dolore magna aliqua","Crawley Down Tennis Club <hello@bogusemail.com>","",05.01.2024, ,4453,
Tools > List:
$2,$4\n
Result:
rmfarrell@bogusemail.com,02.01.2024
rmesa@bogusemail.com,04.01.2024
antonio@bogusemail.com,04.01.2024
hello@bogusemail.com,05.01.2024
Cheat sheet for using regular expressions
. - Any Character Except New Line
\d - Digit (0-9)
\D - Not a Digit (0-9)
\w - Word Character (a-z, A-Z, 0-9, _)
\W - Not a Word Character
\s - Whitespace (space, tab, newline)
\S - Not Whitespace (space, tab, newline)
\b - Word Boundary
\B - Not a Word Boundary
^ - Beginning of a String
$ - End of a String
[] - Matches Characters in brackets
[^ ] - Matches Characters NOT in brackets
| - Either Or
( ) - Group
Quantifiers:
* - 0 or More
+ - 1 or More
? - 0 or One
{3} - Exact Number
{3,4} - Range of Numbers (Minimum, Maximum)
source: github code snippets