Analyzing datasets that include email addresses can be a pain. But when armed with a few common operations, you can clean your data quickly and dive straight into the analysis. First, you should check if an email address is valid. And if it is, you can break an address into its component parts—before and after the “@” symbol—to classify users as consumers (e.g., gmail.com) or business users (e.g., microsoft.com).
The functions below show how to do these operations directly in SQL. And to help with classifying email domains, we've linked to a table of known consumer email domains in a public Github repo. Educational domains, which fit somewhere between business and consumer domains domains, are addressed as a third class of domain later in this post.
Checking if an email address if valid
There aren't universal rules that determine if an email address is valid. However, most email addresses follow some basic guidelines:
- There is one, and only one,
@
symbol. - The text before the
@
symbol—the local name—can contain letters, numbers, and special characters including hyphens, periods, pluses, and underscores. - The text after the
@
symbol—the domain name—can contain letters, numbers, hyphens, and periods.
The functions below determine if a string matches these rules. It's not always accurate, but bad results are rare. In general, it's more likely to falsely identify an invalid email as valid than classify a valid email as invalid.
Postgres and Redshift
/// QUERY ///
SELECT email,
email ~* '^[A-Za-z0-9._%\-+!#$&/=?^|~]+@[A-Za-z0-9.-]+[.][A-Za-z]+$' AS valid_email,
FROM email_table
/// RESULT ///
email | valid_email
---- | ----
jerry@gmail.com | true
gerge@vandalay.com | true
ebenes@tufts.edu | true
kramr.$@cofee_table^^ | false
MySQL
/// QUERY ///
SELECT email,
email REGEXP '^[A-Za-z0-9._%\-+!#$&/=?^|~]+@[A-Za-z0-9.-]+[.][A-Za-z]+$' AS valid_email
FROM email_table
/// RESULT ///
email | valid_email
---- | ----
jerry@gmail.com | 1
gerge@vandalay.com | 1
ebenes@tufts.edu | 1
kramr.$@cofee_table^^ | 0
Splitting an email into local and domain names
Once you've checked if an email is valid, you may want to split it into a local name and a domain name. Domain name analysis can be particularly useful. If you want to see how many companies have signed up for your product, you can often estimate it by counting distinct email domains.
Postgres and Redshift
/// QUERY ///
SELECT email,
SPLIT_PART(email,'@',1) AS user_name,
SPLIT_PART(email,'@',2) AS domain_name
FROM email_table
/// RESULT ///
email | user_name | domain_name
---- | ---- | ----
jerry@gmail.com | jerry | gmail.com
gerge@vandalay.com | gerge | vandalay.com
ebenes@tufts.edu | ebenes | tufts.edu
MySQL
/// QUERY ///
SELECT email,
SUBSTRING_INDEX(email,'@',1) AS user_name,
SUBSTRING_INDEX(email,'@',-1) AS domain_name
FROM email_table
/// RESULT ///
email | user_name | domain_name
---- | ---- | ----
jerry@gmail.com | jerry | gmail.com
gerge@vandalay.com | gerge | vandalay.com
ebenes@tufts.edu | ebenes | tufts.edu
Checking for consumer emails
Finally, you may want to classify email addresses as consumer and business domains. After all, if you're counting the number of companies who signed up for your product, you wouldn't want to count gmail.com as a company.
Unfortunately, there's no secret function for identifying consumer email addresses. Instead, you need a lookup table to match on known consumer domains. We found a handy lookup table on Github, compiled by the folks over at Mapbox. To check which email addresses are consumer addresses, LEFT JOIN
this table to your email table. (This will label .edu
email addresses as business; to classify these as education domains, you can add an additional clause to your CASE
statements that looks for domains ending in .edu
.)
Postgres and Redshift
/// QUERY ///
SELECT e.email,
CASE WHEN c.domain IS NOT NULL THEN 'consumer' ELSE 'business' END AS email_type
FROM email_table e
JOIN consumer_domains c
ON c.domain = SPLIT_PART(e.email,'@',2)
/// RESULT ///
email | email_type
---- | ----
jerry@gmail.com | consumer
gerge@vandalay.com | business
ebenes@tufts.edu | business
MySQL
/// QUERY ///
SELECT e.email,
CASE WHEN c.domain IS NOT NULL THEN 'consumer' ELSE 'business' END AS email_type
FROM email_table e
JOIN consumer_domains c
ON c.domain = SUBSTRING_INDEX(email,'@',-1)
/// RESULT ///
email | email_type
---- | ----
jerry@gmail.com | consumer
gerge@vandalay.com | business
ebenes@tufts.edu | business
We've compiled the key syntax from this post into one bite-size image. Get ready for more "datapointers!" :)