ThoughtSpot acquires Mode to define the next generation of collaborative BI >>Learn More

Additional Resources

template

Defining Metrics: A Template 
for Working With Stakeholders

Guide

10 Things Modern Business 
Intelligence Should Enable Today

video

Watch a Product Tour of Mode

Get started with Mode

Working With Email Addresses in SQL

Image of author
Benn Stancil, Co-founder & Chief Analytics Officer

September 24, 2015

3 minute read

datapointer001 parsingemails

datapointer001 parsingemailsAnalyzing datasets that include email addresses can be a pain. But when armed with a few common operations, you can clean your data quickly and dive straight into the analysis. First, you should check if an email address is valid. And if it is, you can break an address into its component parts—before and after the “@” symbol—to classify users as consumers (e.g., gmail.com) or business users (e.g., microsoft.com).

The functions below show how to do these operations directly in SQL. And to help with classifying email domains, we've linked to a table of known consumer email domains in a public Github repo. Educational domains, which fit somewhere between business and consumer domains domains, are addressed as a third class of domain later in this post.

Checking if an email address if valid

There aren't universal rules that determine if an email address is valid. However, most email addresses follow some basic guidelines:

  • There is one, and only one, @ symbol.
  • The text before the @ symbol—the local name—can contain letters, numbers, and special characters including hyphens, periods, pluses, and underscores.
  • The text after the @ symbol—the domain name—can contain letters, numbers, hyphens, and periods.

The functions below determine if a string matches these rules. It's not always accurate, but bad results are rare. In general, it's more likely to falsely identify an invalid email as valid than classify a valid email as invalid.

Postgres and Redshift

/// QUERY ///

SELECT email,
       email ~* '^[A-Za-z0-9._%\-+!#$&/=?^|~]+@[A-Za-z0-9.-]+[.][A-Za-z]+$' AS valid_email,
  FROM email_table

/// RESULT ///

email                   | valid_email
----                    | ----
jerry@gmail.com         | true
gerge@vandalay.com      | true
ebenes@tufts.edu        | true 
kramr.$@cofee_table^^   | false

MySQL

/// QUERY ///

SELECT email,
       email REGEXP '^[A-Za-z0-9._%\-+!#$&/=?^|~]+@[A-Za-z0-9.-]+[.][A-Za-z]+$' AS valid_email
  FROM email_table

/// RESULT ///

email                   | valid_email
----                    | ----
jerry@gmail.com         | 1
gerge@vandalay.com      | 1
ebenes@tufts.edu        | 1 
kramr.$@cofee_table^^   | 0

Splitting an email into local and domain names

Once you've checked if an email is valid, you may want to split it into a local name and a domain name. Domain name analysis can be particularly useful. If you want to see how many companies have signed up for your product, you can often estimate it by counting distinct email domains.

Postgres and Redshift

/// QUERY ///

SELECT email,
       SPLIT_PART(email,'@',1) AS user_name,
       SPLIT_PART(email,'@',2) AS domain_name
  FROM email_table

/// RESULT ///

email                   | user_name   | domain_name
----                    | ----        | ----
jerry@gmail.com         | jerry       | gmail.com
gerge@vandalay.com      | gerge       | vandalay.com
ebenes@tufts.edu        | ebenes      | tufts.edu

MySQL

/// QUERY ///

SELECT email,
       SUBSTRING_INDEX(email,'@',1) AS user_name,
       SUBSTRING_INDEX(email,'@',-1) AS domain_name
  FROM email_table

/// RESULT ///

email                   | user_name   | domain_name
----                    | ----        | ----
jerry@gmail.com         | jerry       | gmail.com
gerge@vandalay.com      | gerge       | vandalay.com
ebenes@tufts.edu        | ebenes      | tufts.edu

Checking for consumer emails

Finally, you may want to classify email addresses as consumer and business domains. After all, if you're counting the number of companies who signed up for your product, you wouldn't want to count gmail.com as a company.

Unfortunately, there's no secret function for identifying consumer email addresses. Instead, you need a lookup table to match on known consumer domains. We found a handy lookup table on Github, compiled by the folks over at Mapbox. To check which email addresses are consumer addresses, LEFT JOIN this table to your email table. (This will label .edu email addresses as business; to classify these as education domains, you can add an additional clause to your CASE statements that looks for domains ending in .edu.)

Postgres and Redshift

/// QUERY ///

SELECT e.email,
       CASE WHEN c.domain IS NOT NULL THEN 'consumer' ELSE 'business' END AS email_type
  FROM email_table e
  JOIN consumer_domains c
    ON c.domain = SPLIT_PART(e.email,'@',2)

/// RESULT ///

email                   | email_type
----                    | ----    
jerry@gmail.com         | consumer
gerge@vandalay.com      | business
ebenes@tufts.edu        | business

MySQL

/// QUERY ///

SELECT e.email,
       CASE WHEN c.domain IS NOT NULL THEN 'consumer' ELSE 'business' END AS email_type
  FROM email_table e
  JOIN consumer_domains c
    ON c.domain = SUBSTRING_INDEX(email,'@',-1)

/// RESULT ///

email                   | email_type
----                    | ----    
jerry@gmail.com         | consumer
gerge@vandalay.com      | business
ebenes@tufts.edu        | business

We've compiled the key syntax from this post into one bite-size image. Get ready for more "datapointers!" :)

Datapointer #1 - Parsing Email Addresses

Get our weekly data newsletter

Work-related distractions for data enthusiasts.

Additional Resources

template

Defining Metrics: A Template 
for Working With Stakeholders

Guide

10 Things Modern Business 
Intelligence Should Enable Today

video

Watch a Product Tour of Mode

Get started with Mode