Email Validation using Regular Expressions (the Right Way)

UPDATE: This article was updated on February 1st, 2008 to account for domain literals and quoted strings such as “John Smith” <john.smith@somewhere.com>. It is now effectively the only complete and semantically correct email validator for Java.

PETTY REQUEST: The update required considerably more effort than the original, as it now accounts for all valid RFC parsing conditions. Because of this, and because this page is easily my most visited, I’d appreciate it if you could show your appreciation by hooking a brother up and clicking on some ads. It helps pay for my hosting. Thanks!

In object-oriented design, I’m a firm believer in modeling things the way they truly exist (inasmuch as that is possible given abstraction and time restrictions). So, whenever I design a system’s domain model, I create classes that represent entities as they exist in real life. Over time, I’ve accrued a nice library of classes that I reuse in a lot of projects.

For example, I don’t save or reference an email address as a String: strings as objects don’t tell me anything about the email address itself, such as whether it’s valid, whether it’s bouncing, or whether it has been verified by the user with which it is associated. As such, I have created an EmailAddress class to represent this information. Doing this is a small example of the beauty of OO over functional programming.
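To make the idea concrete, here is a minimal sketch of what such a value object might look like. It is not my actual class: the name EmailAddressSketch, the verified and bouncing fields, and the deliberately naive placeholder pattern are all assumptions for illustration only.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of an email-address value object (illustration only).
class EmailAddressSketch {

    // Placeholder check (assumption): something@something, with no whitespace.
    // A real implementation would plug in an RFC 2822 based pattern instead.
    private static final Pattern PLACEHOLDER_PATTERN = Pattern.compile("[^@\\s]+@[^@\\s]+");

    private final String text;   // the raw address as entered
    private boolean verified;    // has the owning user confirmed it?
    private boolean bouncing;    // is mail sent to it currently bouncing?

    EmailAddressSketch(String text) {
        this.text = text;
    }

    String getText() { return text; }

    // Format check only; says nothing about whether the mailbox exists.
    boolean isValid() {
        return text != null && PLACEHOLDER_PATTERN.matcher(text).matches();
    }

    boolean isVerified() { return verified; }
    void setVerified(boolean verified) { this.verified = verified; }

    boolean isBouncing() { return bouncing; }
    void setBouncing(boolean bouncing) { this.bouncing = bouncing; }

    public static void main(String[] args) {
        EmailAddressSketch address = new EmailAddressSketch("john.smith@somewhere.com");
        address.setVerified(true);
        System.out.println(address.isValid() + " " + address.isVerified() + " " + address.isBouncing());
    }
}
```

The point of the sketch is only that state such as "verified" or "bouncing" has an obvious home once the address is its own type; a bare String has nowhere to put it.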

Anyway, I had been a little lax with my validation logic in the past. On my most recent project, I was determined to get things right once and for all.

I googled quite a while for the Right Way to validate an email address. In my opinion, there is only one Right Way – the RFC 2822 way. This is the standard after all.

I never came across anything I was happy with. All the responses seemed to be Perl or PHP variants of regular expressions, or some horribly convoluted text string that was nearly impossible to decipher. I was disappointed to see so many interpretations of a standard. I mean, c’mon people, it’s written in pure black and white!!!

I guess the old adage “If you want something done right, you’ve got to do it yourself” resonated in my head this time. I actually took the time out to read the RFC (something I hadn’t done in a long while, probably since college).

After reading the RFC, I translated the grammar into usable, *readable* source code that now resides in my EmailAddress class, and I’ve included it below for the benefit of anyone that wishes to use it. It is written in Java, but the same code could be replicated in C# or PHP or whatever. Just keep it clean!

N.B.: Look at the first two constants, ALLOW_DOMAIN_LITERALS and ALLOW_QUOTED_IDENTIFIERS, and enable or disable them as you see fit for your application.

/*
* Copyright 2008 Les Hazlewood
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

/**
* This constant states that domain literals are allowed in the email address, e.g.:
*
* someone@[192.168.1.100] or
* john.doe@[23:33:A2:22:16:1F] or
* me@[my computer]
*
* The RFC says these are valid email addresses, but most people don't like allowing them.
* If you don't want to allow them, and only want to allow valid domain names
* (RFC 1035, x.y.z.com, etc),
* change this constant to false.
*
* Its default value is true to remain RFC 2822 compliant, but
* you should set it depending on what you need for your application.
*/
private static final boolean ALLOW_DOMAIN_LITERALS = true;

/**
* This constant states that quoted identifiers
* (using quotes and angle brackets around the raw address) are allowed, e.g.:
*
* "John Smith" <john.smith@somewhere.com>
*
* The RFC says this is a valid mailbox. If you don't want to
* allow this because, for example, you only want users to enter
* a raw address (john.smith@somewhere.com - no quotes or angle
* brackets), then change this constant to false.
*
* Its default value is true to remain RFC 2822 compliant, but
* you should set it depending on what you need for your application.
*/
private static final boolean ALLOW_QUOTED_IDENTIFIERS = true;

// RFC 2822 2.2.2 Structured Header Field Bodies
private static final String wsp = "[ \\t]"; //space or tab
private static final String fwsp = wsp + "*";

//RFC 2822 3.2.1 Primitive tokens
private static final String dquote = "\\\"";
//ASCII Control characters excluding white space:
private static final String noWsCtl = "\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F";
//all ASCII characters except CR and LF:
private static final String asciiText = "[\\x01-\\x09\\x0B\\x0C\\x0E-\\x7F]";

// RFC 2822 3.2.2 Quoted characters:
//single backslash followed by a text char
private static final String quotedPair = "(\\\\" + asciiText + ")";

//RFC 2822 3.2.4 Atom:
private static final String atext = "[a-zA-Z0-9\\!\\#\\$\\%\\&\\'\\*\\+\\-\\/\\=\\?\\^\\_\\`\\{\\|\\}\\~]";
private static final String atom = fwsp + atext + "+" + fwsp;
private static final String dotAtomText = atext + "+" + "(" + "\\." + atext + "+)*";
private static final String dotAtom = fwsp + "(" + dotAtomText + ")" + fwsp;

//RFC 2822 3.2.5 Quoted strings:
//noWsCtl and the rest of ASCII except the doublequote and backslash characters:
private static final String qtext = "[" + noWsCtl + "\\x21\\x23-\\x5B\\x5D-\\x7E]";
private static final String qcontent = "(" + qtext + "|" + quotedPair + ")";
private static final String quotedString = dquote + "(" + fwsp + qcontent + ")*" + fwsp + dquote;

//RFC 2822 3.2.6 Miscellaneous tokens
private static final String word = "((" + atom + ")|(" + quotedString + "))";
private static final String phrase = word + "+"; //one or more words.

//RFC 1035 tokens for domain names:
private static final String letter = "[a-zA-Z]";
private static final String letDig = "[a-zA-Z0-9]";
private static final String letDigHyp = "[a-zA-Z0-9-]";
private static final String rfcLabel = letDig + "(" + letDigHyp + "{0,61}" + letDig + ")?";
private static final String rfc1035DomainName = rfcLabel + "(\\." + rfcLabel + ")*\\." + letter + "{2,6}";

//RFC 2822 3.4 Address specification
//domain text - non white space controls and the rest of ASCII chars not including [, ], or \:
private static final String dtext = "[" + noWsCtl + "\\x21-\\x5A\\x5E-\\x7E]";
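//dcontent is grouped so that its alternation stays local when embedded below;
//RFC 2822 defines a domain literal as "[" *([FWS] dcontent) [FWS] "]":
private static final String dcontent = "(" + dtext + "|" + quotedPair + ")";
private static final String domainLiteral = "\\[" + "(" + fwsp + dcontent + ")*" + fwsp + "\\]";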
private static final String rfc2822Domain = "(" + dotAtom + "|" + domainLiteral + ")";

private static final String domain = ALLOW_DOMAIN_LITERALS ? rfc2822Domain : rfc1035DomainName;

private static final String localPart = "((" + dotAtom + ")|(" + quotedString + "))";
private static final String addrSpec = localPart + "@" + domain;
private static final String angleAddr = "<" + addrSpec + ">";
private static final String nameAddr = "(" + phrase + ")?" + fwsp + angleAddr;
private static final String mailbox = nameAddr + "|" + addrSpec;

//now compile a pattern for efficient re-use (requires: import java.util.regex.Pattern;),
//depending on whether we're allowing quoted identifiers or not:
private static final String patternString = ALLOW_QUOTED_IDENTIFIERS ? mailbox : addrSpec;
public static final Pattern VALID_PATTERN = Pattern.compile(patternString);

Anyway, the above Java code allows you to do things like the following.

In the EmailAddress class, you can have a method:

public static boolean isValid( String userEnteredEmailString ) {
return VALID_PATTERN.matcher( userEnteredEmailString ).matches();
}

Then you can write validation logic wherever you want (hopefully in a dedicated Validator ;) ):

if ( !EmailAddress.isValid( userEnteredEmailString ) ) {
throw new InvalidFormatException( "Invalid e-mail format!" );
}

Better yet, if you want to see if any email address instance is valid, the EmailAddress class has the following method that you can use for ‘pure’ OO ‘messaging’ (i.e. a method invoked on an object is a ‘message’ from the calling object to the target object):

public boolean isValid() {
//use static method call as helper w/ class attribute 'text'
return isValid( getText() );
}

which enables you to do checks this way (this is ‘pure’ OO):

if ( anEmailAddressInstance.isValid() ) {
//do something
} else {
//do something else
}
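If you want to experiment with how the string constants in the listing compose, here is a small, self-contained sketch that isolates just the dot-atom rule (RFC 2822 3.2.4). The atext and dotAtomText definitions are copied from the listing; the class name DotAtomDemo and the isDotAtom helper are hypothetical names introduced only for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: isolates the dot-atom rule from the larger grammar.
class DotAtomDemo {

    // Same definitions as in the listing above (RFC 2822 3.2.4):
    private static final String atext =
        "[a-zA-Z0-9\\!\\#\\$\\%\\&\\'\\*\\+\\-\\/\\=\\?\\^\\_\\`\\{\\|\\}\\~]";
    private static final String dotAtomText = atext + "+" + "(" + "\\." + atext + "+)*";

    private static final Pattern DOT_ATOM = Pattern.compile(dotAtomText);

    // True if the whole input is one or more atext runs separated by single dots.
    static boolean isDotAtom(String candidate) {
        return DOT_ATOM.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "john.smith", "john..smith", ".john", "john." };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isDotAtom(sample));
        }
    }
}
```

Because each rule is an ordinary String, you can compile and test any intermediate piece this way before trusting the full pattern.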

Happy validating!

60 thoughts on “Email Validation using Regular Expressions (the Right Way)”

  1. MARKETING-CALLS@NO-MEDIA-OR-RHINOMARKETING-CALLSCONTROLALTDELETE.CO.ZA

    evaluates fine.

    MARKETING-CALLS@NO-MEDIA-OR-RHINOMARKETING-CALLSCONTROLALT.DELETENO-MEDIA-OR-RHINOMARKETING-CALLSCONTROLALTDELETE.CO.ZA

    even faster ;)

    the problem lies with the LocalPart and its length. Not with the domain.

    I believe the LocalPart can be broken into parts+domain, and if the parts evaluate fine, so will the whole, and performance of the evaluation will increase.

    Not sure what the criteria would be for breaking the LocalPart in chunks.

    NO-MEDIA-OR-RHINOMARKETING-CALLS@CONTROLALTDELETE.CO.ZA

    could be split in
    NO-MEDIA-OR-RHINO@CONTROLALTDELETE.CO.ZA
    and
    MARKETING-CALLS@CONTROLALTDELETE.CO.ZA

    if both evaluate fine then the whole is fine.

    I guess where to put the split is not trivial with the more exotic LocalParts.

  2. I am happy to find much useful information in the post, writing sequence is awesome, I always look for quality content, thanks for sharing.

  3. Pingback: Validating Email Address in Web Forms – The Hazards of Complexity : Ben Gross, PhD

  4. excellent stuff. Do you have an RSS feed? And also will it be cool if I added in your feed to a blog of mine? I have a website that pulls content via RSS feeds via a several websites and I’d like to include yours, most folks do not mind considering I link back and everything but I like to get authorization first. Anyway let me know if you can, thanks.

  5. > 3.4.1. Addr-spec specification
    > Comments and folding white space SHOULD NOT be used
    > around the “@” in the addr-spec

    otherwise the email address ‘ a @b.com’ is valid.

    RFC 2822 is the Internet Message Format, and section ‘3.4.1. Addr-spec specification’ is the most relevant for the purpose of your work here. I don’t quite understand why your code needs to refer to RFC 2822 2.2.2 Structured Header Field Bodies. Could you please explain? Thanks.

    It seems that the regular expression is incorrect. Not in the sense that it does not follow the RFC, but due to its catastrophic backtracking. See http://www.regular-expressions.info/catastrophic.html for more information about runaway expressions.

    Try the following code, you will see that evaluation takes twice as long for every added character.

    String email = "";
    for (int i = 0; i < 30; i++)
    {
    long startTime = System.currentTimeMillis();
    VALID_PATTERN.matcher(email).matches();
    System.out.println(String.format("String length: %02d Time: %s milliseconds", email.length(), System.currentTimeMillis() - startTime));
    email += "a";
    }

  7. Hi Robert,

    You’re right – this implementation has not been revised for improved performance. I wrote it to 1) find a correct validator first and 2) then iterate on it for performance.

    A very simple way of making this a lot faster is to split on the ‘@’ character and use two separate expressions for the local part and the domain. I’m sure there are other techniques for making this even faster – if you have any recommendations, I’d love to hear them.

    Cheers,

    Les
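The split-on-‘@’ idea from the last reply could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not a drop-in replacement for the full pattern: it handles only dot-atom local parts and RFC 1035 style domain names (no quoted strings, display names, or domain literals), and the class name SplitValidatorSketch is hypothetical. The atext and label definitions are taken from the listing in the article.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: validate local part and domain with two separate patterns.
class SplitValidatorSketch {

    // atext and the RFC 1035 label rule, as defined in the article's listing:
    private static final String atext =
        "[a-zA-Z0-9\\!\\#\\$\\%\\&\\'\\*\\+\\-\\/\\=\\?\\^\\_\\`\\{\\|\\}\\~]";
    private static final String rfcLabel = "[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?";

    // Assumption: only dot-atom local parts are supported here.
    private static final Pattern LOCAL_PART =
        Pattern.compile(atext + "+(\\." + atext + "+)*");
    // Assumption: only plain domain names are supported here.
    private static final Pattern DOMAIN =
        Pattern.compile(rfcLabel + "(\\." + rfcLabel + ")*\\.[a-zA-Z]{2,6}");

    static boolean isValid(String email) {
        if (email == null) {
            return false;
        }
        // Split on the last '@'; both sides must be non-empty.
        int at = email.lastIndexOf('@');
        if (at <= 0 || at == email.length() - 1) {
            return false;
        }
        return LOCAL_PART.matcher(email.substring(0, at)).matches()
            && DOMAIN.matcher(email.substring(at + 1)).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("NO-MEDIA-OR-RHINOMARKETING-CALLS@CONTROLALTDELETE.CO.ZA"));
        System.out.println(isValid("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"));
    }
}
```

An input with no ‘@’ is rejected before any regular expression runs, and each of the two patterns is much smaller than the combined one, so there is far less room for backtracking to go wrong.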
