Java Email Address class

UPDATE: This Java class was updated on February 1st, 2008 to account for domain literals and quoted strings such as “John Smith” <john.smith@somewhere.com>. It is now effectively the only complete and semantically correct email validator for Java.

PETTY REQUEST: The update required considerably more effort than the original because it now accounts for all valid RFC parsing conditions. Given that, and since this page is easily my most visited, I’d appreciate it if you could show your appreciation by hooking a brother up and clicking on some ads. It helps pay for my hosting. Thanks!

I’m writing a new open source CMS (Community Management System) based entirely on the Spring Framework and Hibernate.

This CMS will take many cues from Drupal and other PHP based CMS systems, but be far more flexible, have fantastic OO and design pattern architectures and be a benefit to the Java Enterprise software community. I’ve decided to abandon JSR-168 support – I want a much cleaner, easier to implement, OO-based and typesafe “plugin” support framework – which I’m working on now. Suffice it to say I’m not a fan of JSR-168, but that’s a whole ‘nuther post.

Anyway, in this CMS, I’m using time-honored OO classes I’ve used on many, many projects. One of them is the EmailAddress class that I’ve referenced in earlier posts in this blog for email address validation. I’ve gotten some good feedback on this class, so I thought I’d just post the whole thing in case anyone wants to benefit from it (instead of just using the code chunks I’ve posted before).

Here it is:

/*
* Copyright 2008 Les Hazlewood
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

import java.io.Serializable;
import java.util.regex.Pattern;

/**
* An email address represents the textual string of an
* RFC 2822 email address and other corresponding
* information of interest.
*
* If you use this code, please keep the author information intact and reference
* my site at leshazlewood.com. Thanks!
*
* @author Les Hazlewood
*/
public class EmailAddress implements Serializable {

/**
* This constant states that domain literals are allowed in the email address, e.g.:
*
* someone@[192.168.1.100] or
* john.doe@[23:33:A2:22:16:1F] or
* me@[my computer]
*
* The RFC says these are valid email addresses, but most people don't like allowing them.
* If you don't want to allow them, and only want to allow valid domain names
* (RFC 1035, x.y.z.com, etc),
* change this constant to false.
*
* Its default value is true to remain RFC 2822 compliant, but
* you should set it depending on what you need for your application.
*/
private static final boolean ALLOW_DOMAIN_LITERALS = true;

/**
* This constant states that quoted identifiers
* (using quotes and angle brackets around the raw address) are allowed, e.g.:
*
* "John Smith" <john.smith@somewhere.com>
*
* The RFC says this is a valid mailbox. If you don't want to
* allow this, because for example, you only want users to enter in
* a raw address (john.smith@somewhere.com - no quotes or angle
* brackets), then change this constant to false.
*
* Its default value is true to remain RFC 2822 compliant, but
* you should set it depending on what you need for your application.
*/
private static final boolean ALLOW_QUOTED_IDENTIFIERS = true;

// RFC 2822 2.2.2 Structured Header Field Bodies
private static final String wsp = "[ \\t]"; //space or tab
private static final String fwsp = wsp + "*";

//RFC 2822 3.2.1 Primitive tokens
private static final String dquote = "\\\"";
//ASCII Control characters excluding white space:
private static final String noWsCtl = "\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F";
//all ASCII characters except CR and LF:
private static final String asciiText = "[\\x01-\\x09\\x0B\\x0C\\x0E-\\x7F]";

// RFC 2822 3.2.2 Quoted characters:
//single backslash followed by a text char
private static final String quotedPair = "(\\\\" + asciiText + ")";

//RFC 2822 3.2.4 Atom:
private static final String atext = "[a-zA-Z0-9\\!\\#\\$\\%\\&\\'\\*\\+\\-\\/\\=\\?\\^\\_\\`\\{\\|\\}\\~]";
private static final String atom = fwsp + atext + "+" + fwsp;
private static final String dotAtomText = atext + "+" + "(" + "\\." + atext + "+)*";
private static final String dotAtom = fwsp + "(" + dotAtomText + ")" + fwsp;

//RFC 2822 3.2.5 Quoted strings:
//noWsCtl and the rest of ASCII except the doublequote and backslash characters:
private static final String qtext = "[" + noWsCtl + "\\x21\\x23-\\x5B\\x5D-\\x7E]";
private static final String qcontent = "(" + qtext + "|" + quotedPair + ")";
private static final String quotedString = dquote + "(" + fwsp + qcontent + ")*" + fwsp + dquote;

//RFC 2822 3.2.6 Miscellaneous tokens
private static final String word = "((" + atom + ")|(" + quotedString + "))";
private static final String phrase = word + "+"; //one or more words.

//RFC 1035 tokens for domain names:
private static final String letter = "[a-zA-Z]";
private static final String letDig = "[a-zA-Z0-9]";
private static final String letDigHyp = "[a-zA-Z0-9-]";
private static final String rfcLabel = letDig + "(" + letDigHyp + "{0,61}" + letDig + ")?";
private static final String rfc1035DomainName = rfcLabel + "(\\." + rfcLabel + ")*\\." + letter + "{2,6}";

//RFC 2822 3.4 Address specification
//domain text - non white space controls and the rest of ASCII chars not including [, ], or \:
private static final String dtext = "[" + noWsCtl + "\\x21-\\x5A\\x5E-\\x7E]";
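//dcontent is grouped (like qcontent) so the alternation cannot leak into the surrounding expression:
private static final String dcontent = "(" + dtext + "|" + quotedPair + ")";
private static final String domainLiteral = "\\[" + "(" + fwsp + dcontent + ")*" + fwsp + "\\]";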
private static final String rfc2822Domain = "(" + dotAtom + "|" + domainLiteral + ")";

private static final String domain = ALLOW_DOMAIN_LITERALS ? rfc2822Domain : rfc1035DomainName;

private static final String localPart = "((" + dotAtom + ")|(" + quotedString + "))";
private static final String addrSpec = localPart + "@" + domain;
private static final String angleAddr = "<" + addrSpec + ">";
private static final String nameAddr = "(" + phrase + ")?" + fwsp + angleAddr;
private static final String mailbox = nameAddr + "|" + addrSpec;

//now compile a pattern for efficient re-use:
//if we're allowing quoted identifiers or not:
private static final String patternString = ALLOW_QUOTED_IDENTIFIERS ? mailbox : addrSpec;
public static final Pattern VALID_PATTERN = Pattern.compile(patternString);

//class attributes
private String text;
private boolean bouncing = false;
private boolean verified = false;
private String label;

public EmailAddress() {
super();
}

public EmailAddress(String text) {
super();
setText(text);
}

/**
* Returns the actual email address string, e.g. someone@somewhere.com
*
* @return the actual email address string.
*/
public String getText() {
return text;
}

public void setText(String text) {
this.text = text;
}

/**
* Returns whether or not any emails sent to this email address come back as bounced
* (undeliverable).
*
* Default is false for convenience's sake - if a bounced message is ever received for this
* address, this value should be set to true until verification can be made.
*
* @return whether or not any emails sent to this email address come back as bounced
* (undeliverable).
*/
public boolean isBouncing() {
return bouncing;
}

public void setBouncing(boolean bouncing) {
this.bouncing = bouncing;
}

/**
* Returns whether or not the party associated with this email has verified that it is
* their email address.
*
* Verification is usually done by sending an email to this
* address and waiting for the party to respond or click a specific link in the email.
*
* Default is false.
*
* @return whether or not the party associated with this email has verified that it is
* their email address.
*/
public boolean isVerified() {
return verified;
}

public void setVerified(boolean verified) {
this.verified = verified;
}

/**
* Party label associated with this address, for example, 'Home', 'Work', etc.
*
* @return a label associated with this address, for example 'Home', 'Work', etc.
*/
public String getLabel() {
return label;
}

public void setLabel(String label) {
this.label = label;
}

/**
* Returns whether or not the text represented by this object instance is valid
* according to the RFC 2822 rules.
*
* @return true if the text represented by this instance is valid according
* to RFC 2822, false otherwise.
*/
public boolean isValid() {
return isValidText(getText());
}

/**
* Utility method that checks to see if the specified string is a valid
* email address according to the RFC 2822 specification.
*
* @param email the email address string to test for validity.
* @return true if the given text is valid according to RFC 2822, false otherwise.
*/
public static boolean isValidText(String email) {
return (email != null) && VALID_PATTERN.matcher(email).matches();
}

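public boolean equals(Object o) {
if (o == this) {
return true;
}
if (o instanceof EmailAddress) {
EmailAddress ea = (EmailAddress) o;
//null-safe: the no-arg constructor leaves the text unset
return text == null ? ea.getText() == null : text.equals(ea.getText());
}
return false;
}

public int hashCode() {
//null-safe for the same reason as equals
return text == null ? 0 : text.hashCode();
}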

public String toString() {
return getText();
}

public static void main(String[] args) {
String addy = "\"John Smith\" <john.smith@somewhere.com>";
if (isValidText(addy)) {
System.out.println("Valid email address.");
} else {
System.out.println("Invalid email address!");
}
}
}

42 thoughts on “Java Email Address class”

  1. Pingback: Les Hazlewood » Java Email Address Validation using Regular Expressions (the Right Way)

  2. Hi Les,

    I am doing a project for one of my courses at University, and I was thinking of using parts of your email class. While looking through your specification of the RFC2822 regular expression, I noticed a few parts missing. What happened to the “quoted-string” and “domain-literal” identifiers? Also, could you quickly explain what the “^” and “$” at the start and end of your final regular expression do?

    Thanks for your help,

    Sebastian

  3. Hi Sebastian,

    To be honest, I only incorporated RFC 2822 for standard email text addresses (without quoted text). The domain-literals are represented by the RFC 1035 domain tokens in the class. Since you pointed this out, I’m now working on including the quoted-string tokens in the email address class’s final regular expression. Thanks for pointing that out!

    Also, the ^ character, when not inside a character class, means “the stuff after me must start the string”. So in the final regexp in the class, it means “match all strings where the beginning of the string is the localpart”. Similarly, the $ character means “the stuff before me must finish the string”. So for the final regexp, it means “match all strings where the end of the string is the domain”. Putting the two together in the same regexp means “the string must start with a localpart and must end with a domain”, with, of course, those two being separated by the @ character.

    Cheers,

    Les
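
The anchoring behaviour described in the reply above can be sketched as follows. This is only an illustration: `AnchorDemo`, its method names and the `SIMPLE` pattern are hypothetical and deliberately much simpler than the RFC 2822 expression in the class. It also shows that `Matcher.matches()` already requires the whole input to match, so explicit anchors matter mainly when `find()` is used.

```java
import java.util.regex.Pattern;

public class AnchorDemo {

    // Deliberately simplified address pattern for illustration only;
    // it is NOT the RFC 2822 expression built in the EmailAddress class.
    private static final String SIMPLE = "[a-z0-9.]+@[a-z0-9.]+";

    // Unanchored: find() succeeds if the pattern occurs anywhere in the input.
    public static boolean containsAddress(String s) {
        return Pattern.compile(SIMPLE).matcher(s).find();
    }

    // Anchored with ^ and $: find() succeeds only if the pattern starts at the
    // beginning of the input and runs to its end.
    public static boolean isAnchoredAddress(String s) {
        return Pattern.compile("^" + SIMPLE + "$").matcher(s).find();
    }

    // matches() always requires the entire input to match, so no anchors are needed.
    public static boolean matchesWhole(String s) {
        return Pattern.compile(SIMPLE).matcher(s).matches();
    }

    public static void main(String[] args) {
        String padded = "mail me at bob@example.com please";
        System.out.println(containsAddress(padded));          // true
        System.out.println(isAnchoredAddress(padded));        // false
        System.out.println(matchesWhole(padded));             // false
        System.out.println(matchesWhole("bob@example.com"));  // true
    }
}
```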

  4. Hi Les,

    I noticed a bug in your class. Testing the email host.@domain.com will return as being valid. The reason for this I believe is due to your use of the raw symbols in the “sp” token definition. If you change the symbols to the ascii hex characters, it works better!

    Regards,

    Sebastian

  5. @Sebastian

    Thanks! I thought something was funny. Instead, I just escaped each of the characters in the ‘sp’ constant in the file (using double backslashes). The blog entry has been updated with the change for future reference.

    Cheers,

    Les

  6. Hi Les,

    According to the spec, shouldn’t the email address:
    “blah”@blah.com
    be allowed?

    Thanks,

    Matt

  7. Hi Matt,

    The latest update to this blog adds support for quoted strings and domain literals, properly validating “blah”@blah.com as valid.

    Cheers,

    Les

  8. Very nice code, cheers.

    One thing I noticed though is that using the code supplied, the sample string keeps coming back invalid for me:

    String addy = "\"John Smith\" ";

    this is after setting ALLOW_DOMAIN_LITERALS and ALLOW_QUOTED_IDENTIFIERS both to true.

    Not sure why this is, and it is not relevant for my app (since this format is not allowed) but either there is a bug or I messed something up in the copy/paste :)

  9. Hi – thanks, great work, very glad to see it.

    Might be worth mentioning in your post above that the parser does not include the “obsolete” parts of the address syntax which are a part of 2822, and, according to sec 4, “MUST be accepted and parsed by a conformant receiver”.

    Unless I’m misunderstanding something.

    -c

  10. Hi, some further notes.

    I noticed that CFWS is not included in your parser. We needed that, since we’re doing checking of addresses out of emails, so I went ahead and (partially) implemented it. I’m new to this stuff, so if you have the time to review the code I’d greatly appreciate it. I may well have made some mistakes!! Of course you are welcome to include it in your own code, if you wish. I tested it on ~1700 real-world addresses and there weren’t any false-negatives. Didn’t check for false-positives yet.

    I say “partially” because under 2822, comments in CFWS are allowed to nest, but the structure of strings inside of strings doesn’t allow this. So only one-level comments are possible. E.g. the valid address:

    “Bob Smith” (Bob Smith)

    works, but the valid address:

    “Bob Smith” (Bob (the man) Smith)

    won’t. Not a deal breaker for us. :-)

    I also added a flag to permit a “.” in unquoted text, e.g., allowed:
    Superstore.com

    I also added a flag to permit “[" and "]” in the same place but I turned it off because it seemed to cause an extremely long delay in the parsing.

    What I did (maybe I should have made CFWS handling a switch, but I didn’t):

    Added:

    /**
    * This constant allows “.” to appear in atext.
    *
    * The address:
    * Kayaks.org
    * …is not valid. It should be:
    * “Kayaks.org”
    *
    * If this boolean is set to false, the parser will act per 2822 and will require
    * the quotes; if set to true, it will allow this.
    */
    private static final boolean ALLOW_DOT_IN_ATEXT = true;

    /**
    * This constant allows “[” and “]” to appear in atext.
    *
    * The address:
    * [Kayaks]
    * …is not valid. It should be:
    * “[Kayaks]”
    *
    * If this boolean is set to false, the parser will act per 2822 and will require
    * the quotes; if set to true, it will allow this.
    *
    * WARNING: This may be a bug, but it seems like this can cause the parser to hang
    * for a while before completing (apparently accurately), e.g. on the corrupted address string:
    *
    * Bob Smith [mailto:bob@gmail.com]=20
    */
    private static final boolean ALLOW_SQUARE_BRACKETS_IN_ATEXT = false;

    [note: this section is added just after section 3.2.2]

    // RFC 2822 3.2.3 CFWS specification
    // note: nesting should be permitted but is not by these rules given code limitations:
    private static final String ctext = "[" + noWsCtl + "\\x21-\\x27\\x2A-\\x5B\\x5D-\\x7E]";
    private static final String ccontent = ctext + "|" + quotedPair; // + "|" + comment;
    private static final String comment = "\\((" + fwsp + ccontent + "+)*" + fwsp + "\\)";
    private static final String cfws = "(" + fwsp + comment + "+)*((" + fwsp + comment +
    "+)|" + fwsp + "+)+";

    [The following lines already existed, but they were modified. Shown here in order, but there is lots of intervening code in some cases:]

    private static final String atext = "[a-zA-Z0-9\\!\\#\\$\\%\\&\\'\\*\\+\\-\\/\\=\\?\\^\\_\\`\\{\\|\\}\\~" + (ALLOW_DOT_IN_ATEXT ? "\\." : "") + (ALLOW_SQUARE_BRACKETS_IN_ATEXT ? "\\[\\]" : "") + "]";

    private static final String atom = cfws + atext + "+" + cfws;

    private static final String dotAtom = cfws + "(" + dotAtomText + ")" + cfws;

    private static final String quotedString = cfws + dquote + "(" + fwsp + qcontent + ")*" + fwsp + dquote + cfws;

    private static final String domainLiteral = cfws + "\\[" + "(" + fwsp + dcontent + "+)*" + fwsp + "\\]" + cfws;

    private static final String angleAddr = cfws + "<" + addrSpec + ">" + cfws;
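
The nesting limitation mentioned in the comment above comes from the fact that a plain regular expression cannot count parentheses to arbitrary depth. One way around it, sketched here under stated assumptions, is a small hand-written scanner that tracks the nesting depth. `NestedCommentDemo` and `skipComment` are hypothetical names, and the sketch only handles parentheses and backslash escapes (quoted pairs); it does not validate the individual ctext characters.

```java
public class NestedCommentDemo {

    /**
     * Returns the index just past the comment that starts at 'start', or -1 if
     * the text at 'start' is not a balanced, possibly nested, comment.
     */
    public static int skipComment(String s, int start) {
        if (s == null || start < 0 || start >= s.length() || s.charAt(start) != '(') {
            return -1;
        }
        int depth = 0;
        for (int i = start; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++; // quoted pair: skip the escaped character, whatever it is
            } else if (c == '(') {
                depth++; // entering a (possibly nested) comment
            } else if (c == ')') {
                depth--;
                if (depth == 0) {
                    return i + 1; // outermost comment just closed
                }
            }
        }
        return -1; // input ended before the comment was closed
    }

    public static void main(String[] args) {
        String nested = "(Bob (the man) Smith) rest";
        int end = skipComment(nested, 0);
        System.out.println(nested.substring(0, end)); // (Bob (the man) Smith)
    }
}
```

A scanner like this could be run before the regular expression to strip comments, leaving the pattern to deal with the comment-free remainder.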

  11. About that code I submitted: it implements strict 2822, so the addresses:
    bob @example.com
    and
    bobjones(comment)@example.com
    are both valid, even though the spec says you “SHOULD NOT” do that, because CFWS is allowed after the dot-atom on the left side of the @…
    -c

  12. [ooops, this was also posted at http://leshazlewood.com/?p=5 -- Les, perhaps you could erase my previous comments on this page, since this supersedes them. Thanks.]

    Hi there!

    I wanted to let you know that I have taken your code and added a number of features to it. I post the link here in case it’s useful to you or anyone reading this. Essentially it adds a number of functions for extracting addresses (and parts of addresses), as well as verifying whole headers (including group tokens, etc.)

    You can find it (along with documentation, etc) at:

    http://boxbe.com/freebox.html

    Modified/added: removed some functions, added support for CFWS token, corrected FWSP token, added some boolean flags, added getInternetAddress and extractHeaderAddresses and other methods, did some optimization of the regex.

    Where Mr. Hazlewood’s version was more for ensuring certain forms that were passed in during registrations, etc, this handles more types of verifying as well as a few forms of extracting the data in predictable, cleaned-up chunks.

    (I see that you removed my other rambling comments, which I was going to ask you to do anyway. :-) )

    Thanks again,
    -Casey

  13. Should bademail@squeak?.com be considered a valid email address?

    I’m not looking at the pertinent rfc right now, but it doesn’t look like a valid email address to me.

  14. The domain name pattern says that it follows RFC 1035; however, this RFC states the following:

    The labels must follow the rules for ARPANET host names. They must
    start with a letter, end with a letter or digit, and have as interior
    characters only letters, digits, and hyphen. There are also some
    restrictions on the length. Labels must be 63 characters or less.

    So according to this, http://www.3com.com is not a valid domain name; however, it is valid according to the pattern described above.
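
The difference raised in this comment can be made concrete with two label patterns. This is a sketch rather than code from the class: `LabelDemo` and its method names are hypothetical. The relaxed form has the same shape as the class's `rfcLabel` (first character may be a letter or digit), which is also what RFC 1123 later permitted for host names.

```java
import java.util.regex.Pattern;

public class LabelDemo {

    // Strict RFC 1035 label: starts with a letter, ends with a letter or digit,
    // interior characters are letters, digits or hyphens, at most 63 characters.
    private static final Pattern STRICT_LABEL =
            Pattern.compile("[a-zA-Z]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?");

    // Relaxed label (same shape as rfcLabel in the class): may also start with a digit.
    private static final Pattern RELAXED_LABEL =
            Pattern.compile("[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?");

    public static boolean isStrictLabel(String label) {
        return STRICT_LABEL.matcher(label).matches();
    }

    public static boolean isRelaxedLabel(String label) {
        return RELAXED_LABEL.matcher(label).matches();
    }

    public static void main(String[] args) {
        System.out.println(isStrictLabel("3com"));  // false
        System.out.println(isRelaxedLabel("3com")); // true
        System.out.println(isStrictLabel("example"));  // true
        System.out.println(isRelaxedLabel("bad-"));    // false
    }
}
```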

  15. @Nanne

    Thanks for the pointer. I think it is OK to leave the definition as it is, because if I changed it to require that labels start with a letter (rather than a letter _or_ a number), then obviously 3com.com wouldn’t match.

    Clearly this is a domain name resolvable by DNS and has email addresses associated with it, so it’s probably not a good idea to be a 100% reflection of RFC 1035. 99% is good for email, as your example demonstrates ;)

    Cheers,

    Les

  16. Nice work. I would like to thank Les for this stuff, and all the other people who have improved this solution by sharing their points of view and comments.

  17. Pingback: tunagami.com » RegExp, Java, Email, RFC 2822 & YOU!

  18. Hi Les,

    Thanks for this wonderful code. This really helps cut short a lot of googling!

    I wrote the below code to validate email ids (before I stumbled upon your code)

    import org.clapper.util.mail.EmailAddress;
    public class EmailAddress{
    public boolean validateEmail(String input){
    try {
    EmailAddress emailAddress = new EmailAddress(input);
    } catch (EmailException e) {
    return false;
    }
    return true;
    }
    }

    I am trying to understand if there would be a vast difference between using this piece of code vs. your class. Please let me know your thoughts.

    Thanks
    Pavan Tumu

  19. Thanks. This looks great.
    I can’t wait to use it.
    Only problem is when I copy it it is all run together and the lines and indenting are missing so it doesn’t work.
    Can you post a link to the actual class so I can download it complete rather than trying to copy and paste.
    btw, how did anyone else manage to get this? Am I just looking at the wrong page?

    Cheers.

  20. After some hard work reformatting, I got this to work.
    My friend says this is a valid email address:
    “customer/department=shipping@example.com”
    but when I parse it java matcher hangs for about 3 minutes. It eventually returns with the correct response.
    So I have 2 questions.
    1. What’s wrong that java hangs?
    2. Is this a valid email address?
    Thanks again. It works mostly for the rest.

  21. @Hugh

    There isn’t anything wrong per se – I think Java’s RegEx Pattern Matcher becomes slow with very complex regular expressions like the one used for Email validation.

    Cheers,

    Les

  22. The regex seems to get very inefficient with long strings:

    Checking with

    addy = “abcdefghijklmnopqrs@xyz.com” : 1 sec
    addy = “abcdefghijklmnopqrstu@xyz.com” : 3 sec
    addy = “abcdefghijklmnopqrstuv@xyz.com” : 7 sec
    addy = “abcdefghijklmnopqrstuvw@xyz.com” : 14 sec
    addy = “abcdefghijklmnopqrstuvwx@xyz.com” : 27 sec
    addy = “abcdefghijklmnopqrstuvwxy@xyz.com” : 56 sec

    memory did not increase during the runs.
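
A likely explanation for the time roughly doubling with each extra character is backtracking in the pattern itself. In the class, `mailbox` tries `nameAddr` first, `phrase` is `word+`, and each `atom` is `fwsp atext+ fwsp` where `fwsp` can match nothing. When no `<` follows, a backtracking engine may try every way of splitting the local part into atoms before it falls through to the plain `addrSpec` alternative. The sketch below reproduces that shape with a simplified, hypothetical pattern (`BacktrackingDemo`, `GREEDY` and `POSSESSIVE` are not part of the class) and shows possessive quantifiers as one possible mitigation. Timings are not shown because they vary by JDK version and machine.

```java
import java.util.regex.Pattern;

public class BacktrackingDemo {

    // Simplified stand-in for "phrase followed by <": one or more atoms, each with
    // optional surrounding whitespace. The nested quantifiers allow a run of letters
    // to be split into atoms in exponentially many ways.
    static final Pattern GREEDY =
            Pattern.compile("([ \\t]*[a-z]+[ \\t]*)+<.*");

    // Same shape with possessive quantifiers (*+ and ++): once a run has been
    // consumed it is never given back, so a failing input is rejected without
    // trying the alternative splits.
    static final Pattern POSSESSIVE =
            Pattern.compile("([ \\t]*+[a-z]++[ \\t]*+)+<.*");

    public static boolean greedyMatches(String s) {
        return GREEDY.matcher(s).matches();
    }

    public static boolean possessiveMatches(String s) {
        return POSSESSIVE.matcher(s).matches();
    }

    public static void main(String[] args) {
        // No '<' in the input, so both patterns must eventually fail.
        String input = "abcdefghijklmnopqr@xyz.com";

        long t0 = System.nanoTime();
        boolean greedy = greedyMatches(input);
        long t1 = System.nanoTime();
        boolean possessive = possessiveMatches(input);
        long t2 = System.nanoTime();

        System.out.println("greedy=" + greedy + " took " + (t1 - t0) / 1000 + " us");
        System.out.println("possessive=" + possessive + " took " + (t2 - t1) / 1000 + " us");
    }
}
```

For this simplified pattern the two variants accept the same inputs, because whitespace, letters and `<` never overlap. Whether the same change is safe for every token in the full RFC 2822 expression would need to be checked token by token.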

  23. @Ti

    You’re right, it can get slow, but that just means the Java RegularExpression mechanism is not as efficient as it could be. There’s not much I can do about that ;)

    - Les

  24. Has anyone tried an exclamation mark in the email address pattern? a!a.com causes the isValidText method to go into an infinite loop

  25. The class is great. I used a previous version some time ago. Thanks for the work!

    This comment isn’t really about the code though. You can delete the comment if you want. I just thought I should tell you that whoever is delivering your ads is really picking awful ads to put on your site..

    Find anyone’s email
    The JCrew home page.

    If you are using ad revenue to pay for hosting, you might want to find a way to convince the supplier to pick more pertinent ads.

    Good luck. BTW, I did my part. :-)

  26. @Hugh #89438 (and Les):

    I had the same problem as you did copying the code from the web page: it’s all run together into a single line with no line breaks! The way I got around this was to view the source of the page (Ctrl-U in Firefox) and copy the code from the source window, which preserved the line breaks correctly. However, it was then necessary to replace all occurrences of &gt; &lt; and &amp; with > < and &

  27. OK, as I suspected the characters I wanted to appear in my previous post didn’t appear! What I meant to say was that all occurrences of >

  28. Many thanks! Here’s a paragraph I added to my copy:

    // Legal top-level domains:
    // http://data.iana.org/TLD/tlds-alpha-by-domain.txt

    private Set TLDs;

    public EmailValidator() {
    this.TLDs = new HashSet();

    try {
    InputStream tldIs = Resources.getResourceAsStream("properties/tlds-alpha-by-domain.txt");
    DataInputStream tldDis = new DataInputStream(tldIs);
    BufferedReader tldBr = new BufferedReader(new InputStreamReader(tldDis));
    String strLine;
    // Read file line by line
    while ((strLine = tldBr.readLine()) != null) {
    if(StringUtils.isBlank(strLine)){
    continue;
    }
    if("#".equals(strLine.substring(0,1))){
    continue;
    }
    TLDs.add(strLine);
    }
    // Close the input stream
    tldDis.close();
    } catch (Exception e) {// Catch exception if any
    System.err.println("Error: " + e.getMessage());
    }
    }

    public boolean isValidTopLevelDomain(String email){
    if(StringUtils.isBlank(email)){
    return(false);
    }
    String[] domains = email.split("\\.");
    if(domains.length < 2){
    return(false);
    }
    String topLevelDomain = domains[domains.length - 1].toUpperCase();
    if(TLDs.contains(topLevelDomain)){
    return(true);
    }
    return(false);
    }

  29. Unfortunately the running time for long email addresses prohibits the use of your surely great code in most projects…

  30. @Oliver

    That may be the case due to the Java regex parser implementation.

    I think it’s about time to create an open source project for this and focus on speed. We’ll probably need to turn it in to a combo of scanning + regex, but the community can definitely benefit from this.

    Regards,

    Les

  31. Hi again -

    For those having issues with speed and recursion, or needing the ability to extract parts of addresses or headers, I rewrote a lot of the code to help with that:

    http://boxbe.com/freebox.html

    We have used it for several years to relatively-efficiently parse/process several billion in-the-wild addresses. There _are_ still a few lingering hangups on highly unusual addresses (i.e. spam garbage) that bring down a server from time to time (working on that, don’t hold your breath), but it’s otherwise pretty solid. It’s about 3-4 times slower than the (much, much simpler) JavaMail parser.

    I agree with Les that an open-source project is in order (don’t have the time to set it up now, sorry). I was given wise-sounding advice that use of a lexer is in order for serious efficiency (perhaps JFlex? JavaCC?), since no solution based on regex is ever going to be fast and accurate both, given the formidable (some say “insanely unparsable”) flexibility of 2822.

    Hope that helps.
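
As a rough illustration of the scanning approach suggested in the last few comments, here is a minimal single-pass check for dot-atom-text, the unquoted form of the local part. It is a sketch under assumptions, not a replacement for the class: `DotAtomScanner` and its methods are hypothetical names, the atext specials are copied from the `atext` constant in the class, and quoted strings, CFWS and domain literals are not handled at all.

```java
public class DotAtomScanner {

    // The atext special characters from RFC 2822 section 3.2.4;
    // letters and digits are checked separately.
    private static final String ATEXT_SPECIALS = "!#$%&'*+-/=?^_`{|}~";

    static boolean isAtext(char c) {
        return (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')
                || ATEXT_SPECIALS.indexOf(c) >= 0;
    }

    /** True if s is dot-atom-text: runs of atext separated by single dots. */
    public static boolean isDotAtomText(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        // Start as if a dot had just been seen, so that a leading dot is rejected.
        boolean previousWasDot = true;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '.') {
                if (previousWasDot) {
                    return false; // leading dot or two dots in a row
                }
                previousWasDot = true;
            } else if (isAtext(c)) {
                previousWasDot = false;
            } else {
                return false; // character not allowed outside a quoted string
            }
        }
        return !previousWasDot; // a trailing dot is not allowed
    }

    public static void main(String[] args) {
        System.out.println(isDotAtomText("customer/department=shipping")); // true
        System.out.println(isDotAtomText("host."));                        // false
    }
}
```

Each character is visited once, so the running time grows linearly with the input length regardless of its content.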
