Email Validation using Regular Expressions (the Right Way)
UPDATE: This article was updated on February 1st, 2008 to account for domain literals and quoted strings such as "John Smith" <john.smith@somewhere.com>. It is now effectively the only complete and semantically correct email validator for Java.
PETTY REQUEST: The update required considerably more effort than the original as it now accounts for all valid RFC parsing conditions. Because of this, and that this page is easily my most visited, I'd appreciate it if you could show your appreciation by hooking a brother up and clicking on some ads. It helps pay for my hosting. Thanks!
In Object-Oriented design, I'm a firm believer in modeling things the way they truly exist (inasmuch as is possible given abstraction and time restrictions). So, whenever I design a system's domain model, I create Classes that represent entities as they exist in real life. That being said, I've accrued a nice library of Classes that I reuse in a lot of projects.
For example, I don't save or reference an email address as a String: strings as objects don't tell me anything about the email address itself, like whether it's valid, whether it's bouncing, whether it has been verified by the user with which it is associated, and so on. As such, I have created an EmailAddress class to represent this information. Doing this is a small example of the beauty of OO over functional programming.
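A minimal sketch of what such a class might look like. The field names (verified, bouncing) and the deliberately loose placeholder pattern are illustrative assumptions, not the actual class; the real pattern is the RFC 2822 one built further down.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of an EmailAddress domain class; all names are assumptions.
class EmailAddress {
    // Placeholder only: stands in for the RFC 2822 pattern built later in the article.
    private static final Pattern VALID_PATTERN = Pattern.compile("[^@\\s]+@[^@\\s]+");

    private final String text;   // the raw address as entered
    private boolean verified;    // has the owning user confirmed it?
    private boolean bouncing;    // have deliveries to it bounced?

    EmailAddress(String text) {
        this.text = text;
    }

    String getText() { return text; }
    boolean isVerified() { return verified; }
    void setVerified(boolean verified) { this.verified = verified; }
    boolean isBouncing() { return bouncing; }
    void setBouncing(boolean bouncing) { this.bouncing = bouncing; }

    // Static helper: checks any raw string against the pattern.
    static boolean isValid(String candidate) {
        return candidate != null && VALID_PATTERN.matcher(candidate).matches();
    }

    // Instance check: delegates to the static helper with this address's text.
    boolean isValid() {
        return isValid(getText());
    }

    public static void main(String[] args) {
        EmailAddress address = new EmailAddress("john.smith@somewhere.com");
        address.setVerified(true);
        System.out.println(address.isValid() + " " + address.isVerified() + " " + address.isBouncing());
    }
}
```

The point of the design is that state such as "verified" or "bouncing" travels with the address, which a bare String cannot do.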
Anyway, I was a little lax in the past in my validation logic. This time on my last project, I was determined to get things right once and for all.
I googled quite a while for the Right Way to validate an email address. In my opinion, there is only one Right Way - the RFC 2822 way. This is the standard after all.
I never came across anything I was happy with. All the responses seemed to be Perl or PHP variant regular expressions or some horribly convoluted text string nearly impossible to decipher. I was disappointed to see so many interpretations of a standard. I mean, c'mon people, it's written in pure black and white!!!
I guess the old adage "If you want something done right, you've got to do it yourself" resonated in my head this time. I actually took the time out to read the RFC (something I hadn't done in a long while, probably since college).
After reading the RFC, I translated the grammar into usable, *readable* source code that now resides in my EmailAddress class, and I've included it below for the benefit of anyone that wishes to use it. It is written in Java, but the same code could be replicated in C# or PHP or whatever. Just keep it clean!
N.B.: Look at the first two constants, ALLOW_DOMAIN_LITERALS and ALLOW_QUOTED_IDENTIFIERS, and enable or disable them as you see fit for your application.
/*
* Copyright 2008 Les Hazlewood
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
* This constant states that domain literals are allowed in the email address, e.g.:
*
* someone@[192.168.1.100] or
* john.doe@[23:33:A2:22:16:1F] or
* me@[my computer]
*
* The RFC says these are valid email addresses, but most people don't like allowing them.
* If you don't want to allow them, and only want to allow valid domain names
* (RFC 1035, x.y.z.com, etc),
* change this constant to false.
*
* Its default value is true to remain RFC 2822 compliant, but
* you should set it depending on what you need for your application.
*/
private static final boolean ALLOW_DOMAIN_LITERALS = true;
/**
* This constant states that quoted identifiers
* (using quotes and angle brackets around the raw address) are allowed, e.g.:
*
* "John Smith" <john.smith@somewhere.com>
*
* The RFC says this is a valid mailbox. If you don't want to
* allow this, because for example, you only want users to enter in
* a raw address (john.smith@somewhere.com - no quotes or angle
* brackets), then change this constant to false.
*
* Its default value is true to remain RFC 2822 compliant, but
* you should set it depending on what you need for your application.
*/
private static final boolean ALLOW_QUOTED_IDENTIFIERS = true;
// RFC 2822 2.2.2 Structured Header Field Bodies
private static final String wsp = "[ \\t]"; //space or tab
private static final String fwsp = wsp + "*";
//RFC 2822 3.2.1 Primitive tokens
private static final String dquote = "\\\"";
//ASCII Control characters excluding white space:
private static final String noWsCtl = "\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F";
//all ASCII characters except CR and LF:
private static final String asciiText = "[\\x01-\\x09\\x0B\\x0C\\x0E-\\x7F]";
// RFC 2822 3.2.2 Quoted characters:
//single backslash followed by a text char
private static final String quotedPair = "(\\\\" + asciiText + ")";
//RFC 2822 3.2.4 Atom:
private static final String atext = "[a-zA-Z0-9\\!\\#\\$\\%\\&\\'\\*\\+\\-\\/\\=\\?\\^\\_\\`\\{\\|\\}\\~]";
private static final String atom = fwsp + atext + "+" + fwsp;
private static final String dotAtomText = atext + "+" + "(" + "\\." + atext + "+)*";
private static final String dotAtom = fwsp + "(" + dotAtomText + ")" + fwsp;
//RFC 2822 3.2.5 Quoted strings:
//noWsCtl and the rest of ASCII except the doublequote and backslash characters:
private static final String qtext = "[" + noWsCtl + "\\x21\\x23-\\x5B\\x5D-\\x7E]";
private static final String qcontent = "(" + qtext + "|" + quotedPair + ")";
private static final String quotedString = dquote + "(" + fwsp + qcontent + ")*" + fwsp + dquote;
//RFC 2822 3.2.6 Miscellaneous tokens
private static final String word = "((" + atom + ")|(" + quotedString + "))";
private static final String phrase = word + "+"; //one or more words.
//RFC 1035 tokens for domain names:
private static final String letter = "[a-zA-Z]";
private static final String letDig = "[a-zA-Z0-9]";
private static final String letDigHyp = "[a-zA-Z0-9-]";
private static final String rfcLabel = letDig + "(" + letDigHyp + "{0,61}" + letDig + ")?";
private static final String rfc1035DomainName = rfcLabel + "(\\." + rfcLabel + ")*\\." + letter + "{2,6}";
//RFC 2822 3.4 Address specification
//domain text - non white space controls and the rest of ASCII chars not including [, ], or \:
private static final String dtext = "[" + noWsCtl + "\\x21-\\x5A\\x5E-\\x7E]";
private static final String dcontent = dtext + "|" + quotedPair;
private static final String domainLiteral = "\\[" + "(" + fwsp + dcontent + "+)*" + fwsp + "\\]";
private static final String rfc2822Domain = "(" + dotAtom + "|" + domainLiteral + ")";
private static final String domain = ALLOW_DOMAIN_LITERALS ? rfc2822Domain : rfc1035DomainName;
private static final String localPart = "((" + dotAtom + ")|(" + quotedString + "))";
private static final String addrSpec = localPart + "@" + domain;
private static final String angleAddr = "<" + addrSpec + ">";
private static final String nameAddr = "(" + phrase + ")?" + fwsp + angleAddr;
private static final String mailbox = nameAddr + "|" + addrSpec;
//now compile a pattern for efficient re-use:
//if we're allowing quoted identifiers or not:
private static final String patternString = ALLOW_QUOTED_IDENTIFIERS ? mailbox : addrSpec;
public static final Pattern VALID_PATTERN = Pattern.compile(patternString);
Anyway, the above Java code allows you to do things like the following.
In the EmailAddress class, you can have a method:
public static boolean isValid( String userEnteredEmailString ) {
    return VALID_PATTERN.matcher( userEnteredEmailString ).matches();
}
Then you can write validation logic wherever you want (hopefully in a dedicated Validator):
if ( !EmailAddress.isValid( userEnteredEmailString ) ) {
    throw new InvalidFormatException( "Invalid e-mail format!" );
}
Better yet, if you want to see if any email address instance is valid, the EmailAddress class has the following method that you can use for 'pure' OO 'messaging' (i.e. a method invoked on an object is a 'message' from the calling object to the target object):
public boolean isValid() {
    //use static method call as helper w/ class attribute 'text'
    return isValid( getText() );
}
which enables you to do checks this way (this is 'pure' OO):
if ( anEmailAddressInstance.isValid() ) {
    //do something
} else {
    //do something else
}
Happy validating!
April 4th, 2006 - 17:56
Thanks Les for doing the hard work of implementing RFC 2822. I don’t know why there are so many personal interpretations of what a valid email address is or why so few actually bothered with the RFC standard.
Anyway thanks. I have just one minor correction and that is Pattern does not have:
boolean Pattern.matches(String s)
You need to create:
matcher = Pattern.matcher( CharSequence)
and then return matcher.matches().
Steven
April 4th, 2006 - 21:47
Ah, yes, thanks very much for pointing that out
I’ve updated the blog entry accordingly.
Cheers,
Les
October 31st, 2006 - 02:47
I tried using this code. But it’s saying a@b is a valid email address. Is it?!
November 6th, 2006 - 11:41
Hi Bupesh,
a@b is not a valid email address. But the code works as expected – I just used a@b through a simple test:
EmailAddress emailAddy = new EmailAddress( "a@b" );
if ( !emailAddy.isValid() ) {
    System.out.println( "Email is not valid!" );
} else {
    System.out.println( "Email is valid" );
}
When I ran that code block, my console printed: “Email is not valid!”.
So the code works as expected.
Cheers,
Les
December 24th, 2006 - 01:14
Will it work for “.a@bbb.com”? Actually it shouldn’t work, but it does!
January 15th, 2007 - 02:53
Thanks,
This code saved me a lot of time.
I am having one strange thing happen. This seems to accept first,last@site.com as a valid email address. I don’t see a comma in any of the patterns but yet it is accepting a comma as valid in localpart. Any ideas?
Thanks again,
Al Medeiros
February 2nd, 2007 - 08:03
I think I found an error in your expression, as it allows an email address to start with a single quote (‘).
Which is surely not valid; JavaMail doesn’t accept it.
February 10th, 2007 - 15:41
@Nithya
.a@bbb.com does not show up as a valid email address.
My very simple test program tells me it is invalid, so the expression is correct.
For example, the following code does in fact print out “Invalid email.”:
String email = ".a@bbb.com";
if ( EmailAddress.isValidText( email ) ) {
    System.out.println("Valid email!");
} else {
    System.out.println("Invalid email.");
}
February 10th, 2007 - 15:45
@Al
The expression is correct. first,last@site.com is not a valid address, as you point out. This code chunk does print out “Invalid email.”:
String email = "first,last@site.com";
if ( EmailAddress.isValidText( email ) ) {
    System.out.println("Valid email!");
} else {
    System.out.println("Invalid email.");
}
February 10th, 2007 - 15:49
@Hans
The above regular expression is still correct. An email address, per the RFC 2822 spec, is allowed to start with a single quote, or any other character in the atext constant above.
Javamail doesn’t have any internal email address validation that I’m aware of, so Javamail isn’t denying the email per se – it is probably your underlying email server that javamail connects to that is saying the email is invalid. In this case, the email server is wrong – at least according to the RFC spec. The expression is still accurate.
Cheers,
Les
March 23rd, 2007 - 01:16
thanks for the neatly written code.
but it does not validate the email address ending with an IP , such as don@[18.138.9.10]
Isn’t this a valid mail id???
March 23rd, 2007 - 03:26
@Kumar,
You’re absolutely correct. don@[18.138.9.10] is a valid email address. So is a quoted identifier, i.e. “Don Somebody” <don@[18.138.9.10]>, but the expression does not account for these 2 cases. I’ll add them in soon. Thanks!
April 5th, 2007 - 08:05
Les did you add these in already?
April 5th, 2007 - 08:56
@Sateesh
Nope, not yet. I haven’t had the time (I've been on a consulting engagement in Dublin, Ireland for the last 2 months). I hope to address these issues now that I’m back home in the States.
Cheers,
Les
April 30th, 2007 - 07:25
Great work.
thanks!
May 13th, 2007 - 12:46
Hi Les. I am Brazilian and I am creating a very simple framework to help with validations of specific Brazilian formats like social security number. Even though e-mail isn't one of them, I am adding some other basic validation functions which include e-mail.
It is amazing that there isn't a framework like that with minimal documentation already.
Anyway, I will publish that very simple framework at sourceforge.net and I was wondering if I could use your code above (regular expression part) in it and put the credits on the javadoc header. It would take me quite some time to do the same thing again myself; can I use yours?
There is no profit involved, just a simple framework that I did for myself and will publish since it will probably be of use to other people in my country.
Cheers,
Thiago.
June 5th, 2007 - 15:40
What about e-mail addresses such as:
whomever@u.washington.edu ?
The regex says that this is not valid. Yet u.washington.edu is a valid domain. (As is the similar “u.arizona.edu”.) The regex doesn’t like the lone “u”.
Sean
June 5th, 2007 - 17:31
@Sean
You’re absolutely right! Thanks for catching that. I’ve updated the blog entry accordingly (the rfcLabel definition specifically).
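For anyone curious why a single-letter label such as "u" passes with the rfcLabel definition shown in the article, here is a small, hedged sketch that isolates just that piece. The class and method names are made up for illustration; the pattern strings are copied from the article.

```java
import java.util.regex.Pattern;

// Illustrative only: isolates the rfcLabel fragment from the article.
class RfcLabelDemo {
    private static final String LET_DIG = "[a-zA-Z0-9]";
    private static final String LET_DIG_HYP = "[a-zA-Z0-9-]";
    // The trailing group is optional, so a one-character label such as "u" matches.
    // When present, the group must end in a letter or digit, so "bad-" is rejected.
    private static final String RFC_LABEL =
            LET_DIG + "(" + LET_DIG_HYP + "{0,61}" + LET_DIG + ")?";
    private static final Pattern LABEL = Pattern.compile(RFC_LABEL);

    static boolean isLabel(String candidate) {
        return LABEL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"u", "washington", "bad-", "-bad"}) {
            System.out.println(s + " -> " + isLabel(s));
        }
    }
}
```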
June 14th, 2007 - 03:29
This one makes sense: “One’s first step in wisdom is to question everything – and one’s last is to come to terms with everything.”
July 4th, 2007 - 01:02
JavaMail does have email address validation, see http://java.sun.com/products/javamail/javadocs/javax/mail/internet/InternetAddress.html#validate() and the source for that at https://glassfish.dev.java.net/source/browse/glassfish/mail/src/java/javax/mail/internet/InternetAddress.java?rev=1.6&view=markup . However, as they say, “The current implementation checks many, but not all, syntax rules.” So there’s still room for your implementation.
What do you guys think of the paragraph about the RFC on http://www.regular-expressions.info/email.html ?
July 19th, 2007 - 13:46
Wow, yeah, great job.
August 14th, 2007 - 14:12
Thank you very mucho, great job. No more personal interpretation.
August 22nd, 2007 - 21:00
Thank you very much. Saved me a lot of headaches.
November 7th, 2007 - 13:58
Thank you Les. You saved a lot of time for every one of us who is trying to validate emails. Gr8 work. This works like a charm. Thanks again.
December 15th, 2007 - 02:52
Les, thank you so much for this wonderfully simple, cleanly written and elegant email validator for Java. Now it will be easy to validate emails against the actual spec rather than some home-baked approximation. You rock! I think you have written THE canonical implementation for Java.
January 11th, 2008 - 03:28
This is a great job. I was having trouble creating a regular expression for email validation, but this solved all my problems.
Thanks a lot.
February 24th, 2008 - 18:09
What about e-mail addresses containing punycode? It’s a replacement encoding for internationalized domain names (IDN) like so-called “umlaut domains” (using ä, ö, ü, etc.).
See RFC 3492 for details.
Example: mq@ยจฆฟคฏข.tld -> me@xn-22cdfh1b8fsa.tld (this is a valid punycode representation for an IDN)
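One way to handle this, sketched here as an assumption rather than anything the article's class does, is to convert the domain to its ASCII (Punycode) form with the JDK's java.net.IDN before running the regular expression. The helper name toAsciiAddress is hypothetical, and splitting on the last '@' assumes a raw address with no '@' inside a domain literal. Encoded labels carry the "xn--" prefix (two hyphens).

```java
import java.net.IDN;

// Illustrative only: converts the domain part of an address to ASCII before validation.
class IdnDomainDemo {
    // Hypothetical helper; leaves the local part untouched.
    static String toAsciiAddress(String address) {
        int at = address.lastIndexOf('@');
        if (at < 0) {
            return address; // no domain part to convert
        }
        String local = address.substring(0, at);
        String domain = address.substring(at + 1);
        // IDN.toASCII throws IllegalArgumentException if the domain cannot be converted.
        return local + "@" + IDN.toASCII(domain);
    }

    public static void main(String[] args) {
        // The escape in the string below is the letter u with an umlaut.
        System.out.println(toAsciiAddress("me@b\u00fccher.example"));
    }
}
```

The converted string contains only letters, digits, hyphens and dots in the domain, so it can then be handed to the validator as usual.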
May 12th, 2008 - 15:19
Hi there!
I wanted to let you know that I have taken your code and added a number of features to it. I post the link here in case it’s useful to you or anyone reading this. Essentially it adds a number of functions for extracting addresses (and parts of addresses), as well as verifying whole headers (including group tokens, etc.)
You can find it (along with documentation, etc) at:
http://boxbe.com/freebox.html
Modified/added: removed some functions, added support for CFWS token, corrected FWSP token, added some boolean flags, added getInternetAddress and extractHeaderAddresses and other methods, did some optimization of the regex.
Where Mr. Hazlewood’s version was more for ensuring certain forms that were passed in during registrations, etc, this handles more types of verifying as well a few forms of extracting the data in predictable, cleaned-up chunks.
(I see that you removed my other rambling comments, which I was going to ask you to do anyway.)
Thanks again,
-Casey
October 16th, 2008 - 02:52
Hello
This email shows valid: test;test@example.com
But is it really? When I try to send to this I get:
javax.mail.internet.AddressException: Illegal semicolon, not in group in string “test;test@example.com”
thanks
Ioannis
November 13th, 2008 - 00:32
Was about to start on this but you probably saved me a couple of hours. Thanks a lot!
November 23rd, 2008 - 12:15
Wow, thank you. This is very helpful.
According to Wikipedia (for what it’s worth) this address is valid:
abc+mailbox/department=shipping@example.com
It seems to cause the pattern matcher to go into an endless loop.
A similar address:
abc+mailbox/department.shipping@example.com
takes just over 7.5 seconds to validate.
The combination of ‘+’ and ‘=’ seems to be what is causing the problems.
February 2nd, 2009 - 22:13
Great work.
Just in short: setting ALLOW_DOMAIN_LITERALS will
validate a@b as valid.
Regards
March 24th, 2009 - 08:52
Has this code been updated to comply with RFC 5322 (http://www.ietf.org/rfc/rfc5322.txt), which supersedes RFC 2822?
July 16th, 2009 - 17:42
Thanks so much!!! Clicked on some ads for you too
October 13th, 2009 - 16:57
I should mention that if ALLOW_DOMAIN_LITERALS = true;
then a@b is valid, but if ALLOW_DOMAIN_LITERALS = false; then a@b is not valid.
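A small sketch of why the flag changes the answer for a@b. It assumes the article's definitions: with the flag on, the domain may be any RFC 2822 dot-atom (the domain-literal branch is left out here for brevity), and with it off, the domain must match the stricter RFC 1035 pattern that requires a dot and a 2 to 6 letter ending. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Illustrative only: compares the two domain patterns selected by ALLOW_DOMAIN_LITERALS.
class DomainToggleDemo {
    private static final String ATEXT =
            "[a-zA-Z0-9\\!\\#\\$\\%\\&\\'\\*\\+\\-\\/\\=\\?\\^\\_\\`\\{\\|\\}\\~]";
    // RFC 2822 dot-atom text: a bare label such as "b" is enough.
    private static final String DOT_ATOM_TEXT = ATEXT + "+(\\." + ATEXT + "+)*";
    private static final String LET_DIG = "[a-zA-Z0-9]";
    private static final String RFC_LABEL = LET_DIG + "([a-zA-Z0-9-]{0,61}" + LET_DIG + ")?";
    // RFC 1035 style name as written in the article: at least one dot plus a letter-only ending.
    private static final String RFC1035_DOMAIN =
            RFC_LABEL + "(\\." + RFC_LABEL + ")*\\.[a-zA-Z]{2,6}";

    static boolean matchesDomain(String domain, boolean allowDomainLiterals) {
        String pattern = allowDomainLiterals ? DOT_ATOM_TEXT : RFC1035_DOMAIN;
        return Pattern.matches(pattern, domain);
    }

    public static void main(String[] args) {
        System.out.println(matchesDomain("b", true));
        System.out.println(matchesDomain("b", false));
        System.out.println(matchesDomain("b.com", false));
    }
}
```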
October 27th, 2009 - 07:51
Well, the best way to do it using Java is as follows:
————
String email = "muhdadeel@yahoo.com";
Pattern p = Pattern.compile(".+@.+\\.[a-z]+");
Matcher m = p.matcher(email);
boolean matchFound = m.matches();
// we have to make sure the user doesn't enter only a@b.c, since it should be at least a@b.cc
StringTokenizer st = new StringTokenizer(email, ".");
String lastToken = null;
while (st.hasMoreTokens())
{
    lastToken = st.nextToken();
}
if (matchFound && lastToken.length() >= 2)
{
    out.println("Valid Email");
}
else
{
    out.println("sorry,invalid");
}
That’s the best way, pals…
November 10th, 2009 - 15:19
Les, I had a requirement to allow non-ASCII letters (acutes, umlauts, and such). I replaced every instance of a-zA-Z with \\p{L} and changed the final compile step to Pattern.compile(patternString, Pattern.UNICODE_CASE). I don’t know if this deviates from the spec (too lazy) but I thought I’d pass it on. Thanks for the great class!
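A hedged sketch of that substitution, limited to the dot-atom part of the grammar. Two notes: RFC 2822 itself restricts atext to ASCII, so this does deviate from that spec; and in java.util.regex the \p{L} class matches Unicode letters without any flag, while UNICODE_CASE only changes case-insensitive matching when combined with CASE_INSENSITIVE. The class and method names below are made up for illustration.

```java
import java.util.regex.Pattern;

// Illustrative only: atext with a-zA-Z swapped for the Unicode letter class.
class UnicodeAtextDemo {
    // \p{L} matches a letter from any script; the remaining characters are the article's atext set.
    private static final String ATEXT =
            "[\\p{L}0-9\\!\\#\\$\\%\\&\\'\\*\\+\\-\\/\\=\\?\\^\\_\\`\\{\\|\\}\\~]";
    private static final Pattern DOT_ATOM_TEXT =
            Pattern.compile(ATEXT + "+(\\." + ATEXT + "+)*");

    static boolean isDotAtom(String candidate) {
        return DOT_ATOM_TEXT.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        // The escapes are precomposed o-umlaut and u-umlaut.
        System.out.println(isDotAtom("j\u00f6rg.m\u00fcller"));
        System.out.println(isDotAtom("j rg"));
    }
}
```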
December 23rd, 2009 - 09:08
I incorporated this into my project 3 months ago. Today I give it the string “sdlkfjaklsdfjaskldfjaslkdjfflasda@sdffjfj” and it locks up! When breaking into the debugger, I’ve got a massive call-stack. Here’s just a small fraction of it:
java.util.regex.Pattern$6.isSatisfiedBy(Pattern.java:4763)
java.util.regex.Pattern$6.isSatisfiedBy(Pattern.java:4763)
java.util.regex.Pattern$6.isSatisfiedBy(Pattern.java:4763)
java.util.regex.Pattern$6.isSatisfiedBy(Pattern.java:4763)
java.util.regex.Pattern$6.isSatisfiedBy(Pattern.java:4763)
java.util.regex.Pattern$6.isSatisfiedBy(Pattern.java:4763)
java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
java.util.regex.Pattern$Curly.match0(Pattern.java:3760)
java.util.regex.Pattern$Curly.match(Pattern.java:3744)
java.util.regex.Pattern$Curly.match0(Pattern.java:3789)
java.util.regex.Pattern$Curly.match(Pattern.java:3744)
February 10th, 2010 - 09:06
I’ve got the same error as PUK.
For a 50-character email address the code just hangs. You’ll be amazed to see that when running it in a web application the entire server will hang.
Any ideas?
February 10th, 2010 - 09:24
@Puk and @Horia
Clearly the Java RegExp Parser is having problems with long/complex regular expressions – nothing I can do about that
My best advice would be to break the regular expression into 2 parts, split at the ‘@’ character. Perform one regex for just the localPart and then another for the domain. That should reduce the complexity a decent amount – but I still have no idea how the parser would perform – does anyone want to test this?
Cheers,
Les
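The split suggested above could be sketched roughly as follows. This is a simplified illustration, not the article's full grammar: it assumes a raw address (no display name or angle brackets), handles only dot-atom local parts and domains (no quoted strings or domain literals), and splits on the last '@'. All names are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative only: validates the local part and the domain with two separate patterns.
class SplitValidationDemo {
    private static final String ATEXT =
            "[a-zA-Z0-9\\!\\#\\$\\%\\&\\'\\*\\+\\-\\/\\=\\?\\^\\_\\`\\{\\|\\}\\~]";
    private static final String DOT_ATOM_TEXT = ATEXT + "+(\\." + ATEXT + "+)*";
    private static final Pattern LOCAL_PART = Pattern.compile(DOT_ATOM_TEXT);
    private static final Pattern DOMAIN = Pattern.compile(DOT_ATOM_TEXT);

    static boolean isValid(String address) {
        int at = address.lastIndexOf('@');
        // Reject a missing '@', an empty local part, or an empty domain.
        if (at <= 0 || at == address.length() - 1) {
            return false;
        }
        return LOCAL_PART.matcher(address.substring(0, at)).matches()
                && DOMAIN.matcher(address.substring(at + 1)).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("john.smith@somewhere.com"));
        System.out.println(isValid("first,last@site.com"));
        System.out.println(isValid("sdlkfjaklsdfjaskldfjaslkdjfflasda@sdffjfj"));
    }
}
```

Each half is matched by a pattern with no repetition nested inside another repetition over the same characters, so long inputs like the one reported above should not trigger the runaway behaviour.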
February 24th, 2010 - 07:27
hi
http://java.sun.com/products/javamail/javadocs/javax/mail/internet/InternetAddress.html#validate%28%29
this doesn’t work to validate an email’s format ??
February 24th, 2010 - 09:57
@Phil
It doesn’t validate as accurately as the one shown in this blog. It does cursory validation only.
Regards,
Les
April 26th, 2010 - 07:52
After running this in production for 2.5 years I validate @Puk and @Horia’s problem. The regex “freezes” for input “weneedetedoucaccionthatwillsavetheland”.
I found out that changing ALLOW_DOMAIN_LITERALS to false solves the problem.
April 26th, 2010 - 07:55
Correction: changing ALLOW_DOMAIN_LITERALS does not solve this. Sorry for the false alert.
June 5th, 2010 - 07:23
The following email addresses take an extremely long time to be evaluated:
protectionandsecurityhqrm2@saps.org
nienetyninecarolinestreet@yahoo.fr
Any idea why?
June 5th, 2010 - 07:58
Here’s another one
NO-MEDIA-OR-RHINOMARKETING-CALLS@CONTROLALTDELETE.CO.ZA
Enjoy.
June 5th, 2010 - 11:59
Ha! Got it.
June 5th, 2010 - 12:21
MARKETING-CALLS@NO-MEDIA-OR-RHINOMARKETING-CALLSCONTROLALTDELETE.CO.ZA
evaluates fine.
MARKETING-CALLS@NO-MEDIA-OR-RHINOMARKETING-CALLSCONTROLALT.DELETENO-MEDIA-OR-RHINOMARKETING-CALLSCONTROLALTDELETE.CO.ZA
even faster
The problem lies with the LocalPart and its length, not with the domain.
I believe the LocalPart can be broken into parts+domain, and if the parts evaluate fine, so will the whole, and the performance of the evaluation will increase.
Not sure what the criteria would be for breaking the LocalPart into chunks.
NO-MEDIA-OR-RHINOMARKETING-CALLS@CONTROLALTDELETE.CO.ZA
could be split into
NO-MEDIA-OR-RHINO@CONTROLALTDELETE.CO.ZA
and
MARKETING-CALLS@CONTROLALTDELETE.CO.ZA
If both evaluate fine, then the whole is fine.
I guess where to put the split is not trivial with the more exotic LocalParts.
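The slowdowns reported in this thread look consistent with catastrophic backtracking rather than with any particular character. In the article's grammar, phrase is word+ and each word is an atom containing atext+, so a long unbroken run of atext characters can be divided between repetitions in exponentially many ways before the name-addr alternative finally gives up at the '@'. One possible mitigation, sketched here for a simplified grammar (dot-atoms only, no quoted strings or domain literals), is to use Java's possessive quantifiers (*+ and ++), which never give back what they have matched. The class name and constants are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative only: a simplified mailbox pattern using possessive quantifiers.
class PossessiveMailboxDemo {
    private static final String ATEXT =
            "[a-zA-Z0-9\\!\\#\\$\\%\\&\\'\\*\\+\\-\\/\\=\\?\\^\\_\\`\\{\\|\\}\\~]";
    // "*+" is possessive: whitespace, once consumed, is never handed back.
    private static final String FWSP = "[ \\t]*+";
    // "++" is possessive: a run of atext is always taken as one indivisible chunk.
    private static final String ATOM = FWSP + ATEXT + "++" + FWSP;
    private static final String PHRASE = "(" + ATOM + ")+";
    private static final String DOT_ATOM =
            FWSP + ATEXT + "++(\\." + ATEXT + "++)*" + FWSP;
    private static final String ADDR_SPEC = DOT_ATOM + "@" + DOT_ATOM;
    private static final String NAME_ADDR =
            "(" + PHRASE + ")?" + FWSP + "<" + ADDR_SPEC + ">";
    private static final Pattern MAILBOX = Pattern.compile(NAME_ADDR + "|" + ADDR_SPEC);

    static boolean isValid(String candidate) {
        return MAILBOX.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("John Smith <john.smith@somewhere.com>"));
        System.out.println(isValid("abc+mailbox/department=shipping@example.com"));
        System.out.println(isValid("weneedetedoucaccionthatwillsavetheland"));
    }
}
```

Because each run of atext can only be consumed in one way, the name-addr alternative fails after a single pass and the matcher moves on to the plain addr-spec alternative. Whether the same change is safe for the full grammar with quoted strings and domain literals would need to be tested separately.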