What Is a Regular Expression (Regex)?
A regular expression (often shortened to regex or regexp) is a sequence of characters that defines a search pattern. Regular expressions are used across virtually every programming language and text editor to find, match, and manipulate strings of text. They are incredibly powerful for pattern matching tasks such as validating user input, searching through large bodies of text, or extracting specific data like email addresses, phone numbers, or URLs.
At its core, a regex works by describing the structure of the text you want to match. Instead of searching for a specific word, you describe the pattern of characters. For example, the regex \d{3} matches any three consecutive digits, whether that is "123", "456", or "789".
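As a minimal sketch of the \d{3} example, the following Python snippet finds every run of three consecutive digits in a made-up sample string (the variable names are illustrative only):

```python
import re

# \d{3} matches exactly three consecutive digits.
three_digits = re.compile(r"\d{3}")

sample = "Order 123 shipped, ref 456-789."  # made-up sample text
found = three_digits.findall(sample)
print(found)  # ['123', '456', '789']
```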
When it comes to email extraction, regex is the go-to method. Every email address follows a predictable structure: a local part, the @ symbol, and a domain. This predictable format makes email addresses ideal candidates for regex matching.
The Standard Email Regex Pattern
The most commonly used email regex pattern is:
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
This pattern strikes a balance between accuracy and simplicity. It will correctly match the vast majority of real-world email addresses without being overly complex. Let us break down exactly what each part does.
Breaking Down Each Part
1. The Local Part: [a-zA-Z0-9._%+-]+
This is the portion before the @ symbol (e.g., "john.doe" in john.doe@example.com). The character class [a-zA-Z0-9._%+-] matches:
- a-z – lowercase letters
- A-Z – uppercase letters
- 0-9 – digits
- . – periods (dots)
- _ – underscores
- % – percent signs
- + – plus signs
- - – hyphens
The + quantifier after the bracket means "one or more of these characters." So the local part must contain at least one character from the allowed set.
2. The At Symbol: @
This is a literal match for the @ character. Apart from the rare quoted local parts covered later, an email address contains exactly one @ symbol, which separates the local part from the domain. In regex, the @ character has no special meaning, so it matches itself directly.
3. The Domain Name: [a-zA-Z0-9.-]+
This matches the domain portion (e.g., "example" or "mail.example" in user@mail.example.com). The allowed characters are:
- a-z, A-Z – letters
- 0-9 – digits
- . – periods (for subdomains like mail.example)
- - – hyphens (common in domain names like my-company.com)
Again, the + quantifier requires at least one character.
4. The Dot Before the TLD: \.
This matches a literal period (dot) character. The backslash is necessary because in regex, an unescaped . matches any character (by default, anything except a line break). The \. ensures we match only an actual dot, which separates the domain name from the top-level domain.
5. The Top-Level Domain (TLD): [a-zA-Z]{2,}
This matches the TLD such as "com", "org", "net", "de", or newer TLDs like "technology" or "email". The {2,} quantifier means "two or more characters," which ensures we match valid TLDs (the shortest ones like ".ai" or ".uk" have two characters) while also accommodating longer TLDs like ".museum" or ".photography".
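To see the five parts working together, here is a sketch that writes the same pattern in Python's verbose mode with one named group per part. The group names (local, domain, tld) are assumptions added for illustration and are not part of the standard pattern.

```python
import re

# The standard pattern, split with re.VERBOSE so each part can be labelled.
# Group names are illustrative additions, not part of the original pattern.
EMAIL_PARTS = re.compile(
    r"""
    (?P<local>[a-zA-Z0-9._%+-]+)   # 1. local part: one or more allowed characters
    @                              # 2. literal at symbol
    (?P<domain>[a-zA-Z0-9.-]+)     # 3. domain name, including any subdomains
    \.                             # 4. literal dot before the TLD
    (?P<tld>[a-zA-Z]{2,})          # 5. top-level domain: two or more letters
    """,
    re.VERBOSE,
)

match = EMAIL_PARTS.fullmatch("john.doe@mail.example.com")
parts = match.groupdict()
print(parts)
# {'local': 'john.doe', 'domain': 'mail.example', 'tld': 'com'}
```

The domain group is greedy, so it first consumes everything after the @ and then backtracks until a dot followed by at least two letters is left for the TLD.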
Common Regex Variations
Simple and Permissive
If you just need a quick check and do not care about edge cases:
.+@.+\..+
This matches "anything, then @, then anything, then a dot, then anything." It is fast but will match many invalid strings.
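A quick sketch of how permissive this pattern is, using a handful of made-up strings. Only the first one is a plausible email address, yet three of the four are accepted:

```python
import re

PERMISSIVE = re.compile(r".+@.+\..+")

# Made-up inputs; only the first is a plausible address.
candidates = ["user@example.com", "a b@c d.e", "@@@.@", "no-at-sign.com"]
permissive_results = {c: bool(PERMISSIVE.fullmatch(c)) for c in candidates}

for candidate, accepted in permissive_results.items():
    print(f"{candidate!r}: {accepted}")
# 'user@example.com': True
# 'a b@c d.e': True
# '@@@.@': True
# 'no-at-sign.com': False
```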
Strict Practical Pattern
A more strict pattern that enforces reasonable constraints:
^[a-zA-Z0-9](?:[a-zA-Z0-9._%+-]{0,63})@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,63}$
This version adds several improvements:
- The local part must start with an alphanumeric character and is capped at 64 characters
- Domain labels must start and end with alphanumeric characters
- Maximum label lengths are enforced (63 characters per RFC 1035)
- The TLD is limited to 63 characters
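The constraints listed above can be checked with a short sketch. The pattern is copied unchanged from the text; the sample addresses are made up to exercise each rule.

```python
import re

# Strict practical pattern, copied unchanged from the text above.
STRICT = re.compile(
    r"^[a-zA-Z0-9](?:[a-zA-Z0-9._%+-]{0,63})@"
    r"(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+"
    r"[a-zA-Z]{2,63}$"
)

# Sample inputs are made up for illustration.
samples = [
    "user@example.com",            # ordinary address
    ".user@example.com",           # local part starts with a dot
    "user@-example.com",           # label starts with a hyphen
    "user@example-.com",           # label ends with a hyphen
    "user@" + "a" * 63 + ".com",   # 63-character label: allowed
    "user@" + "a" * 64 + ".com",   # 64-character label: too long
]
strict_results = [bool(STRICT.match(s)) for s in samples]
print(strict_results)  # [True, False, False, False, True, False]
```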
RFC 5322 Compliant Pattern
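RFC 5322 defines the full email syntax. A regex that follows it closely is extremely long and complex. The widely circulated version below uses lowercase-only character classes, so it is meant to be applied with the case-insensitive flag: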
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
In practice, almost nobody uses the full RFC 5322 regex. The standard pattern [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} covers the vast majority of real-world email addresses and is far easier to read, maintain, and debug.
Code Examples
JavaScript
Extract all emails from a string
function extractEmails(text) {
  const regex = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;
  const matches = text.match(regex);
  return matches ? [...new Set(matches)] : [];
}
// Example usage
const text = "Contact us at info@example.com or support@my-company.org";
const emails = extractEmails(text);
console.log(emails);
// Output: ["info@example.com", "support@my-company.org"]
Validate a single email address
function isValidEmail(email) {
  const regex = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;
  return regex.test(email);
}
console.log(isValidEmail("user@example.com")); // true
console.log(isValidEmail("invalid@.com")); // false
console.log(isValidEmail("no-at-sign.com")); // false
Python
Extract and validate emails
import re

def extract_emails(text):
    pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(re.findall(pattern, text)))

def is_valid_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))

# Extract emails from text
text = """
Please reach out to sales@example.com for pricing.
Technical support is available at help@support.example.org.
You can also email the CEO directly: ceo@big-company.co.uk
"""
emails = extract_emails(text)
print(emails)
# Output: ['sales@example.com', 'help@support.example.org', 'ceo@big-company.co.uk']
# Validate single email
print(is_valid_email("test@example.com")) # True
print(is_valid_email("not-an-email")) # False
PHP
Extract emails from a string
<?php
function extractEmails(string $text): array {
    $pattern = '/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/';
    preg_match_all($pattern, $text, $matches);
    // array_values reindexes the result after duplicates are removed
    return array_values(array_unique($matches[0]));
}
// Example usage
$text = "Email us at contact@example.com or admin@website.org for help.";
$emails = extractEmails($text);
print_r($emails);
// Output: Array ( [0] => contact@example.com [1] => admin@website.org )
// Validate a single email (PHP also has a built-in function)
$email = "user@example.com";
if (filter_var($email, FILTER_VALIDATE_EMAIL)) {
echo "$email is valid\n";
}
?>
Common Pitfalls and Edge Cases
1. Internationalized Email Addresses (IDN)
The standard regex does not handle internationalized domain names (IDN) or local parts with Unicode characters. Plain ASCII addresses like user@beispiel.de work fine, but a Punycode domain such as user@bsp.xn--e1afmapc or an umlaut in the local part (müller@example.com) will not be matched correctly: the anchored validation pattern rejects them, and the extraction pattern returns only a fragment (user@bsp.xn or ller@example.com). If you need to support these, you must expand the character classes to include Unicode ranges and allow digits and hyphens in the TLD.
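The sketch below shows what the standard pattern does with such addresses and compares it with one possible Unicode-aware variant. The variant is an assumption made for illustration, not an established pattern. It relies on Python's \w, which matches letters and digits in any script plus the underscore for str patterns, so it is more permissive than the original.

```python
import re

STANDARD = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Hypothetical Unicode-aware variant (an assumption, not an established pattern):
# \w covers letters and digits in any script, and the TLD class also accepts
# digits and hyphens so that Punycode labels such as xn--e1afmapc fit.
UNICODE_AWARE = re.compile(r"[\w.%+-]+@[\w.-]+\.[\w-]{2,}")

text = "Write to müller@example.com or user@bsp.xn--e1afmapc today"

standard_hits = STANDARD.findall(text)
unicode_hits = UNICODE_AWARE.findall(text)
print(standard_hits)  # ['ller@example.com', 'user@bsp.xn']
print(unicode_hits)   # ['müller@example.com', 'user@bsp.xn--e1afmapc']
```

Note that the standard pattern does not skip these addresses during extraction. It silently returns truncated fragments, which can be harder to spot than a missing match.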
2. New and Long TLDs
Since the introduction of new generic TLDs (gTLDs), email addresses can end with TLDs like .photography, .technology, or .international. The {2,} quantifier in our pattern handles these correctly. However, older patterns that used {2,4} would fail on these longer TLDs – make sure your regex uses {2,} or at least {2,63}.
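A small sketch of the difference, using a made-up address on a long gTLD. The constant names are illustrative only.

```python
import re

LOCAL_AND_DOMAIN = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\."
OLD_STYLE = re.compile(LOCAL_AND_DOMAIN + r"[a-zA-Z]{2,4}")  # capped TLD length
OPEN_ENDED = re.compile(LOCAL_AND_DOMAIN + r"[a-zA-Z]{2,}")  # standard pattern

address = "hello@studio.photography"  # made-up address on a long gTLD

old_valid = bool(OLD_STYLE.fullmatch(address))
new_valid = bool(OPEN_ENDED.fullmatch(address))
old_extracted = OLD_STYLE.findall("Mail " + address)

print(old_valid)      # False
print(new_valid)      # True
print(old_extracted)  # ['hello@studio.phot']
```

When extracting rather than validating, the {2,4} version does not fail outright. It returns an address cut off after four TLD letters.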
3. Quoted Local Parts
The email specification technically allows quoted strings in the local part, such as "john doe"@example.com or "very.(),:;<>[]".VERY.unusual@example.com. These are valid according to RFC 5322 but are extremely rare in practice. The standard regex does not match them, which is acceptable for most real-world use cases.
4. IP Address Domains
Emails can technically use IP addresses instead of domain names: user@[192.168.1.1] or user@[IPv6:2001:db8::1]. These are valid but almost never seen in practice. The standard regex will not match them.
5. Consecutive Dots
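The standard regex allows consecutive dots in the local part (e.g., user..name@example.com), which is technically invalid per RFC 5321. If you need to reject these when validating a single address, add a negative lookahead at the start of the anchored pattern: ^(?!.*\.\.)[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$. Because .* scans the rest of the input, this form also rejects consecutive dots in the domain and is not suited to extracting addresses from longer text.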
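As a sketch of the lookahead approach for single-address validation, the snippet below places the lookahead right after the start anchor. The function name is hypothetical.

```python
import re

BASE = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

# (?!.*\.\.) fails the match if two consecutive dots appear anywhere in the input.
NO_DOUBLE_DOTS = re.compile(r"^(?!.*\.\.)" + BASE + r"$")

def is_valid_without_double_dots(email):
    """Hypothetical helper: standard check plus a consecutive-dot rejection."""
    return bool(NO_DOUBLE_DOTS.match(email))

print(is_valid_without_double_dots("user.name@example.com"))   # True
print(is_valid_without_double_dots("user..name@example.com"))  # False
print(is_valid_without_double_dots("user@example..com"))       # False
```

Anchoring the lookahead at the start means it inspects the whole input, so this form is meant for validating one address at a time rather than scanning free text.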
6. Trailing Periods in Extracted Text
When extracting emails from natural text, sentences like "Contact us at info@example.com." can result in the trailing period being captured as part of the email. Our regex handles this correctly because \.[a-zA-Z]{2,} requires at least two letters after the final dot, so a trailing period followed by a space or end of sentence will not be included.
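A short sketch of this behavior, with made-up sentences. The second case shows a limit of the approach: if the next sentence starts without a space after the period, the following word is read as the TLD.

```python
import re

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# The trailing period is followed by a space, so it cannot start a TLD.
with_space = EMAIL.findall("Contact us at info@example.com. We reply within a day.")
print(with_space)  # ['info@example.com']

# Without the space, the next word is captured as if it were the TLD.
without_space = EMAIL.findall("Contact us at info@example.com.Thanks")
print(without_space)  # ['info@example.com.Thanks']
```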
When to Use Regex vs. Built-in Validation
Many programming languages offer built-in email validation that is more robust than a custom regex:
- PHP: filter_var($email, FILTER_VALIDATE_EMAIL)
- Python: The email-validator library provides thorough validation including DNS checks
- JavaScript: HTML5 <input type="email"> provides browser-native validation
- .NET: System.Net.Mail.MailAddress parses and validates emails
Use regex when you need to extract emails from unstructured text. Use built-in validators when you need to validate a single email address from a form field. For extraction tasks, regex is the clear winner because built-in validators only check one address at a time and cannot scan through text.
Performance Tips
When processing very large amounts of text (megabytes of data), keep these performance tips in mind:
- Compile the regex: In languages like Python, use re.compile() to pre-compile the pattern if you are using it repeatedly
- Use the global flag: In JavaScript, always include the g flag to find all matches, not just the first one
- Avoid backtracking: The standard email regex is efficient and does not cause catastrophic backtracking, but overly complex patterns can
- Deduplicate results: Use a Set (JavaScript) or set (Python) to efficiently remove duplicate email addresses from your results
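The compile and deduplicate tips can be combined in a sketch like the one below. The function name and the line-by-line input are assumptions made for illustration. Processing an iterable of lines (such as an open file) avoids holding the whole text in memory at once.

```python
import re

# Compiled once at import time and reused for every line.
EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def extract_unique_emails(lines):
    """Hypothetical helper: return unique addresses in first-seen order."""
    seen = set()   # fast membership checks for deduplication
    unique = []    # keeps the order in which addresses first appear
    for line in lines:
        for match in EMAIL.finditer(line):
            address = match.group(0)
            if address not in seen:
                seen.add(address)
                unique.append(address)
    return unique

# Made-up sample lines; an open file object could be passed instead.
sample_lines = [
    "sales@example.com wrote to help@example.org",
    "Reply from sales@example.com received",
]
unique_emails = extract_unique_emails(sample_lines)
print(unique_emails)  # ['sales@example.com', 'help@example.org']
```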