ElixirPhpEmailValidator
A bug-for-bug compatible Elixir port of PHP's
filter_var($email, FILTER_VALIDATE_EMAIL).
It answers one question — "would PHP accept this string as an email address?" —
and returns the exact sametrue/false verdict PHP does, quirks and all.
ElixirPhpEmailValidator.valid?("a@b.c") #=> true
ElixirPhpEmailValidator.valid?("a@b") #=> false # PHP rejects: no TLD
ElixirPhpEmailValidator.valid?("user@1.2.3.4") #=> false # PHP rejects bare IPs
ElixirPhpEmailValidator.valid?("user@[1.2.3.4]") #=> true # …but accepts bracketed ones
ElixirPhpEmailValidator.valid?(~s("a b"@c.de)) #=> false # bare space illegal even quoted
ElixirPhpEmailValidator.valid?("日本語@example.com") #=> false
ElixirPhpEmailValidator.valid?("日本語@example.com", unicode: true) #=> true
This is not an RFC validator, and not a "good" email validator. It is a faithful mirror of a specific, widely-deployed C function. If your goal is to match what a PHP backend (or anything else built on
filter_var) accepts — e.g. for migrating a system, sharing validation rules across a polyglot stack, or reproducing a bug — this gives you byte-identical behaviour. If you just want "a reasonable email check", a normal regex orEctoCommons.EmailValidatoris probably what you want instead.
Installation
def deps do
[{:elixir_php_email_validator, "~> 0.1"}]
end
Zero runtime dependencies. Requires Elixir ~> 1.15. OTP 28+ is recommended
for guaranteed Unicode-mode (FILTER_FLAG_EMAIL_UNICODE) parity — OTP 28 is the
first release whose :re is backed by PCRE2, the same engine family PHP uses.
The default (ASCII) mode is engine-independent. CI validates the OTP matrix.
API
# boolean
ElixirPhpEmailValidator.valid?(email) # default flags
ElixirPhpEmailValidator.valid?(email, unicode: true) # FILTER_FLAG_EMAIL_UNICODE
# PHP-style return: {:ok, email} on success (PHP returns the string), :error on failure
ElixirPhpEmailValidator.validate(email) #=> {:ok, "a@b.c"} | :error
# provenance of the vendored PHP regex (php-src ref, checksums, etc.)
ElixirPhpEmailValidator.source_info()
:unicode mirrors PHP's FILTER_FLAG_EMAIL_UNICODE: it permits Unicode letters
and numbers in the local part only; the domain always stays ASCII.
The quirks (the whole point)
PHP's validator has well-known surprises. This library reproduces every one:
| Input | valid? | Why |
|---|---|---|
a@b.c | true | a single-character TLD is fine |
a@b | false | the domain needs a dot / a TLD label |
user@localhost | false | single-label domains are rejected |
user@1.2.3.4 | false | a bare IP is not a valid domain |
user@[1.2.3.4] | true | …but a bracketed IP literal is |
a@[IPv6:::1] | true | bracketed IPv6 literals too |
"a b"@c.de | false | a bare space is illegal even inside quotes |
much."more\ unusual"@x.com | true | …but a backslash-escaped space is legal |
"a@b"@c.de | true | @ is fine inside a quoted local part |
a@b.123 | false | an all-numeric TLD is rejected |
a@b--c.com | true | consecutive hyphens in a label are allowed |
a@b.com. | false | a trailing dot is rejected |
USER@EXAMPLE.COM | true | matching is case-insensitive |
"a\x07b"@x.com | true | some control bytes (0x01–0x08, 0x0B, 0x0C) are legal in quotes… |
"a\tb"@x.com | false | …but TAB (0x09), LF, CR are not |
a@b.c\n | false | a trailing newline is rejected (the D flag) |
| 255-byte address | false | the regex caps total length at 254, not 320 (see below) |
😀@x.com | false (even unicode: true) | an emoji is a symbol, not a letter/number |
A full, categorized catalog of 136 ground-truthed cases lives in
docs/research/04-quirks-catalog.md, and
every one of them is an executed test case.
How it works (and why it is faithful)
PHP implements FILTER_VALIDATE_EMAIL in C, in
ext/filter/logical_filters.c → php_filter_validate_email.
The algorithm is small:
- Reject any input longer than 320 bytes.
- Match against one anchored PCRE regex with the inline flags
i(caseless) andD(dollar-end-only). There are two regexes: an ASCII one by default, and a Unicode-aware one (adding\pL\pNto the local part, plus theuflag) used only withFILTER_FLAG_EMAIL_UNICODE.
This library vendors PHP's exact regex strings, byte-for-byte (see
priv/php/ and the checksums in
priv/php/MANIFEST.json) and performs only a
mechanical flag translation:
| PHP inline flag | Erlang :re option |
|---|---|
i | :caseless |
D (PCRE_DOLLAR_ENDONLY) | :dollar_endonly |
u (PCRE_UTF8 + UCP) | :unicode, :ucp |
Both PHP (PCRE2) and Erlang's :re are PCRE-family engines, so the same pattern
yields the same matches. The library matches the raw bytes of the input
(exactly as PHP's pcre2_match does), enforces the 320-byte gate in Elixir, and
treats invalid-UTF-8 input on the unicode path as a non-match (which is what PHP
does when PCRE reports a UTF-8 error).
There is no hand-written validation logic to get subtly wrong: the regex is
PHP's regex. See COMPATIBILITY.md for the full argument.
Performance & untrusted input
Validation is a single anchored PCRE match — instant for real addresses. But the
vendored pattern (PHP's own) can be driven into heavy backtracking by a
deliberately crafted input, even a short one: a worst-case string costs ~200 ms
of CPU inside the non-yielding :re NIF before the engine aborts the match. The
verdict stays correct (it's rejected, exactly as PHP rejects it) — the cost is
pure CPU, and the 320-byte limit caps input length, not per-call CPU.
So if you validate untrusted input at high volume, run validation off your
request-serving schedulers (e.g. a supervised Task) and/or rate-limit upstream.
The library deliberately does not cap PCRE's steps itself: a cap tight enough
to matter would also reject some legitimate addresses PHP accepts — breaking the
parity that is the whole point. See the module docs for detail.
Verify it yourself
Correctness here is demonstrated, not asserted. Three layers of tests:
mix test # golden regression — no PHP needed
mix test --include php # + live parity & differential fuzz vs your local php
mix php.test # regenerate the golden from your php, then run everything
mix testasserts the library reproduces, for ~165 corpus inputs, the verdicts recorded from real PHP intest/fixtures/golden/php-*.tsv. This runs anywhere, even with no PHP installed — the golden files are PHP's frozen testimony.mix test --include phpadditionally re-derives the verdicts from your installedphpand runs a differential fuzzer (thousands of structured + random byte strings diffed againstfilter_var), catching any disagreement a hand-written corpus would miss.- CI (
.github/workflows/ci.yml) runs the live suite against every currently-supported PHP version (auto-discovered from endoflife.date, so new releases are tested and EOL ones drop off), proving parity per version. It also runs weekly and watches the newest stable Elixir (a canary) and the newest stable OTP (a differential-vs-PHP watch job), so drift from a new PHP, Elixir, or OTP release surfaces on its own.
How you'll know if PHP changes
mix php.drift # fetch php-src for 8.1–8.5, re-extract the regex, diff vs vendored
This is the early-warning system. It re-fetches logical_filters.c from php-src
for each supported release, re-extracts the two regexes, and fails if they differ
from the vendored copy. A scheduled CI job
(.github/workflows/drift.yml) runs it weekly, so
if PHP ever edits the validator you find out immediately. To adopt an upstream
change: mix php.extract <ref> re-vendors the regex, then mix php.test confirms
(or surfaces) any behavioural difference.
Which PHP does this match?
The validator's regex and the 320-byte check are byte-identical across PHP 8.1,
8.2, 8.3, 8.4, 8.5 and master (and unchanged upstream since 2011/2015). So a
single vendored copy is correct for the entire currently-supported PHP range
simultaneously. The provenance and per-version checksum evidence are in
docs/research/01-version-matrix.md.
Documentation & research
COMPATIBILITY.md— the correctness argument in full.docs/research/— the underlying research dossier: PHP source analysis, cross-version proof, the Elixir-ecosystem survey (why nothing else is equivalent), the 136-case quirks catalog, and the PCRE/packaging notes.
Contributing & releases
This project releases itself: commits to main using
Conventional Commits drive automatic
version bumps, changelog updates, tags, and publishing to Hex via GitHub Actions
(no credential is ever exposed). See RELEASING.md for the
pipeline and the one-time setup. The vendored regex is kept honest by a weekly
drift detector against php-src.
License & attribution
Copyright © 2026 Daniel Dent. The Elixir code, tests,
tooling, and docs are MIT-licensed (see LICENSE). The vendored regex derives from Michael
Rushton's validator and is redistributed from PHP; its copyright notice is
preserved in NOTICE, as its terms require.