Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: improve email regex #180

Merged
merged 4 commits into from Oct 6, 2023

Conversation

kazizi55
Copy link
Contributor

Improves email regex so that all tests (including commented-out tests) in library/src/validations/email/email.test.ts pass.

To achieve this improvement, I changed the regex logic from deny list strategy to allow list strategy, because with deny list strategy, special characters (such as Hiragana, Chinese characters, Hangul characters and so on) are so difficult to validate. I think they are too many to list.

The original allow list strategy regex logic is copied from zod, and I modified a littile bit.
https://github.com/colinhacks/zod/blob/f59be093ec21430d9f32bbcb628d7e39116adf34/src/types.ts#L567-L568

@netlify
Copy link

netlify bot commented Sep 23, 2023

Deploy Preview for valibot ready!

Name Link
🔨 Latest commit 2bb7428
🔍 Latest deploy log https://app.netlify.com/sites/valibot/deploys/651f97337d84ce0008d3d8e4
😎 Deploy Preview https://deploy-preview-180--valibot.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@kazizi55 kazizi55 marked this pull request as ready for review September 23, 2023 10:15
@FleetAdmiralJakob
Copy link

Looks cool

@fabian-hiller
Copy link
Owner

Thanks for your contribution! Can you explain with examples what your changes to the regex do and why they are necessary?

@fabian-hiller fabian-hiller self-assigned this Sep 23, 2023
@fabian-hiller fabian-hiller added the enhancement New feature or request label Sep 23, 2023
@fabian-hiller
Copy link
Owner

As soon as I get a response from you, I will review the changes and if everything fits, I will merge it.

@kazizi55
Copy link
Contributor Author

@fabian-hiller
Thank you for your quick response, and I apologize for my lack of explanation 🙏

why they are necessary?

Just because you put FIXME comment in library/src/validations/email/email.test.ts.
https://github.com/fabian-hiller/valibot/pull/180/files#diff-5749fd906e956e45be3645c6d1f4a32b5872bd88160a7364b775c6feb0f2d61eL43-L47

I just wanted to remove these FIXME comments, so I fixed the regex.

Can you explain with examples what your changes to the regex

The email regex of valibot (^(([^<>()[\].,;:\s@"]+(\.[^<>()[\].,;:\s@"]+)*)|(".+"))@(([^<>()[\].,;:\s@"]+\.)+[^<>()[\].,;:\s@"]{2,})$/i) is now like this:

  • user name
    • There are 2 patterns
      • one
        • The characters below are prohibited at the beginning of user name
          • <>()[\].,;:\s@"
        • The characters below are prohibited in user name
          • <>()[\],;:\s@"
            • the difference between this and above characters is .. Only . can be used in user name.
      • the other
        • If quoted by double quotation marks, everything is valid
  • domain name
    • The characters below are prohibited before .
      • <>()[\].,;:\s@"
    • Except prohibited characters, characters + . pattern can be repeated
    • The characters below are prohibited after .
      • <>()[\].,;:\s@"
    • there must be more than 2 characters after .

And I modified the email regex (^(“|[A-Z0-9_+-]+\.?)*[A-Z0-9_+-]”?@(([A-Z0-9][A-Z0-9-]*\.)+[A-Z]{2,}|([0-9]{3}\.[0-9]{3}\.[0-9]{3}\.[0-9]{3}))) like this:
(Since I copied the logic from Zod, I quoted the sentences from colinhack's blog post)

  • user name
    • can use these characters: [a-zA-Z0-9-._]
    • can't have two dots in a row
    • can't start or end with a dot
    • can use left and right double quotation marks
      • I added this logic to Zod logic so that this test can be passed
  • domain name
    • There are 2 patterns
      • one
        • can use these characters: [a-zA-Z0-9-.]
        • must have one or more dots in it
        • can't have two dots in a row
        • can't start or end with a dot
        • can't start or end with a hyphen
        • must end with a "TLD" that's 2+ letters, only [a-zA-Z] allowed
      • the other
        • can use 3 digits * 4 pattern
          • I added this log to Zod logic, so that this test can be passed

Please let me know if I can give you any other information 🙏

@fabian-hiller
Copy link
Owner

Thank you for the details. I will try to review the PR over the weekend.

If quoted by double quotation marks, everything is valid

Why is this necessary?

@kazizi55
Copy link
Contributor Author

@fabian-hiller
Thank you for checking 🙏

If quoted by double quotation marks, everything is valid
Why is this necessary?

This is the regex valibot has now, and I don't think it's necessary.
So I deleted this logic in this PR.

@fabian-hiller
Copy link
Owner

Zod has recently changed the email regex, we should take a look at that. Maybe the PR at Zod is similar to this one. Is there anything against using the regex from Zod?

@kurtextrem
Copy link
Contributor

kurtextrem commented Oct 4, 2023

My humble opinion: colinhacks/zod#2157 introduced the vulnerable regex. The zod regex is not battle-tested apart from zod users. I realize zod is popular, but email regexes have been around since ages. I get the motivation behind self-writing the regex to be simpler, but why re-invent the wheel?

^[\w!#$%&'*+/=?`{|}~^-]+(?:\.[\w!#$%&'*+/=?`{|}~^-]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$`

We get a more practical implementation of RFC 5322 if we omit IP addresses, domain-specific addresses, the syntax using double quotes and square brackets. It will still match 99.99% of all email addresses in actual use today.

\A[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\z

And last but not least StackOverflow where all the people will come to comment why the regex in the answer does not fit the spec: https://stackoverflow.com/questions/201323/how-can-i-validate-an-email-address-using-a-regular-expression. I want to highlight the HTML5 spec regex, browsers have implemented:

^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$

So if you validate a <input type=email field, only that regex will reflect the validation of that input (see https://rgxdb.com/r/5PUY334X). And I think, especially looking back at zod#2157, makes the most sense: valibot is probably used for validating such inputs. Someone in the comments of the Zod PR also mentioned the HTML5 spec regex, so I'm guessing it was an oversight to stick with the newly crafted one. The PR also mentions why the zod regex is bad: emails like O'railey@mydomain.com will validate to false in zod.

@fabian-hiller
Copy link
Owner

Thank you a lot for this hint! I will think about it and investigate the official HTML email regex.

@fabian-hiller
Copy link
Owner

I have now investigated it and decided against the official HTML regex, because with this one the following email addresses are valid:

  • .email@example.com
  • email.@example.com
  • email..email@example.com
  • email@example

I don't think this makes sense in practice. I like the regex of this PR. I would only remove the extensions you added as they are not needed 99% of the time.

@fabian-hiller fabian-hiller merged commit 3ec0c7f into fabian-hiller:main Oct 6, 2023
5 checks passed
@kurtextrem
Copy link
Contributor

kurtextrem commented Oct 6, 2023

@fabian-hiller you introduced an exponential regex (which is why the 'extensions' are needed), which can be abused for a DoS: https://devina.io/redos-checker
image

I want to emphasize again why it does not make sense to use a handwritten regex. It might be shorter, saving a few bytes, but at what expense?

  • Valid emails such as O'Reilly@gmail.com are not allowed
    image

  • Valid inputs from browsers are rejected server side, leading to a confusing behaviour

    • I get your point of invalid emails being valid, but Chromium and other big vendors see this as ok - why don't we?
    • What about introducing 2 regexes, one that fits input[email] validation and one that is closer to the spec (disallowing some of the invalid emails)

I really want to emphasize my main point again: Why hand roll a Regex, when two or even 3 trustable resources crafted a relatively short regex already, O'Reilly, Regular-Expressions.Info and even the spec authors.

/^[\w!#$%&'*+/=?`{|}~^-]+(?:\.[\w!#$%&'*+/=?`{|}~^-]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$/ // O'Reilly
/^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/ // web spec

When checking emails, please always think about: regexes don't send emails.
Regular-Expressions.info also lists regexes that don't capture ' emails:

An important trade-off in all these regexes is that they only allow English letters, digits, and the most commonly used special symbols. The main reason is that I don’t trust all my email software to be able to handle much else. Even though John.O'Hara@theoharas.com is a syntactically valid email address, there’s a risk that some software will misinterpret the apostrophe as a delimiting quote. Blindly inserting this email address into an SQL query, for example, will at best cause it to fail when strings are delimited with single quotes and at worst open your site up to SQL injection attacks.

However, I do think this is a teaching issue, not something validators should discriminate (e.g. teach that an email could contain ').

@kurtextrem
Copy link
Contributor

A small additional thing: valibot does not seem like it's using any eslint plugins. I'd strongly recommend using https://ota-meshi.github.io/eslint-plugin-regexp/rules/ as a lot can go wrong with regexes; that plugin has rules catching such DoS vulnerable regexes. And maybe also https://github.com/eslint-community/eslint-plugin-security for general security stuff that should not make it into a validator lib.

@fabian-hiller
Copy link
Owner

Thanks for the hint. I had forgotten to check that for this regex. Then I currently tend to use the new regex from Zod. I understand your point, but I think that in 99% of cases this regex does what the user expects. In special cases, for example, the w3.org regex can be used with our regex function:

const EmailSchema = string([regex(/^[a-zA-Z0-9.!#$%&’*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/)]);

I will have a look at the ESLint plugins in the next days. If you like you can also create a PR.

fabian-hiller added a commit that referenced this pull request Oct 6, 2023
Co-authored-by: Macs Dickinson <MacsDickinson@users.noreply.github.com>
@fabian-hiller
Copy link
Owner

If we come up with a good name, I can also imagine adding another email validation function with an official regex. The question is whether to use w3.org or whatwg.org. Do you have an opinion? As a name we could call the function specEmail. Do you have ideas for other names?

@kurtextrem
Copy link
Contributor

kurtextrem commented Oct 6, 2023

I think that in 99% of cases this regex does what the user expects

But what do users expect? The examples you've listed above are covered by the Regex book from O'Reilly, too. And the regex could be modified to prevent ', if that is your concern. The difference of both regexes is 5 bytes, and the O'Reilly regex has no negative lookaheads, which leads to almost 50% higher performance: https://esbench.com/bench/652054377ff73700a4deba38 (latest chrome dev v120, V8 11.9.146.1)

image

I really see literally no advantage in using that regex.

I will have a look at the ESLint plugins in the next days. If you like you can also create a PR.

Sure. Do you have any other eslint plugins in mind you'd like to use or just the regexps one (and maybe the security one)?

As a name we could call the function specEmail. Do you have ideas for other names?

specEmail + documenting it's the regex from input[type=email] sounds good to me!

@fabian-hiller
Copy link
Owner

Thanks again for this insight. I will review the O'Reilly regex in the next days and may switch to it. Would you still add a specEmail function then? And with input[type=email] you mean the shorter W3C regex or?

I am only familiar with the basics of linting plugins, so I don't have an opinion at the moment. You are welcome to make a suggestion with a PR.

@fabian-hiller
Copy link
Owner

I can imagine using the regex from O'Reilly with a few changes.

/^[\w!#$%&'*+/=?`{|}~^-]+(?:\.[\w!#$%&'*+/=?`{|}~^-]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$/i

I would limit the special characters !#$%&'*+/=?`{|}~^- to meaningful characters +-.

/^[\w+-]+(?:\.[\w+-]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$/i

Also, I would prefer to prevent a domain name from starting and ending with a hyphen.

/^[\w+-]+(?:\.[\w+-]+)*@[A-Z0-9]+(?:(?:\.|-)[A-Z0-9]+)*\.[A-Z]{2,6}$/i

Then I would remove the restriction on the maximum length of the domain, because as far as I know there are now TLDs with more characters.

/^[\w+-]+(?:\.[\w+-]+)*@[A-Z0-9]+(?:(?:\.|-)[A-Z0-9]+)*\.[A-Z]{2,}$/i

Feel free to share your thoughts on this.

@kurtextrem
Copy link
Contributor

kurtextrem commented Oct 7, 2023

Also, I would prefer to prevent a domain name from starting and ending with a hyphen.

Fully agree, also a nice catch!

Then I would remove the restriction on the maximum length of the domain, because as far as I know there are now TLDs with more characters.

Great catch: https://developer.mozilla.org/en-US/docs/Learn/Common_questions/Web_mechanics/What_is_a_domain_name -> 63 is max length:

/^[\w+-]+(?:\.[\w+-]+)*@[A-Z0-9]+(?:(?:\.|-)[A-Z0-9]+)*\.[A-Z]{2,63}$/i

by using regexp-tree optimizer, I got it to:

/^[\w+-]+(?:\.[\w+-]+)*@[\da-z]+(?:[.-][\da-z]+)*\.[a-z]{2,63}$/i

(I was curious if using the i mod makes the regex slower, faster or equal compared to typing out [a-zA-Z] and the result is equal, including regex parsing cost)

Here is the full list of TLDs: https://data.iana.org/TLD/tlds-alpha-by-domain.txt, looks like it also contains XN--11B4C3D (कॉम]), but I'm not sure if those domains are really spread (I'd say we can ignore it, but we should be aware that we do)

Would you still add a specEmail function then? And with input[type=email] you mean the shorter W3C regex or?

this one is the w3c regex:
/^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/

I still see value in adding a regex that allows more emails, because some people might not insert emails into SQL statements, but rather forward it to a mail client. Imho it's up to a mail client to reject wrong emails if a browser also accepts them. However, we could also add the full W3C regex instead (if we find one that is working with JS). There might be downsides of it:

The reason you shouldn’t use this regex is that it is overly broad. Your application may not be able to handle all email addresses this regex allows. Domain-specific routing addresses can contain non-printable ASCII control characters, which can cause trouble if your application needs to display addresses. Not all applications support the syntax for the local part using double quotes or square brackets. In fact, RFC 5322 itself marks the notation using square brackets as obsolete.

So... maybe a trade-off would be using O'Reilly before and the spec part after @ ? e.g.

/^[\w!#$%&'*+/=?^{|}~-]+(?:\.[\w!#$%&'*+/=?^{|}~-]+)*@[\da-z](?:[\da-z-]{0,61}[\da-z])?(?:\.[\da-z](?:[\da-z-]{0,61}[\da-z])?)*$/i

That also captures abc@def.xn--11b4c3d and disallows emails starting with ., and domains starting with -. So we've successfully reduced false-positives coming in from browsers and also are closer to the spec as people can use O'Reilly@abc.XN--11B4C3D as email.

By the way, I think we really got a good regex now!

@fabian-hiller
Copy link
Owner

Thanks again for your research. I like your result! Do you want to create a PR or should I commit the regex directly with you as co-author?

(I was curious if using the i mod makes the regex slower, faster or equal compared to typing out [a-zA-Z] and the result is equal, including regex parsing cost)

Haha, I also compared that 😅

For specEmail, I recommend creating an issue or PR with all the information and waiting for others to respond to see if there is interest in it before we merge it.

@kurtextrem
Copy link
Contributor

Thanks again for your research. I like your result! Do you want to create a PR or should I commit the regex directly with you as co-author?

Co-author is perfectly fine for me

For specEmail, I recommend creating an issue or PR with all the information and waiting for others to respond to see if there is interest in it before we merge it.

will do

fabian-hiller added a commit that referenced this pull request Oct 8, 2023
Co-authored-by: Jacob Groß <kurtextrem@users.noreply.github.com>
@fabian-hiller
Copy link
Owner

The regex will go live with the next update. For info, I decided not to add the domain length limit, since the specification is so far away from real TLDs that I don't see any advantage in it. It also does not help to limit the total length. You have to use maxLength for that.

@kazizi55
Copy link
Contributor Author

kazizi55 commented Oct 8, 2023

@fabian-hiller @kurtextrem
Thank you very much for improving my email regex!!

@fabian-hiller
And sorry that I couldn't reply to your comments quickly, because I was busy currently 🙏

@fabian-hiller
Copy link
Owner

No problem!

@kurtextrem
Copy link
Contributor

kurtextrem commented Oct 8, 2023

For info, I decided not to add the domain length limit, since the specification is so far away from real TLDs that I don't see any advantage in it. It also does not help to limit the total length. You have to use maxLength for that.

Fine to me, but the max length of the TLD could be checked by the regex (via $ at the end). I think for max length of the email, that's probably another topic (and shouldn't be done by a regex)

Thank you very much for improving my email regex!!

🤝

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants