
Allow [ and ] as URL code-points #753

Open
karwa opened this issue Feb 11, 2023 · 4 comments
Labels
topic: validation Pertaining to the rules for URL writing and validity (as opposed to parsing)

Comments

@karwa
Contributor

karwa commented Feb 11, 2023

Background

RFC-3986 reserves two kinds of delimiters:

  • gen-delims, which are used by the URL syntax itself (things like ? and #), and
  • sub-delims, which are available for use within a URL component to mark out subcomponents.
    For example, & and = are in the sub-delims set, and they are used by query strings to encode key-value pairs.

It's important that we have a set of known subcomponent delimiters, because clients need the assurance that these characters can be used without escaping. An escaped and an unescaped subcomponent delimiter must not be treated as equivalent - for example, percent-escaping a & in a query string would merge adjacent key-value pairs and corrupt the query's meaning.
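For illustration, a minimal sketch with the standard URLSearchParams API:

const plain = new URLSearchParams('a=1&b=2');
plain.get('b');   // '2' (the unescaped & separates two pairs)

const merged = new URLSearchParams('a=1%26b=2');
merged.get('a');  // '1&b=2' (the escaped & is ordinary data; the pairs have merged)
merged.get('b');  // null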

RFC-3986 also includes [ and ] in the gen-delims set, and does not allow their use anywhere except IP addresses. That document does not explain why these particular characters are forbidden elsewhere.

Its predecessor, RFC-2396, includes these characters in the unwise character set, and may offer more insight into why they must be escaped in URLs:

Other characters are excluded because gateways and other transport
agents are known to sometimes modify such characters, or they are
used as delimiters.

unwise = "{" | "}" | "|" | "" | "^" | "[" | "]" | "`"

Data corresponding to excluded characters must be escaped in order to
be properly represented within a URI.

This escaping means they cannot be used as subcomponent delimiters.

Problem

Despite the above, the URL standard today allows [ and ] to be used unescaped in URL query strings and fragments. They are not URL code-points, but they are tolerated, and have been for such a long time that an ecosystem has emerged which depends on them being available as subcomponent delimiters.

An example of this is the JavaScript qs library (>250m downloads per month), used by popular frameworks such as Express.js. It uses square brackets to denote nesting in key-value pairs:

assert.deepEqual(qs.parse('foo[bar]=baz'), {
    foo: {
        bar: 'baz'
    }
});

and arrays:

var withArray = qs.parse('a[]=b&a[]=c');
assert.deepEqual(withArray, { a: ['b', 'c'] });

Query strings created by this library use percent-encoded brackets by default. This is apparently undesirable, though, so the library added an option to skip percent-encoding key names, and users unhappy with the escaping are encouraged to use it.
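For example, assuming qs's encodeValuesOnly option (presumably the option referred to above; it encodes values but leaves key names, brackets included, unencoded):

var qs = require('qs');

qs.stringify({ a: { b: 'c' } });
// 'a%5Bb%5D=c' (brackets in key names are percent-encoded by default)

qs.stringify({ a: { b: 'c' } }, { encodeValuesOnly: true });
// 'a[b]=c' (key names keep their literal brackets)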

Moreover, the use of an unsanctioned character as a subcomponent delimiter means that brackets in key names are ambiguous:

It uncovers the additional issue that { 'foo[bar]': 'baz' } and { foo: { bar: 'baz' } } both stringify to 'foo%5Bbar%5D=baz'.

And the only way to resolve this ambiguity would be to say that escaped and unescaped square brackets are not necessarily equivalent, as is the case with all other subcomponent delimiters.
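To make the collision concrete, a sketch of the round-trip described in the quote:

var qs = require('qs');

qs.stringify({ 'foo[bar]': 'baz' });    // 'foo%5Bbar%5D=baz'
qs.stringify({ foo: { bar: 'baz' } });  // 'foo%5Bbar%5D=baz' (identical output)
// A parser seeing 'foo%5Bbar%5D=baz' cannot tell which shape the sender meant.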

Proposed Resolution

I believe we should accept that unescaped [ and ] are a de facto part of the web at this point, and include them as valid URL code points. We already allow them to be used without escaping, and developers have been using them unescaped for some time.

Historically, there has never been any conflict with other URL components (indeed, IPv6 addresses now use square brackets) - there was only some concern about colliding delimiters when embedding URLs, but that concern seems to apply equally to regular parentheses (), which are allowed (and are actually used specifically as URL delimiters in Markdown). Ultimately, the issue of colliding delimiters is an issue for the embedding document to solve, not for the embedded content to attempt to second-guess.

Therefore, IMO, the presence of an unescaped square bracket should not be grounds to consider the URL invalid.

@annevk annevk added the topic: validation Pertaining to the rules for URL writing and validity (as opposed to parsing) label Feb 13, 2023
@annevk
Member

annevk commented Feb 13, 2023

Thus far validity has aimed to match RFC 3986 and RFC 3987. (I would be open to allowing all non-ASCII code points equally (minus surrogates), as they are converted to ASCII anyway. That would also bring it more in line with how we deal with non-ASCII generally.)
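For reference, a sketch of that conversion, assuming a WHATWG-conformant URL implementation (e.g. Node.js or a browser):

new URL('https://bücher.example/café?naïve=1').href;
// 'https://xn--bcher-kva.example/caf%C3%A9?na%C3%AFve=1'
// (the host goes through IDNA/punycode; other components are percent-encoded as UTF-8)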

This relates to #379 as well.

I think your rationale is quite sound though and I would be open to making this change. But this would be the first intentional deviation from RFC 3986 validity I think. Would love to hear from others as well.

@TimothyGu
Member

I tested a range of parsers:

  • For non-special schemes (which is where URL code points are used in the parser):
    • No one ever escapes [ and ] in path, query, and fragment
    • No one ever escapes [ and ] in host, but:
      • Only Go, Chrome, Python urlparse accept it
      • Python urllib3, libcurl, and current spec reject it as invalid
      • Node.js' legacy url parser does something weird (inserts a / before the [)
  • For special schemes:
    • Only Python urllib3 escapes [ and ] in path, query, and fragment
      • FWIW this seems to be a relatively new behavior
    • No one ever escapes [ and ] in host, but:
      • Only Go, Chrome, Python urlparse accept it
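For illustration, the current spec's behavior for non-special schemes can be checked against any WHATWG-conformant URL object (e.g. Node.js or a browser); a minimal sketch:

new URL('foo://host/p[a]th?q[uer]y#f[rag]ment').href;
// 'foo://host/p[a]th?q[uer]y#f[rag]ment' (brackets pass through unescaped)

new URL('foo://ho[s]t/path');
// throws TypeError: [ and ] are forbidden host code points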

@karwa
Contributor Author

karwa commented Feb 23, 2023

No one ever escapes [ and ] in host, but:

  • Only Go, Chrome, Python urlparse accept it
  • Python urllib3, libcurl, and current spec reject it as invalid
  • Node.js' legacy url parser does something weird (inserts a / before the [)

Aye, the host behaviour is a bit annoying.

  • Both Go parsers seem to have weird behaviour where they strip square brackets if the entire host is enclosed by a matching pair (so [host] becomes host), but otherwise they keep them (ho[s]t remains as-is). Maybe that's intentional, but honestly it looks more like an oversight.

  • libcurl is overly strict, and rejects plenty of hostnames which RFC-3986 (which I assume it is supposed to follow) would consider valid. For example, 3986's host may be a reg-name, which may contain sub-delims. And yet, it fails to parse a hostname containing the = sign (an allowed sub-delim), while all other parsers allow it.

  • For this standard, even if they were allowed, they would only apply to opaque hostnames (i.e. for non-special schemes), because special schemes must have a domain, IPv4, or IPv6 host.

    They could technically be allowed, and I think there would be value in allowing them. But in order to stop your host from sometimes being interpreted as an IPv6 address, we'd need to either ban matching brackets enclosing the entire address (e.g. [host] would not be allowed, but ho[st] would), or stop parsing IPv6 addresses in opaque hostnames altogether and let the application decide when that interpretation is appropriate. The latter is totally fair - opaque URL components are extremely valuable for encoding custom data. But right now hosts are opaque-except-non-opaque-when-IPv6; maybe we should just let them be truly opaque and drop the weird special case (illustrated below).
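For concreteness, the special case in question, again assuming a WHATWG-conformant URL object:

new URL('foo://[::1]/').host;
// '[::1]' (the would-be opaque host is instead parsed as an IPv6 address)

new URL('foo://[host]/');
// throws TypeError: the enclosing brackets force IPv6 parsing, and 'host' is not a valid IPv6 address

new URL('foo://ho[st]/');
// also throws today: [ and ] are forbidden host code points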

@theking2

Are all proxy servers aware of the mentioned RFC, and do they implement it correctly? I worked on a URL shortener that used '[' and ']' as part of a shortened URL, and some mobile applications cut off the URL at those characters.
