Linkify incorrectly parses array arguments #436

Closed
M1ha-Shvn opened this issue Jan 18, 2019 · 7 comments · Fixed by #722

@M1ha-Shvn

Hi.
Library versions up to 3.1.0 incorrectly parse array and object URL parameters:

# linkifier must be imported too, since URL_RE is referenced through it
from bleach import DEFAULT_CALLBACKS, Linker, linkifier
text = 'http://test.com?array[]=1&params_in[]=2'
linker = Linker(url_re=linkifier.URL_RE, callbacks=DEFAULT_CALLBACKS, skip_tags=None, parse_email=False)
print(linker.linkify(text))
# prints: <a href="http://test.com?array" rel="nofollow">http://test.com?array</a>[]=1¶ms_in[]=2

As you can see, the URL is split at [], losing part of the link.

@willkg
Member

willkg commented Jan 18, 2019

The url matching code is here:

def build_url_re(tlds=TLDS, protocols=html5lib_shim.allowed_protocols):
    """Builds the url regex used by linkifier

    If you want a different set of tlds or allowed protocols, pass those in
    and stomp on the existing ``url_re``::

        from bleach import linkifier
        my_url_re = linkifier.build_url_re(my_tlds_list, my_protocols)
        linker = LinkifyFilter(url_re=my_url_re)
    """
    return re.compile(
        r"""\(* # Match any opening parentheses.
        \b(?<![@.])(?:(?:{0}):/{{0,3}}(?:(?:\w+:)?\w+@)?)? # http://
        ([\w-]+\.)+(?:{1})(?:\:[0-9]+)?(?!\.\w)\b # xx.yy.tld(:##)?
        (?:[/?][^\s\{{\}}\|\\\^\[\]`<>"]*)?
            # /path/zz (excluding "unsafe" chars from RFC 1738,
            # except for # and ~, which happen in practice)
        """.format('|'.join(protocols), '|'.join(tlds)),
        re.IGNORECASE | re.VERBOSE | re.UNICODE)

The comment there suggests that it'll match any characters except those denoted as unsafe in RFC 1738. Unsafe characters like [ and ] need to be encoded, so the match stops at the first [. That's what's going on here.

RFC 3986 updates RFC 1738 and has a set of reserved characters which includes [ and ]. I think we need to update the url regex to follow RFC 3986.
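
To make the effect of that character class concrete, here is a minimal sketch using only the standard `re` module. `HOST`, `PATH_RFC1738`, `PATH_WITH_BRACKETS`, and `first_url` are hypothetical names for illustration; the patterns are simplified stand-ins for the path portion of the regex above, not bleach's full pattern and not the change that eventually fixed this issue.

```python
import re

# Simplified host matcher (assumption: http/https only, no TLD list).
HOST = r"https?://[\w-]+(?:\.[\w-]+)+"

# Path part with the same excluded characters as the regex quoted above
# (written without the doubled braces, since no str.format is involved).
PATH_RFC1738 = r"""(?:[/?][^\s\{\}\|\\\^\[\]`<>"]*)?"""

# Same path part, but with [ and ] no longer excluded.
PATH_WITH_BRACKETS = r"""(?:[/?][^\s\{\}\|\\\^`<>"]*)?"""


def first_url(text, path_pattern):
    """Return the first URL-like match in text, or None."""
    match = re.search(HOST + path_pattern, text)
    return match.group(0) if match else None


text = "http://test.com?array[]=1&params_in[]=2"
strict = first_url(text, PATH_RFC1738)
relaxed = first_url(text, PATH_WITH_BRACKETS)

print(strict)   # the match ends just before the first "["
print(relaxed)  # the match covers the whole query string
```

With the brackets excluded, the match ends at `http://test.com?array`, which is the part that ends up inside the `<a>` tag in the report above.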

I'm pretty busy for a while. I'll accept a PR if someone wants to do one.

@willkg willkg added the linkify label Jan 18, 2019
@mastizada
Contributor

mastizada commented Oct 16, 2019

@willkg Removing the \[\] escape from the regex gave the expected result, but I'm wondering why &params is converted to ¶ms. Is that a bug, or expected behavior for &para?

@filak

filak commented Jun 2, 2022

The &para issue:

from bleach import Linker
linker = Linker()
text = 'http://test.com?&params=2'
print(linker.linkify(text))
# prints: <a href="http://test.com?¶ms=2" rel="nofollow">http://test.com?¶ms=2</a>

I believe this happens somewhere in the BleachHTMLSerializer class:

class BleachHTMLSerializer(HTMLSerializer):
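
For context on where ¶ comes from: HTML5 still recognizes a few legacy character references without a trailing semicolon, and `&para` is one of them, so `&params` can be read as `&para` followed by `ms`. The sketch below reproduces that decoding with the standard library's `html.unescape`. It illustrates the HTML rule only, under the assumption that something similar happens when bleach parses the text; it is not a trace of bleach's actual code path.

```python
import html

raw = "http://test.com?&params=2"

# "&para" is decoded as a legacy entity even without a semicolon,
# so "&params" becomes the pilcrow sign followed by "ms".
decoded = html.unescape(raw)

# Escaping the ampersand first keeps the query string intact
# after one round of entity decoding.
escaped = raw.replace("&", "&amp;")
roundtrip = html.unescape(escaped)

print(decoded)
print(escaped)
print(roundtrip)
```

Escaping `&` as `&amp;` before the text reaches an HTML parser is the usual way to avoid this kind of accidental entity match.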

@willkg
Member

willkg commented Jun 2, 2022

The &para thing should be a separate issue. This issue is covering array arguments.

@filak

filak commented Jun 2, 2022

@willkg I am sorry, please see #670

@willkg
Member

willkg commented Jun 2, 2022

Thank you! I appreciate it!

@willkg
Member

willkg commented Oct 6, 2023

Thank you for writing this up!
