Linkify incorrectly parses array arguments #436

M1ha-Shvn · 2019-01-18T09:29:04Z

Hi.
Library versions up to 3.1.0 incorrectly parse array and object URL parameters:

from bleach import DEFAULT_CALLBACKS, Linker, linkifier
text = 'http://test.com?array[]=1&params_in[]=2'
linker = Linker(url_re=linkifier.URL_RE, callbacks=DEFAULT_CALLBACKS, skip_tags=None, parse_email=False)
print(linker.linkify(text))
# prints: <a href="http://test.com?array" rel="nofollow">http://test.com?array</a>[]=1¶ms_in[]=2

As you can see, the URL is split at [], losing part of the link.

willkg · 2019-01-18T13:55:03Z

The url matching code is here:

bleach/bleach/linkifier.py

Lines 32 to 53 in 2f210e0

    
def build_url_re(tlds=TLDS, protocols=html5lib_shim.allowed_protocols):
    """Builds the url regex used by linkifier

    If you want a different set of tlds or allowed protocols, pass those in
    and stomp on the existing ``url_re``::

        from bleach import linkifier

        my_url_re = linkifier.build_url_re(my_tlds_list, my_protocols)

        linker = LinkifyFilter(url_re=my_url_re)

    """
    return re.compile(
        r"""\(*  # Match any opening parentheses.
        \b(?<![@.])(?:(?:{0}):/{{0,3}}(?:(?:\w+:)?\w+@)?)?  # http://
        ([\w-]+\.)+(?:{1})(?:\:[0-9]+)?(?!\.\w)\b   # xx.yy.tld(:##)?
        (?:[/?][^\s\{{\}}\|\\\^\[\]`<>"]*)?
            # /path/zz (excluding "unsafe" chars from RFC 1738,
            # except for # and ~, which happen in practice)
        """.format('|'.join(protocols), '|'.join(tlds)),
        re.IGNORECASE | re.VERBOSE | re.UNICODE)

The comment there says the regex matches path and query characters except those that RFC 1738 designates as unsafe. Unsafe characters such as [ and ] need to be percent-encoded, so the match stops at the first [. That's what's going on here.
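To illustrate what "need to be encoded" means in practice, here is a minimal sketch that percent-encodes the brackets in the query string before the text is handed to a linkifier. It uses only the standard library; the choice of safe="=&" and the assumption that the query is not already percent-encoded are mine, not something stated in this thread.

```python
from urllib.parse import quote, urlsplit, urlunsplit

url = "http://test.com?array[]=1&params_in[]=2"

# Split the URL so that only the query string is re-encoded.
parts = urlsplit(url)

# Keep "=" and "&" as separators; "[" and "]" become %5B and %5D.
# Assumption: the query is not already percent-encoded, otherwise
# existing "%" characters would be encoded a second time.
encoded_query = quote(parts.query, safe="=&")

encoded_url = urlunsplit(parts._replace(query=encoded_query))
print(encoded_url)  # http://test.com?array%5B%5D=1&params_in%5B%5D=2
```

With the brackets encoded, the URL no longer contains any character from the excluded set, so a regex that follows the RFC 1738 rule can match it in full.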

RFC 3986 updates RFC 1738 and has a set of reserved characters which includes [ and ]. I think we need to update the url regex to RFC 3986.
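A small sketch of the difference, using simplified stand-in patterns rather than bleach's actual regex (HOST, PATH_RFC1738, and PATH_BRACKETS are hypothetical names for illustration). The only change between the two patterns is whether [ and ] are in the excluded character class.

```python
import re

# Simplified stand-ins for illustration only; these are not bleach's patterns.
HOST = r"https?://[\w.-]+"
PATH_RFC1738 = r"""(?:[/?][^\s{}|\\^\[\]`<>"]*)?"""  # excludes [ and ]
PATH_BRACKETS = r"""(?:[/?][^\s{}|\\^`<>"]*)?"""     # allows [ and ]

strict_re = re.compile(HOST + PATH_RFC1738)
relaxed_re = re.compile(HOST + PATH_BRACKETS)

text = "http://test.com?array[]=1&params_in[]=2"

# The strict pattern stops at the first "[", as in the report above.
strict_match = strict_re.match(text).group(0)
# The relaxed pattern consumes the whole query string.
relaxed_match = relaxed_re.match(text).group(0)

print(strict_match)   # http://test.com?array
print(relaxed_match)  # http://test.com?array[]=1&params_in[]=2
```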

I'm pretty busy for a while. I'll accept a PR if someone wants to do one.

mastizada · 2019-10-16T15:35:24Z

@willkg Removing the \[\] escape from the regex gave the expected result, but I'm wondering why &params is converted to ¶ms. Is that a bug, or expected behavior for &para?

filak · 2022-06-02T15:16:29Z

The &para issue

from bleach import Linker
linker = Linker()
text = 'http://test.com?&params=2'
print(linker.linkify(text))
## prints:   <a href="http://test.com?¶ms=2" rel="nofollow">http://test.com?¶ms=2</a>

I believe this happens somewhere in the BleachHTMLSerializer class:

bleach/bleach/html5lib_shim.py

Line 661 in ed06d4e

class BleachHTMLSerializer(HTMLSerializer):
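For context, HTML defines &para as a legacy named character reference that decoders may resolve even without a trailing semicolon. The sketch below reproduces the same substitution with Python's standard html module. Whether this is the mechanism at work inside bleach is an assumption; nothing in this thread confirms it.

```python
import html

query = "http://test.com?&params=2"

# "&para" is resolved without a trailing semicolon, so "&params"
# decodes to the pilcrow sign followed by "ms".
decoded = html.unescape(query)
print(decoded)  # http://test.com?¶ms=2

# Escaping the ampersand first makes the round trip lossless.
escaped = html.escape(query)
roundtrip = html.unescape(escaped)
print(roundtrip)  # http://test.com?&params=2
```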

willkg · 2022-06-02T15:20:19Z

The &para thing should be a separate issue. This issue is covering array arguments.

filak · 2022-06-02T19:59:03Z

@willkg I am sorry, please see #670

willkg · 2022-06-02T20:01:24Z

Thank you! I appreciate it!

willkg · 2023-10-06T18:41:00Z

Thank you for writing this up!

willkg added the linkify label Jan 18, 2019

willkg mentioned this issue Oct 6, 2023

Fix linkify with arrays in querystring (#436) #722

Merged

willkg closed this as completed in #722 Oct 6, 2023

willkg added a commit that referenced this issue Oct 6, 2023

Fix linkify with arrays in querystring (#436)

c4a4eba

willkg added this to the version 6.1.0 milestone Oct 6, 2023
