Open angle bracket '<' with few words after cleaned up if there's no closing bracket #705

Closed
alyohea opened this issue Apr 17, 2023 · 1 comment · Fixed by #721

alyohea commented Apr 17, 2023

Describe the bug

Although #544 was fixed, the underlying issue still persists. It can be reproduced in a different way:

  • Python Version: 3.8.13
  • Bleach Version: 6.0.0

To Reproduce

Steps to reproduce the behavior:

# Fixed!
In [2]: bleach.clean("<random")
Out[2]: '&lt;random'

# Fixed!
In [3]: bleach.clean("random<text")
Out[3]: 'random&lt;text'

# Problem!
In [4]: bleach.clean("<random text")
Out[4]: ''

Expected behavior

In [4]: bleach.clean("<random text")
Out[4]: '&lt;random text'
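
The three inputs above can be captured as a small table-driven check. This is only a sketch: `find_regressions` and `CASES` are hypothetical names, not part of Bleach, and the stand-in cleaner built on `html.escape` exists only so the snippet runs without Bleach installed.

```python
from html import escape

# Inputs and expected outputs taken from this report.
CASES = [
    ("<random", "&lt;random"),
    ("random<text", "random&lt;text"),
    ("<random text", "&lt;random text"),
]


def find_regressions(clean):
    """Return (input, expected, actual) for every case `clean` gets wrong."""
    failures = []
    for text, expected in CASES:
        actual = clean(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


if __name__ == "__main__":
    # Stand-in cleaner (an assumption, not Bleach itself); pass
    # `bleach.clean` instead to check a real installation.
    print(find_regressions(lambda s: escape(s, quote=False)))
```

Passing `bleach.clean` as the argument should return an empty list once the problem case is handled; any remaining entries show which input still differs from the expected output.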

Additional context

This was previously fixed by #667, which handles a < without a matching > when the tokenizer reports it as 'eof-in-tag-name'. In the case above, however, the tokenizer reports EOF in the attribute name ('eof-in-attribute-name'), so that handling is skipped:

392  	        if last_error_token:
393 B->	            if last_error_token["data"] == "eof-in-tag-name":
394  	                # Handle the case where the text being parsed ends with <
395  	                # followed by a series of characters. It's treated as a tag
396  	                # name that abruptly ends, but we should treat that like
397  	                # character data
398  	                yield {
(Pdb) 
399  	                    "type": TAG_TOKEN_TYPE_CHARACTERS,
400  	                    "data": "<" + self.currentToken["name"],
401  	                }
402  	            else:
403  	                yield last_error_token
404  	
405  	    def consumeEntity(self, allowedChar=None, fromAttribute=False):
406  	        # If this tokenizer is set to consume entities, then we can let the
407  	        # superclass do its thing.
408  	        if self.consume_entities:
409  	            return super().consumeEntity(allowedChar, fromAttribute)
(Pdb) last_error_token
{'type': 7, 'data': 'eof-in-attribute-name'}
alyohea added the untriaged (Bug reports that haven't been triaged) label Apr 17, 2023
willkg added this to the version 6.1.0 milestone Sep 18, 2023
willkg added a commit that referenced this issue Oct 6, 2023
This adds handling for two more cases:

1. something like "<word word". This throws an eof-in-attribute-name
   parser error.
2. something like "<word word=word". This throws an
   eof-in-attribute-value-no-quotes error.

Both of these work correctly now.
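
The handling described above might be sketched roughly as follows. This is an illustration, not the code from the referenced commit: `recover`, `EOF_INSIDE_TAG`, and the token-type constants are hypothetical stand-ins, and the sketch assumes the tokenizer can supply the raw source text of the unfinished tag, which the excerpt in the issue does not show.

```python
# Hypothetical stand-ins for the tokenizer's token-type constants.
CHARACTERS = "Characters"
PARSE_ERROR = "ParseError"

# Parse-error codes named in the issue and in the commit message.
EOF_INSIDE_TAG = frozenset({
    "eof-in-tag-name",
    "eof-in-attribute-name",
    "eof-in-attribute-value-no-quotes",
})


def recover(last_error_token, raw_tag_text):
    """Turn an EOF-inside-tag parse error into a character-data token.

    `raw_tag_text` is assumed to be the source text from the opening "<"
    to the end of input. Any other error token is returned unchanged.
    """
    if last_error_token["data"] in EOF_INSIDE_TAG:
        return {"type": CHARACTERS, "data": raw_tag_text}
    return last_error_token
```

Emitting the raw text, rather than "<" plus the tag name alone, matters for the two new cases: by the time the error is raised, part of the input has already been consumed as an attribute name or value, so the tag name by itself would drop the words after the space.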

willkg commented Oct 6, 2023

Thank you for writing this up!
