
Prevent ReDoS in Spanish sentence splitting regex #1084

Merged
merged 3 commits into from Jan 11, 2023

Contributor

@Sjord Sjord commented Oct 12, 2022

In Spanish, questions start with an upside-down question mark:

¿Vos bueno?

This was already handled in the original regex, but the original regex was vulnerable to regular expression denial of service (ReDoS). The new regex matches either a normal end-of-sentence mark optionally followed by a ¿ or ¡, or a ¿ or ¡ on its own. One behavior change is that the normal end-of-sentence mark (.!?;…) now has to come before the ¡ or ¿, but I think this is acceptable.
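
As a rough illustration of the two-alternative shape described above, here is a minimal sketch. The pattern and the `split_sentences` helper are assumptions made for this example and are not copied from the merged change; the actual regex in dateparser may differ.

```python
import re

# Hypothetical pattern, not the one merged in this PR:
#   alternative 1: one or more end-of-sentence marks, optionally followed by ¿ or ¡
#   alternative 2: a ¿ or ¡ on its own
# Each alternative uses simple quantifiers over character classes that do not
# overlap, which leaves the regex engine little room for heavy backtracking.
SPLIT_RE = re.compile(r"[.!?;…]+[¿¡]?|[¿¡]")


def split_sentences(text):
    """Split text on the hypothetical pattern and drop empty pieces."""
    parts = SPLIT_RE.split(text)
    return [p.strip() for p in parts if p.strip()]


sentences = split_sentences("Hola. ¿Vos bueno? ¡Claro!")
print(sentences)
```

With this sketch, a ¿ that directly follows an end-of-sentence mark is consumed by the first alternative, and a ¿ that appears after whitespace or at the start of the text is caught by the second.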

This PR also adds some Spanish test cases. These hit the sentence splitting logic, but the exact result of the splitting is not tested.

Fixes #869

@Sjord Sjord marked this pull request as draft October 12, 2022 12:31
@Sjord Sjord marked this pull request as ready for review October 12, 2022 12:49
@Gallaecio
Member

Closing and reopening to re-trigger CI jobs…

@Gallaecio Gallaecio closed this Jan 11, 2023
@Gallaecio Gallaecio reopened this Jan 11, 2023
Consume whitespace if there is any, but still match if there isn't. This makes the most sense for a \n followed immediately by ¿. It also means we don't have to backtrack when there is no whitespace after a line ending.
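
The difference between required and optional whitespace can be sketched as follows. Both patterns and the `pieces` helper are hypothetical and only meant to show the effect described in the commit message above; they are not the patterns used in dateparser.

```python
import re

# Hypothetical patterns for illustration; neither is copied from dateparser.
REQUIRED_WS = re.compile(r"[.!?;…\n]+\s+[¿¡]?")  # whitespace must follow the ending
OPTIONAL_WS = re.compile(r"[.!?;…\n]+\s*[¿¡]?")  # whitespace is consumed only if present


def pieces(pattern, text):
    """Split text on the pattern and drop empty strings."""
    return [p for p in pattern.split(text) if p]


text = "uno\n¿dos?"  # a newline followed immediately by ¿

required_result = pieces(REQUIRED_WS, text)
optional_result = pieces(OPTIONAL_WS, text)
print(required_result)
print(optional_result)
```

In this sketch the required-whitespace variant finds no split point in the sample text, because nothing separates the newline from the ¿, while the optional variant splits it into two pieces.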
@Sjord
Contributor Author

Sjord commented Jan 11, 2023

I rebased to master.

The previous code wouldn't remove multiple empty strings in a row, because it
modified the list while iterating over it. We use `filter` with the default
identity function instead.
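
A small sketch of the pitfall described above, assuming a loop that calls `list.remove` while iterating. The function names are made up for this example and the previous dateparser code may have looked different.

```python
def remove_empty_in_loop(items):
    """Buggy approach: mutate the list while iterating over it."""
    items = list(items)  # work on a copy so the caller's list is untouched
    for item in items:
        if not item:
            # Removing shifts later elements left, so the iterator
            # skips the element that moves into the current position.
            items.remove(item)
    return items


def remove_empty_with_filter(items):
    """filter(None, ...) keeps only truthy elements, so every empty string is dropped."""
    return list(filter(None, items))


sample = ["a", "", "", "b"]
print(remove_empty_in_loop(sample))      # one empty string survives
print(remove_empty_with_filter(sample))  # no empty strings remain
```

Passing `None` as the first argument makes `filter` use the identity function, so it discards every falsy element in a single pass without mutating the input.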
@serhii73 serhii73 requested a review from wRAR January 11, 2023 16:26
@Gallaecio Gallaecio merged commit 769e4c0 into scrapinghub:master Jan 11, 2023
@serhii73
Collaborator

We have a new release with this PR. Thank you @Sjord

Successfully merging this pull request may close these issues.

SECURITY: bad regex pattern in 'dateparser/languages/locale.py' will cause 'ReDos' security problem.