Prevent ReDoS in Spanish sentence splitting regex #1084

Sjord · 2022-10-12T10:31:07Z

In Spanish, questions start with an inverted question mark (¿):

¿Vos bueno?

This was already handled in the original regex, but that regex was vulnerable to regular expression denial of service (ReDoS). The new regex matches either a normal end-of-sentence, optionally followed by a ¿ or ¡, or a ¿ or ¡ on its own. One behavioral change is that the normal end-of-sentence character (.!?;…) now has to come before the ¡ or ¿, but I think this is acceptable.
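
To illustrate the two-branch structure described above, here is a minimal sketch. The pattern and the `split_sentences` helper are hypothetical and written only for this example; they are not the regex merged in this PR. The point is that neither branch nests one quantifier inside another, so there is little room for the regex engine to backtrack.

```python
import re

# Hypothetical pattern for illustration only; NOT the regex from the PR.
# Branch 1: one or more end-of-sentence characters, optional whitespace,
#           then optionally one or more inverted marks.
# Branch 2: one or more inverted marks on their own.
SPLIT_RE = re.compile(r"[.!?;…\r\n]+\s*[¿¡]*|[¿¡]+")


def split_sentences(text):
    """Split text on the pattern above and drop empty pieces."""
    return [part for part in SPLIT_RE.split(text) if part]


print(split_sentences("Hola. ¿Vos bueno? ¡Sí!"))
print(split_sentences("¿Vos bueno?"))
```

In this sketch the first call splits on `. ¿`, `? ¡` and the final `!`, while the second call relies on the stand-alone `¿` branch because no end-of-sentence character precedes it.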

This PR also adds some Spanish test cases. These exercise the sentence splitting logic, but they do not check the exact result of the splitting.

Fixes #869

Gallaecio · 2023-01-11T13:33:14Z

Closing and reopening to re-trigger CI jobs…

Sjord · 2023-01-11T14:57:05Z

I rebased to master.

serhii73 · 2023-01-12T09:18:37Z

We have a new release with this PR. Thank you @Sjord

Sjord marked this pull request as draft October 12, 2022 12:31

Sjord marked this pull request as ready for review October 12, 2022 12:49

Gallaecio approved these changes Jan 11, 2023

Gallaecio closed this Jan 11, 2023

Gallaecio reopened this Jan 11, 2023

Sjord added 2 commits January 11, 2023 15:56

Don't require whitespace.

ef78400

Consume whitespace if there is any, but still match if there isn't. This makes most sense for \n followed immediately by ¿. This also means we don't have to backtrack if there isn't any whitespace after a line ending.
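
The effect of making the whitespace optional can be sketched as follows. Both patterns are hypothetical and simplified for illustration; they are not the ones in the commit. The only difference between them is `\s+` (whitespace required) versus `\s*` (whitespace optional).

```python
import re

# Hypothetical, simplified patterns; NOT the ones from the commit.
END = r"[.!?;…\r\n]+"
REQUIRED = re.compile(END + r"\s+[¿¡]?")  # whitespace is mandatory
OPTIONAL = re.compile(END + r"\s*[¿¡]?")  # whitespace is consumed if present

# A newline followed immediately by an inverted question mark.
text = "Primera línea\n¿Segunda línea?"

required_parts = REQUIRED.split(text)
optional_parts = [part for part in OPTIONAL.split(text) if part]

print(required_parts)
print(optional_parts)
```

With the mandatory-whitespace variant, the newline is consumed by the end-of-sentence class and nothing is left to satisfy `\s+`, so the text is not split at all. The optional variant matches `\n¿` directly.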

Sjord force-pushed the fix-spanish-regexdos branch from fe59143 to ef78400 January 11, 2023 14:56

Correctly filter out empty sentences

7a63e6e

The previous code wouldn't remove multiple empty strings in a row, due to modifying the list during the loop. We use `filter` with the default identity function instead.
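
A minimal sketch of the problem described above, using hypothetical function names. The buggy version is a reconstruction of the general pattern (removing items from a list while iterating over it), not the exact code that was replaced.

```python
def remove_empty_buggy(sentences):
    # Hypothetical reconstruction of the bug: removing an item while
    # iterating shifts the remaining items left, so the element right
    # after each removed one is skipped.
    for sentence in sentences:
        if not sentence:
            sentences.remove(sentence)
    return sentences


def remove_empty(sentences):
    # filter(None, ...) uses the identity function as the predicate,
    # so every falsy item (including "") is dropped.
    return list(filter(None, sentences))


print(remove_empty_buggy(["a", "", "", "b"]))
print(remove_empty(["a", "", "", "b"]))
```

In the buggy version, one of the two consecutive empty strings survives because the loop index advances past it after the first removal. The `filter` version builds a new list and does not mutate its input.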

serhii73 approved these changes Jan 11, 2023

serhii73 requested a review from wRAR January 11, 2023 16:26

wRAR approved these changes Jan 11, 2023

Gallaecio merged commit 769e4c0 into scrapinghub:master Jan 11, 2023

dotlambda added a commit to dotlambda/nixpkgs that referenced this pull request Jan 24, 2023

python310Packages.dateparser: patch ReDoS

5b6aff4

apply scrapinghub/dateparser#1084

dotlambda mentioned this pull request Jan 24, 2023

[22.11] python310Packages.dateparser: patch ReDoS NixOS/nixpkgs#212352

Merged
