Parser may be interpreting this string incorrectly #1029

Closed
jeffka11 opened this issue Apr 10, 2020 · 2 comments · Fixed by #1056

@jeffka11

jeffka11 commented Apr 10, 2020

The output below is from Python 3.8 with dateutil 2.8.1:

>>> from dateutil import parser
>>> parser.parse("£14.99 (25% off, until April 20)", fuzzy_with_tokens=True)
(datetime.datetime(2014, 4, 25, 20, 0), ('£', '(', '% off, until ', ' ', ')'))

I understand "April 20" is ambiguous, so some assumptions have to be made. Removing %, £, 14, .99, or 25, alone or in various combinations, still leaves some form of 14, 99, or 25 in the parsed date. The input below retains the most detail I can keep while still getting an answer based on a valid assumption.

>>> parser.parse("£. (% off, until April 20)", fuzzy_with_tokens=True)
(datetime.datetime(2020, 4, 20, 0, 0), ('£. (% off, until ', ' ', ')'))

Is dateutil expected to parse that string? Do you have any suggestions on how I should parse it? My first thought is to filter out regexes like [0-9]+% and [$|£]+[0-9]+(.[0-9])*\s to remove percentage numbers and currency + value combinations, but you may have run into better ways to manage this.

@ffe4
Member

ffe4 commented Apr 10, 2020

No, dateutil is not expected to parse that type of string, although this is a common question that we should address in the docs. The parser basically identifies all elements that could be part of the date and then tries to work out where each one goes in the datetime object. It assumes that the string you pass consists mostly of the date, with little surrounding noise. Its main purpose is to parse different date formats without your having to define every possible format beforehand.

You will have to sanitize your input before passing it to the parser, although the best strategy for that will depend on what kinds of strings you want to parse. Stray numbers usually cause the problems, so if you know that you can reliably filter them out with a regex, that would be a good solution. Splitting your strings on punctuation marks and deciding which token is more likely to be a date (e.g. by identifying month names or prepositions) could also work. However, if your input is less predictable, you might want to consider natural language processing instead.
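
To make the two sanitizing ideas above concrete, here is a rough sketch using only the standard library. The function names (strip_noise, date_like_fragments) and all of the patterns are hypothetical and not part of dateutil. They assume English month names and currency/percentage formats like the one in this issue, so they would need tuning for other inputs.

```python
import re

# Assumed noise patterns: a currency symbol followed by an amount, and a
# number followed by a percent sign. Adjust these for your own data.
CURRENCY = re.compile(r"[$£€]\s*\d+(?:[.,]\d+)*")
PERCENT = re.compile(r"\d+(?:\.\d+)?\s*%")

# Assumed English month names and common abbreviations.
MONTH = re.compile(
    r"\b(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
    r"|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?"
    r"|dec(?:ember)?)\b",
    re.IGNORECASE,
)


def strip_noise(text):
    """Approach 1: remove stray numbers that are clearly not date parts."""
    text = CURRENCY.sub(" ", text)
    text = PERCENT.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


def date_like_fragments(text):
    """Approach 2: split on punctuation and keep fragments naming a month."""
    fragments = re.split(r"[(),;]", text)
    return [f.strip() for f in fragments if MONTH.search(f)]


raw = "£14.99 (25% off, until April 20)"
print(strip_noise(raw))          # ( off, until April 20)
print(date_like_fragments(raw))  # ['until April 20']
```

Either result can then be passed to parser.parse with fuzzy=True. With no year in the string, the parser fills it in from its default, as in the second example in the original report.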

@jbrockmendel
Contributor

While this isn't likely to be supported anytime soon, "maybe someday" cases like this are collected in test.test_parser.TestParseUnimplementedCases, so we could add a test there.

ffe4 added a commit to ffe4/dateutil that referenced this issue Jun 17, 2020
ffe4 added a commit to ffe4/dateutil that referenced this issue Jun 18, 2020
@ffe4 ffe4 linked a pull request Jun 18, 2020 that will close this issue
mariocj89 added a commit that referenced this issue Jul 5, 2021
Add and xfail unhandled case #1029