Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Be slightly more strict around whitespace parsing #1130

Closed
wants to merge 5 commits into from

Conversation

pitdicker
Copy link
Collaborator

@pitdicker pitdicker commented Jun 6, 2023

#807 made a couple of changes that weren't obvious to me on first glance.

I would like to introduce the changes one by one (but in my own way to be honest...)

When parsing, Item::Space

  • Previously accepted an unlimited number of Unicode whitespaces. It accepted zero number of whitespace characters. And the whitespace characters did not have to match that of the formatting string.
  • After Issue #660 whitespace exact #807 it only accepted the same literal as in the formatting string.
  • With this PR will accept an unlimited number of Unicode whitespaces which do not have to match that of the formatting string. With the change that now there has to be at least one character.

That any whitespace in the formatting string would correspond to optional whitespace when parsing was een undocumented feature.

Some examples that were previously accepted:

  • "%a, %d %b %Y" accepts "Sat, 09 Aug 2013" and "Sat,09Aug2013"
  • "%P %H:%M" accepts "PM 12:59" and "PM12:59"
  • "%H:%M %P" accepts "12:59 AM" and "12:59AM"
  • "%r" accepts "12:34:60 AM", "12:34:60AM"
  • any spaces inside %c, the local date and time format, were optional.
  • "%F %T %Z" accepts "2001-07-08 00:34:60 UTC" and "2001-07-0800:34:60UTC"

I would say that at least in some of these cases a user would not expect the string to parse.

The two cases where I would expect breakage are a space before the AM/PM suffix behind a time ("%H:%M %P"), and a space before the timezone abbreviation. To allow creating a formatting string that can parse such cases I made three more changes:

  • A formatting specifier that can parse optional spaces: % . It will be encoded as Item::Space("").
  • Formatting will write a space for Item::Space(""), where it would write nothing before. This item could only be created manually before, not parsed from a formatting string. And I don't see a reason someone would write, as it did nothing when formatting.
  • %r is changed to make the space between the time and AM/PM optional.

@pitdicker
Copy link
Collaborator Author

@djc How would you consider this breakage? It is quite a bit more minimal than in #807.

With a GitHub code search for %H:%M %P I found two uses for parsing:

  • As an example in the Rust Cookbook
  • As a test in nu-command
    (and in both the sample string included a space, so they would not break)

Searching for %H:%M:%S %P found zero uses for parsing.

%Z is hard to search for, because it also matches %z, which is way more common.

@pitdicker pitdicker force-pushed the space_parsing_item branch 5 times, most recently from 13ebdde to dba1b93 Compare June 8, 2023 16:42
@jtmoon79
Copy link
Contributor

jtmoon79 commented Jul 9, 2023

With this PR will accept an unlimited number of Unicode whitespaces which do not have to match that of the formatting string. With the change that now there has to be at least one character.

IIUC, this PR is proposing to allow formatting string "%a, %d %b %Y" to match on values:

  • "Sat, 09 Aug 2013"
  • "Sat ,09Aug \n\n\n\n\t\t\t\v 2013"

Did I understand the proposal of unlimited whitespace correctly? If so, I think this is a very bad idea.

@pitdicker
Copy link
Collaborator Author

We currently accept what you describe, and accept no whitespace at all. The change is to require at least one whitespace character.

I agree that it doesn't make much sense to accept "Sat ,09Aug \n\n\n\n\t\t\t\v 2013" as a valid string. Yet some other cases do make sense, such as "Sat 9 aug 2013" (two spaces), "2023-07-09 20:16:00" (figure space), or "2023-07-09 20:16:00" (no-break space).

I just want to make sure we can make your work in #807 stick. But it seems to me it may break a little too many reasonable cases. And we don't have a Space Item for nothing, it offered flexibility beyond Literal.

We can consider the two spaces in "Sat 9 aug 2013" a space and padding.
@jtmoon79 @djc What do you think of this rule: The Space item accept any Unicode whitespace, but the number of characters must match?

@jtmoon79
Copy link
Contributor

jtmoon79 commented Jul 9, 2023

I just want to make sure we can make your work in #807 stick. But it seems to me it may break a little too many reasonable cases. And we don't have a Space Item for nothing, it offered flexibility beyond Literal.

We can consider the two spaces in "Sat 9 aug 2013" a space and padding. @jtmoon79 @djc What do you think of this rule: The Space item accept any Unicode whitespace, but the number of characters must match?

I don't find the whitespace ambiguousness to be reasonable. Unless proscribed by an RFC, all characters in the format string and value string should match. This loose matching behavior I found surprising and I had to create a workaround for this oddity (jtmoon79/super-speedy-syslog-searcher#6).

I am open to allowing a new format specifier that allows for limited optional whitespace (I think "zero or one whitespace characters" is the best choice for this proposed specifier). This provides a good choice and would satisfy both "camps" of users; those that prefer precision and those that prefer whitespace flexibility.

@pitdicker pitdicker marked this pull request as draft July 27, 2023 18:47
@pitdicker pitdicker closed this Feb 11, 2024
@pitdicker pitdicker deleted the space_parsing_item branch February 11, 2024 20:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants