Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify that key comparison is ordinal (closes #966) #993

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

marzer
Copy link
Contributor

@marzer marzer commented Sep 27, 2023

@SnoopJ identified in #966 that the TOML spec does not specify how keys are compared. We've discussed the matter to death at this point, and if I never hear the word "normalization" again it'll be too soon 😅, but to recap:

  1. Requiring normalization imposes significant implementation burden, and shuts out people who may explicitly not want that behaviour
  2. Recommending (but not requiring) normalization creates ecosystem fragmentation and round-trip issues
  3. Not saying anything about it at all leaves an ambiguous hole in the spec

So I'm taking @eksortso's proposal in his comment in #966 and running with it. This is not intended to complete with @arp242's #990 - they're orthogonal, really. Regardless of what we decide RE normalization (or lack thereof), we should say something about how keys are treated during comparison.

This also somewhat strengthens the idea that keys are just strings. Since they can be specified using string syntax, there's really no argument to be made for treating them as anything else anyway, IMO. TOML isn't a programming language.

@arp242
Copy link
Contributor

arp242 commented Sep 27, 2023

My main concern is that Windows users will input NFC, because that's what Windows does by default, and macOS users input NFD, because that's what macOS does by default.

Or something to that effect.

It's not really clear to me if that ever happens though, and/or if it differs per language. I guess that for the "simple case" of e.g. ë it's always NFC(?) But what about more complex cases? And do we want to bet it never changes in future versions of Windows/macOS/Android/$system?

I'm okay for quoted keys to always use byte comparisons, because they're used relatively infrequently. I'm a lot less convinced it's a good idea for bare keys though, and I'm afraid it might bite us in the ass, possibly years from now, if we allow both NFC and NFD in bare keys.

Regardless of what we decide RE normalization (or lack thereof), we should say something about how keys are treated during comparison.

We could leave it implementation-defined (and/or merely "recommended" rather than "required"). It's been discussed before, but it's certainly a possibility.

@marzer
Copy link
Contributor Author

marzer commented Sep 27, 2023

We could leave it implementation-defined (and/or merely "recommended" rather than "required")

Yeah, but then we have a fragmentation problem; a parser using ordinal key comparsion may accept a document that another parser using normalized key comparison would reject, and both would be conforming. Seems like a situation we ought to avoid?

Copy link

@SnoopJ SnoopJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no authority in the context of the TOML spec, but this satisfies my original desire for a clearly stated policy and is I think the least disruptive option from the discussion that doesn't close any doors for future decisions.

+1 from me

@arp242
Copy link
Contributor

arp242 commented Sep 27, 2023

We could leave it implementation-defined (and/or merely "recommended" rather than "required")

Yeah, but then we have a fragmentation problem; a parser using ordinal key comparsion may accept a document that another parser using normalized key comparison would reject, and both would be conforming. Seems like a situation we ought to avoid?

It's a trade-off, but it is a possibility.

The main argument against requiring normalisation is that it makes embedded harder, but this is probably not the kind of environment where you're going to use non-ascii in the first place? And how often do you want interoperability between a config file for an embedded use case and something else?

"Quoted TOML keys are always compared by byte, unquoted TOML keys should be normalized, except when the environment makes this prohibitively hard".

I don't really think it will lead to many practical problems: Python, Ruby, Go, etc. can all just add normalisation without too much effort, and toml++ and tomlc99 can just consider it "prohibitively hard", document it, and be done with it. Or make it optional via #define TOML_ENABLE_NORMALISATION 0.

@marzer
Copy link
Contributor Author

marzer commented Sep 28, 2023

You're right that unicode and embedded aren't usually bedfellows. Might not be an issue in practice. Now, anyway; I imagine that will change in the coming (20?) years as lower-level languages get better unicode support out-of-the-box (it's a hot topic on the C++ committee at the moment, for instance).

I'm not a fan of things being optional, though. So far TOML has specified behaviours in a one-size-fits-all manner, and I think that's a strength. For example, TOML integers must losslessly support the full range of signed 64-bit values, and TOML floats must implement IEEE 754 binary64. There's environments out there which don't support these, but the spec doesn't contain any provisions for those, so it's essentially a case of "oh, OK, I guess you need to emulate those, or be non-conforming". If I write a TOML document somewhere, I know it will work with any parser that passes conformance tests, and it would be a real shame if we broke that invariant by allowing something to be optional.

Having said that, adding an optional element to the spec is just a trade-off like any other, and it's not necessarily the worst thing in the world. I could live with it. I just think it'd be a shame to lose the uniformity. 🤷🏼‍♂️

@arp242
Copy link
Contributor

arp242 commented Sep 28, 2023

I'm not a fan either, but clearly something has got to give somewhere.

Or let me put it this way: do you not think that people inputting NFC and NFD for the same key is going to be a problem?

That's just my practical concern. As I mentioned in the issue last week: I feel people have dismissed this a little bit too casually, IMHO. Applications and libraries aren't adding normalisation for the fun of it, right? They do it because it solves issues. I really don't care if keys are considered "strings", "identifiers", "a series of numbers", or anything else. I just care about preventing practical issues, thinking from the perspective of someone writing a TOML document, who may not be technical, know about "combining characters", or may not even have a solid grasp on what a "string" is, exactly (other than "text that's not a number").

The entire reason I choose TOML in the first place for my translation tool is because I considered it the best commonly used format for non-technical people.

As a programmer, it's kind of my job to make things easier for people. Things should "Just Work™", with a minimum of fuss whenever possible. If you read back all my criticism from the last ~2 years then almost all of it can be summarized by that: "I don't think this will Just Work™, with a minimum of fuss, for non-technical TOML authors".

"Minimal" in TOML for me mostly means "Minimal" from the perspective of TOML authors, not so much implementations. Not that implementation minimalism is unimportant, but, well ... something has got to give. A hard requirement on normalisation is too much, but a "strong recommendation"? I could live with that: things will Just Work™, with a minimum of fuss, for non-technical TOML authors. And in the (somewhat rare) situations where there is no normalisation, it's 1) usually not the domain where Unicode is often used, and 2) the users are typically technical enough to figure things out. Seems like an okay trade-off to me.

@marzer
Copy link
Contributor Author

marzer commented Sep 28, 2023

Or let me put it this way: do you not think that people inputting NFC and NFD for the same key is going to be a problem?

No, honestly. It would have already been a problem with quoted keys and both times I've made a concerted effort to review the issue trackers of various TOML parsers for this (or any) normalization scenario I've come up short. Do you have any examples? I'm not saying there aren't any, I just haven't been able to find any myself. Could be that people have just been avoiding unicode in keys altogether because quoted keys are an extra annoyance, of course. Hard to really have hard data here.

thinking from the perspective of someone writing a TOML document, who may not be technical [...] or may not even have a solid grasp on what a "string" is

I understand where you're coming from more broadly, but how realistic a concern is this? Who's authoring TOML files without knowing what a string is, given that toml.io devotes a whole section to strings? We have to assume our users have some basic level of savviness, and I think "have looked at toml.io" is a good place to set that bar. It's not sustainable to cater to the absolute lowest common denominator for a project that is inherently a little bit tech-oriented.

@arp242
Copy link
Contributor

arp242 commented Sep 28, 2023

I don't think many people are using non-ASCII characters today, because no one likes using quoted keys. Like I've said before: it's not a coincidence no one brought up Unicode equivalence before we merged unicode bare keys (after which several did).

There's a lot of levels of "understanding" to "what a string is". I can tell most people "okay, so you write keyname equals-sign, and then some text between quotation marks", after which they "understand" what a string is, for some meaning of "understand" anyway.

Assuming a "basic level of savviness" is fine, but "understanding of combining characters" goes beyond that "basic level", IMO.

Hard to really have hard data here.

Yeah, exactly; I'm not really sure of anything here.

@arp242
Copy link
Contributor

arp242 commented Sep 28, 2023

YAML allows more or less everything in keys; I tried finding any issues or questions relating to Unicode equivalence, and I couldn't find any in the last 30 minutes.

I also can't recall ever seeing a YAML document with non-ASCII keys. So maybe no one will use this in TOML either and all this is for nothing 🤷

On the other hand, TOML is not YAML.

Anyway, I think I've said everything I've wanted to say on the issue. Maybe we should have a vote or something, or everyone lists the options they find "good enough" and see if there's something we all find "good enough", or something.

@erbsland-dev
Copy link

I fully support this addition to the standard.

Below is my rationale for keeping normalization separate from the parser itself.

1. Security

Applications and services typically read their configuration from a TOML file upon startup. Many of these processes start with administrative rights, which are usually revoked after the configuration is read. Introducing a vulnerability in the TOML parser would therefore pose a significant security risk, especially when running with elevated privileges. Minimizing dependencies helps to reduce this attack surface.

Implementing any form of Unicode normalization would necessitate either binding the parser to a complex library like ICU (which I'm not willing to do) or implementing secure normalization logic based on Unicode standards (something I've already done once and would rather not do again).

2. Non-issue

(Assuming that key comparison is defined as ordinal by the standard.)

From the Developer's Perspective:

For very niche use cases (of which I've found no examples), if a strictly defined normalization for key comparison is necessary, there's a straightforward solution. You can normalize the document before feeding it into the parser and ensure the keys in your application are normalized using the same form.

For edge cases like these, my parsers offer the flexibility to implement a custom stream reader. This allows developers to integrate normalization logic while reading the document and feeding the normalized code points into the parser.

From the User's Perspective:

I've yet to encounter configuration files that demand application-defined key names written in scripts where varying normalization forms would be problematic. I also haven't seen practical examples in real-world software in any discussions here.

The only scenario I can imagine is if a user creates a custom configuration—say, an action named "café"—and later references it using a differently normalized form of the key. While theoretically possible, it seems exceedingly rare.

In older software products I've developed, which are used internationally and rely on XML for configuration, I allow users to define element names in their local language, only disallowing control characters. I've never encountered an issue where an element couldn't be referenced because the sysadmin used two different normalization forms within the same configuration file.

@arp242
Copy link
Contributor

arp242 commented Sep 28, 2023

Implementing any form of Unicode normalization would necessitate either binding the parser to a complex library like ICU (which I'm not willing to do)

You don't need ICU as was just discussed the other day, but for your Qt library it's just built-in anyway: https://doc.qt.io/qt-6/qstring.html#normalized

@erbsland-dev
Copy link

You don't need ICU as was just discussed the other day, but for your Qt library it's just built-in: https://doc.qt.io/qt-6/qstring.html#normalized

My work mostly revolves around proprietary software, where I maintain different TOML parser implementations that don't rely on libraries with full normalisation capabilities.

Qt serves as an example regarding dependencies: the library eventually removed its dependency on ICU entirely, opting for its own implementation instead. While adding a few thousand more lines of code for Unicode tables isn't a significant burden for a large, self-contained library like Qt, which has its own string implementation, it is for small independent libraries.

@arp242
Copy link
Contributor

arp242 commented Sep 28, 2023

Well okay, but you still don't need ICU.

There are no inherent security problems with Unicode normalisation, at least not more so than any other string processing in languages where this can be a security problem (and with this standard you can use "security" as an argument for almost every discussion about TOML).

@erbsland-dev
Copy link

@arp242 I apologize, but your last reply has left me a bit puzzled.

There are no inherent security problems with Unicode normalisation, (...)

I never claimed that Unicode normalization itself poses a security risk; in fact, it addresses several issues. My concern centers around dependencies, especially those involving large and complex libraries, which can create security vulnerabilities, as I've previously pointed out.

(...) at least not more so than any other string processing in languages where this can be a security problem (...)

I have a fair amount of trust in the string processing capabilities of well-established implementations, like the C++ Standard Library. Full Unicode support may take some time to be integrated, but the focus is on trustworthiness. I'm far less comfortable relying on large, complex third-party libraries for string processing, particularly when their inclusion is solely for the purpose of string comparisons. Unicode support varies between languages, and that's something we should be mindful of.

The only way to avoid making a C++-based TOML parser dependent on an external library for full Unicode support is to implement it directly within the parser. This would necessitate a complete Unicode character property table, adding complexity.

(and with this standard you can use "security" as an argument for almost every discussion about TOML).

I'm not sure what you're getting at with this point. My security concerns are specific to parser implementation and its potential reliance on third-party libraries. That's precisely why I believe that any required normalization should take place before a TOML document is parsed, rather than being specified in the standard.

@arp242
Copy link
Contributor

arp242 commented Sep 28, 2023

utf8proc.c is 792 lines. It's not that large and includes other things, too. uninorms.cpp from unilib is roughly 150 lines. I'm sure other code is around too, but these were mentioned just a few days ago. Stop it with this "large and complex libraries" stuff already.

@erbsland-dev
Copy link

@arp242 It seems we're veering back into the discussion about code size and complexity, which feels like a diversion from the key points @marzer initially outlined:

(...) "We've discussed the matter to death at this point" (...)

  1. Requiring normalization imposes significant implementation burden, and shuts out people who may explicitly not want that behaviour
  2. Recommending (but not requiring) normalization creates ecosystem fragmentation and round-trip issues
  3. Not saying anything about it at all leaves an ambiguous hole in the spec

While you point out the low code complexity of normalization functions, it's important to consider the accompanying Unicode tables. In the case of utf8proc.c, the utf8proc_data.c file has roughly 17,000 lines of encoded data with embedded macros. When compiled, it results in a library that's around 340KB in size. Similarly, uninorms.cpp is a template that, when compiled with its Unicode tables, results in an object file of approximately 129KB. In both cases, these tables are essential for string normalization.

So while the lines of code for the normalization functions might appear minimal, it's important to also account for the sizable Unicode tables they rely on. These add an additional layer of complexity and size to the equation, and more critically, introduce a third-party library dependency.

The point here isn't to define what constitutes 'reasonable' complexity or size—that's subjective and varies based on the project and its requirements. Rather, it's to recognize that these additional complexities and dependencies are not trivial and may limit TOML's accessibility and utility for some developers. Let's agree that we've delved deeply into this topic and focus on what can realistically be achieved within the constraints of the TOML standard.

@arp242
Copy link
Contributor

arp242 commented Sep 28, 2023

I don't think you're seriously suggesting that the mere presence of 17,000 lines of Unicode tables is a security issue?

Your objection was about security, not code size or binary size:

Implementing any form of Unicode normalization would necessitate either binding the parser to a complex library like ICU (which I'm not willing to do) or implementing secure normalization logic based on Unicode standards (something I've already done once and would rather not do again).

And that's the bit I replied to. You're now trying to pivot it to something else, but that's not what you said.

If you want to say it's too complex: sure, fair enough. I've said that about many proposals. But I've also said that it's a subjective judgement call, and didn't try to invent "security issues" to make it sound more objective and authoritative than it is.

And sure, 130K is rather a lot. But then the discussion can be about "okay, so let's see if we can make that smaller", instead of a red herring about hypothetical non-existing security issues.

@marzer
Copy link
Contributor Author

marzer commented Sep 28, 2023

I'm sure other code is around too, but these were mentioned just a few days ago. Stop it with this "large and complex libraries" stuff already.

Come on, man. Forgetting the stuff about security here in this thread for a moment (because that's not an area I am anywhere near knowledgeable enough to comment on), @erbsland-dev has been consistently against the code complexity elements of unicode normalization throughout, and surely you have to concede that a solution which adds a hundred or so kilobytes of compiled code to an application or library binary is large and complex. You called me out for being reactionary and unloading on someone with points being made in good faith, and that was totally fair, but you are also guilty of it right here (to a lesser extent).

For any reasonable definition of complexity, unicode normalization code is complex, I'd argue disproportionately so given the possible non-benefit it would even yield in TOML's case. It just comes down to where that complexity gets compiled. An external dependency is a non-starter for some, and a greatly inflated binary is a non-starter for others. That was one of my earliest objections to it back in #941, and remains so today, and I don't think there's much to be gained re-litigating it.

My position on allowing it to be optional has relaxed since then though, so I don't see it as a blocker, but to simply say it's not an issue of complexity is false.

@arp242
Copy link
Contributor

arp242 commented Sep 28, 2023

surely you have to concede that a solution which adds a hundred or so kilobytes of compiled code to an application or library binary is large and complex

I made exactly this point way back in Nov 2022 (the bits about Go no longer apply actually, due to changes in the Go ecosystem, but C and C++ obviously still do), as well as in my previous reply a few hours ago. But it's a very different argument from "security".

"Too large" is fundamentally a judgement call. "Security" is not. In e.g. #562 you considered about 40 lines of code "simple enough" whereas I considered it "too complex". Neither of us is "correct" in any objective manner there. I mean, I could use "security" as an argument there too, right? Because fundamentally in some languages any careless string parsing can have potential security issues.

You called me out for being reactionary and unloading on someone with points being made in good faith, and that was totally fair, but you are also guilty of it right here (to a lesser extent).

Well, maybe some was worded a bit sharply, but the pivoting between "security" and "too large" is not something I'm hugely impressed by, and I'm putting this very diplomatically here. That's not because I'm against the "too large" argument, but because I consider it an abuse of the "security" argument.

@erbsland-dev
Copy link

@arp242 I thought I had clearly stated my reasoning, but perhaps I should've provided more context. If I understand you correctly, you're saying that because a 3rd-party library isn't complex, it shouldn't matter from a security standpoint. I apologize for any confusion; English isn't my first language.

Let's move past the discussion about how "complex" or "large" a library is. The issue is the inherent risk of incorporating a 3rd-party library that's developed outside of your direct control. It's not just about the complexity of the library itself, but also about the added complexity of integrating it into your codebase in a safe way.

A thorough security audit encompasses not only my code but also any external libraries that are used. Once a particular version of a library is audited and approved, you generally have to stick with it. Anytime the library gets updated, you're under pressure to update due to security concerns, but not before a new audit is conducted.

Then there's the option for a custom implementation, as Qt has done. This would involve including at least a partial Unicode character table, complete with decomposition type, decomposition mapping, and canonical combining class. Plus, you'd need additional mappings to optimize the algorithm. This code and data would only be used for key comparison in the parser to be compliant with the standard.

That leaves my preferred option: a code-point comparison of the keys. It has the lowest complexity, no dependencies, and allows for the most flexibility in how the parser is used. By stating that keys are compared by code-points, we set clear expectations for both implementers and users.

So that sums it up; I think we've covered all the bases. While I can't claim complete objectivity, I do fully support @marzer's proposal, as it aligns with what I believe is the best direction.

@SnoopJ
Copy link

SnoopJ commented Sep 28, 2023

With respect, I think that this pull request is the wrong place to discuss normalization unless it is directly related to this changeset. In my opinion, discussion of normalization fits better in the existing conversation on #966, or perhaps in a follow-up issue, supposing that this clarification of existing behavior is accepted. Since I helped light the fire here, I would be happy to file that follow-up and summarize the discussion so far, if that would be a helpful way to divide the distinct concerns of clarifying the existing spec vs. extending the spec.

@arp242, what do you think? Do you feel that the changes proposed by this PR should be rejected/deferred because of your worries about normalization?

@ChristianSi
Copy link
Contributor

ChristianSi commented Sep 29, 2023

I must saw I'm a bit baffled by how the discussion in this PR has turned towards detail questions about how to do normalization and how much effort is might be, when the PR itself is about clearly forbidding normalization. How did things get so sidetracked afterwards?

But never mind, and to get back to the PR: Looks good and I'm all in favor!

Personally, I wouldn't be unhappy about language added to the effect that:

While this is the default behavior of every compliant TOML processor, TOML processors might offer options for performing Unicode normalization (to a normalization form such as NFC or NFD) if explicitly requested by the user.

However, I realize that that might considered a complication and won't insist on it if there are reservations against such an additional clause.

@arp242
Copy link
Contributor

arp242 commented Sep 29, 2023

With respect, I think that this pull request is the wrong place to discuss normalization unless it is directly related to this changeset

@arp242, what do you think? Do you feel that the changes proposed by this PR should be rejected/deferred because of your worries about normalization?

This doesn't really address what to do with bare unquoted keys; so I wouldn't be in favour of merging this as-is. As I mentioned in #994, it's more or less the only option I strongly dislike.

@marzer
Copy link
Contributor Author

marzer commented Sep 29, 2023

This doesn't really address what to do with bare unquoted keys

Well then, that seems very simple to me: we should do the same thing for all keys, and not introduce different comparison methodology depending on the way the user chose to express the key, because that's confusing and user-hostile IMO. Keys are strings that just-so-happen to not need quotes sometimes; whatever gets decided later RE normalization, be it optional, mandatory, or forbidden, must be applied equally to all keys. This PR doesn't outright close the door to any of that - we can always update the clarifying example to reflect whatever gets decided (especially because the example in this PR is a "valid but discouraged" one RE keys which might cause issues later). All I'm trying to do here is remove the ambiguity that's already there.

If someone would like to propose wording, then I'm all ears. Otherwise I might consider amending this PR tomorrow with something of my own.

@arp242
Copy link
Contributor

arp242 commented Sep 29, 2023

not introduce different comparison methodology depending on the way the user chose to express the key, because that's confusing and user-hostile IMO

I (and others) have argued the opposite extensively: that not doing any normalisation with the current proposal is confusing and user-hostile 🤷😅

The current main is "unstable" and has a "bug"; we need to either: 1) fix things once and for all, 2) revert it to the current status quo. "Well, maybe patch this a bit here and see the rest later" is just going to continue the pain.

@marzer
Copy link
Contributor Author

marzer commented Sep 29, 2023

I (and others) have argued the opposite extensively: that not doing any normalisation with the current proposal is confusing and user-hostile

Yeah, and I'm not really sure how we square that circle, because it appears to be entirely down to what your mental model for keys is.

To be clear though, I'm not saying anything for or against normalization here, or that doing or not doing it is user-hostile; I'm saying that leaving room for key comparisons to be handled differently according to syntax is, and would prefer it to be uniform (whatever "it" ends up being). Obviously opinions on that will differ, too.

@arp242
Copy link
Contributor

arp242 commented Sep 29, 2023

I agree it's not brilliant, but "quoted keys are literal, unquoted at not" seems to make sense to me.

The only way this isn't going to cause problems is if people are not actually going to use non-ASCII characters in bare keys much, just like people aren't used quoted keys much today. That's actually a very real possibility.

This is basically the status in YAML, which allows more or less everything except control characters without normalisation, but I also don't think lots of people are using non-ASCII keys (not as far as I could find the other day anyway).

@marzer
Copy link
Contributor Author

marzer commented Sep 30, 2023

I agree it's not brilliant, but "quoted keys are literal, unquoted at not" seems to make sense to me.

I can see where you're coming from, it's very reasonable, just doesn't fit with my mental model.

In any case, I consider all of that orthogonal to this PR. To my mind, normalization is a transformation performed by a parser as it ingests keys, whereas this PR is just a clarifying statement about how they are compared with each other after that (i.e. after they may or may not have gone through any unicode normalization).

We can say "keys compare equal if their code point sequences are equal" now, and later amend it to say something like "implementations may normalize the code point sequences of bare keys during parsing" (for example), and still have it wholly compatible. In that spirit I'd like to move off more normalization discussion here in this PR.

@arp242
Copy link
Contributor

arp242 commented Sep 30, 2023

I mean, this PR describes the status quo (for quoted keys), so in that sense it's fine.

The thing is that if it gets merged it'll become the "main" and that's that. Is that a problem? Not really, except people are already (and have been, for quite some time) saying "we discussed this, it's merged, so don't bring it up again", even when you bring up issues that were never previously mentioned. So that's really my fear: whatever is in main has a certain type of inertia, and the current Unicode + no normalisation becomes the new "status quo", and it'll be so much harder to do anything about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants