libnet/i/defaultipam: Disambiguate PoolID string format #47837

akerouanton · 2024-05-16T15:22:39Z

- What I did

Prior to this change, the PoolID microformat used slashes to separate fields. Those fields include subnet prefixes in CIDR notation, which also include a slash. This makes future evolution harder than it should be.

This change introduces a 'v2' microformat which uses: 1. named fields to disambiguate which field each value is associated with; 2. semicolons as a separator.

The 'v1' encoding will be kept until the next major MCR LTS is released after v27.

- How to verify it

CI & tests.
Manual testing -- create a network on master, check out this branch, restart the daemon.

- A picture of a cute animal (not mandatory but encouraged)

robmry

A much better format!

I guess it'll be impossible to downgrade without deleting networks, because old builds won't recognise the new format? If a pool doesn't have any new features that make it unsafe for an old build to use (the normal case, for all existing networks) - perhaps it'd be better to generate the pool id in the old format?

robmry · 2024-05-16T15:46:27Z

libnetwork/ipams/defaultipam/structures.go

+	str = strings.TrimSuffix(strings.TrimPrefix(str, "PoolID{"), "}")
+
+	for _, field := range strings.FieldsFunc(str, func(c rune) bool { return c == ';' }) {
+		p := strings.SplitN(field, "=", 2)


It's probably worth checking SplitN found an = and returned two strings, before using p[1].

(Or, perhaps better - the docs for SplitN say "To split around the first instance of a separator, see Cut", which returns two strings and an "ok".)

strings.Cut is a better option overall (https://pkg.go.dev/strings#Cut), but it only splits around the first =, so any further = in the string would silently end up in the value, in case we need to care about those.
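For reference, a minimal sketch of what the strings.Cut variant could look like. The helper name parseField is hypothetical and not part of the PR; the point is only that Cut reports whether a separator was found, and that everything after the first = lands in the value.

```go
package main

import (
	"fmt"
	"strings"
)

// parseField is a hypothetical helper: it splits a "key=value" field
// around the first '='. ok is false when the field contains no '='.
func parseField(field string) (key, value string, ok bool) {
	return strings.Cut(field, "=")
}

func main() {
	for _, f := range []string{"Subnet=172.27.0.0/16", "AddressSpace", "Key=a=b"} {
		k, v, ok := parseField(f)
		// With "Key=a=b" the second '=' stays inside the value.
		fmt.Printf("%q -> key=%q value=%q ok=%t\n", f, k, v, ok)
	}
}
```

A caller would check ok before using the value, which covers the missing-= case that p[1] would otherwise panic on.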

robmry · 2024-05-16T16:25:59Z

libnetwork/ipams/defaultipam/allocator_test.go

 	k2, err = PoolIDFromString(expected)
-	if err != nil {
-		t.Fatal(err)
-	}
+	assert.NilError(t, err)
+
 	if k2.AddressSpace != k.AddressSpace || k2.Subnet != k.Subnet || k2.ChildSubnet != k.ChildSubnet {
 		t.Fatalf("SubnetKey.FromString() failed. Expected %v. Got %v", k, k2)
 	}


Maybe worth testing PoolIDFromString with a v2 string that has no ChildSubnet? (And some invalid strings to check the InvalidParameter code paths?)

(Could also convert more of the old test to use Equal rather than comparing individual fields, and use assert.Check(t, is.Equal(...)).)

robmry · 2024-05-16T16:53:44Z

libnetwork/ipams/defaultipam/structures.go

+		switch p[0] {
+		case "AddressSpace":
+			pID.AddressSpace = p[1]
+		case "Subnet":
+			if pID.Subnet, err = netip.ParsePrefix(p[1]); err != nil {
+				return PoolID{}, types.InvalidParameterErrorf("invalid string form for subnetkey: %s", str)
+			}
+		case "ChildSubnet":
+			if pID.ChildSubnet, err = netip.ParsePrefix(p[1]); err != nil {
+				return PoolID{}, types.InvalidParameterErrorf("invalid string form for subnetkey: %s", str)
+			}
+		}


Is the plan to add more fields? ... this will silently drop fields it doesn't recognise, is that likely to be safe behaviour on downgrade?

I guess it's impossible to predict, but it might be safest to bomb out? Or, could come up with a naming convention that says whether the address space is safe to use without understanding a field?

(Could log unknown fields, but that might not help much.)

Is the plan to add more fields? ... this will silently drop fields it doesn't recognise, is that likely to be safe behaviour on downgrade?

Yeah, we might need a new field to properly support duplicate static allocations (to uniquely identify each duplicate).

this will silently drop fields it doesn't recognise, is that likely to be safe behaviour on downgrade? I guess it's impossible to predict, but it might be safest to bomb out?

Indeed, my intention was to make it easy to add new fields while providing a safe / easy downgrade path.

Or, could come up with a naming convention that says whether the address space is safe to use without understanding a field?

Hm, not sure I see what you mean 🤔 We can't predict what future fields (if any) will look like. Each major release will understand a specific set of fields, matching a specific implementation of the Allocator / addrSpace.

(Could log unknown fields, but that might not help much.)

We could log, yeah. But this would happen only on downgrade (so not really useful) or if we make a big mistake (e.g. a typo in a field name in the serialization code). Not sure if that's really valuable 🤔

Hm, not sure I see what you mean 🤔 We can't predict what future fields (if any) will look like. Each major release will understand a specific set of fields, matching a specific implementation of the Allocator / addrSpace.

I was thinking there may be two categories of new field in the id ... informational, some annotation for debug/logging/inspect that old builds can safely ignore. Or, something new that means an old daemon can't safely use the pool because it doesn't know how it works.

The duplicate-tracking field is probably in the second category? I guess an old daemon that didn't understand the field would see duplicate pool-ids, maybe lose data from one of the pools, and generally be a bit weird? (So, we'd want it to fail if the new id-field is used. In which case, it's an example of a field that shouldn't be silently ignored.)

For future backwards-compatibility - we could come up with a convention along the lines of "an unknown field name with prefix 'Info' can be silently ignored, other unknown fields mean the pool is unusable". We should probably make sure the daemon starts with unusable pools, so that they can be deleted and re-created. Maybe it's not necessary, I don't have examples, but this would be the time to build it in if it might be needed.

akerouanton · 2024-05-16T18:27:51Z

I guess it'll be impossible to downgrade without deleting networks, because old builds won't recognise the new format?

Yeah, correct. This implementation doesn't offer a downgrade path. I think it'd be better to ask users to back up their networkdb before upgrading, and restore it if they need to downgrade.

If a pool doesn't have any new features that make it unsafe for an old build to use (the normal case, for all existing networks) - perhaps it'd be better to generate the pool id in the old format?

We could, but we'll have to transition to the v2 format at some point. I'd prefer to kill it right away.

robmry · 2024-05-16T18:58:20Z

We could, but we'll have to transition to the v2 format at some point. I'd prefer to kill it right away.

Once the new format's existed for a release-or-so, it'll be fine to store the new format even where the old format would work, because old builds will understand the new format.

Most people won't read/follow instructions to back up the network-db (!). Even if they do, if they make changes between upgrading and discovering a problem that means they need to roll-back, it'll still be a problem - and they'll already be annoyed by that point. Also, after downgrading, I think there will just be a strange 'invalid parameter' error logged during startup (?) ... so it won't be obvious that the network-db has to be deleted, and networks re-created. (Bugs that make downgrades necessary will have a much bigger impact.)

Even for development, I sometimes end up flipping between versions to compare behaviour. Re-creating networks in that case wouldn't be the end of the world, just more-faff.

corhere · 2024-05-16T19:49:54Z

IIRC in the IPAM contract, the PoolID string is an opaque handle to an allocated IPAM pool. I don't think it is even exposed to the user anywhere in the Engine API. It might show up in logs so a printable string would be nice to have, but that's about it. Aside from that, all defaultipam needs is to be able to round-trip some KV pairs through a string. Why do we need to roll a fully custom microformat when there are so many codecs — in stdlib, even! — which are fit for purpose?

Strawman: encode the KV pairs using some codec that we don't have to write the marshaler or unmarshaller for, prepending it with some magic string identifying it as v2 format. The only parsing necessary would be strings.CutPrefix. For example:

JSON: PoolID{"AddressSpace":"default","Subnet":"172.27.0.0/16"}
application/x-www-form-urlencoded: pool:addressspace=default&subnet=172.27.0.0%2F16
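A rough sketch of the JSON variant of this strawman, assuming a magic prefix of "PoolID" (taken from the example string above) and a flat map[string]string payload. The names v2Prefix, encodeV2 and decodeV2 are made up for illustration and are not the PR's code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// v2Prefix is an assumed magic string identifying the v2 encoding.
const v2Prefix = "PoolID"

// encodeV2 marshals the key/value pairs as a JSON object and prepends
// the prefix. Marshaling a map[string]string is not expected to fail.
func encodeV2(kv map[string]string) string {
	b, err := json.Marshal(kv)
	if err != nil {
		panic(err)
	}
	return v2Prefix + string(b)
}

// decodeV2 strips the prefix and unmarshals the remaining JSON object.
func decodeV2(s string) (map[string]string, error) {
	data, ok := strings.CutPrefix(s, v2Prefix)
	if !ok {
		return nil, fmt.Errorf("not a v2 PoolID string: %q", s)
	}
	var kv map[string]string
	if err := json.Unmarshal([]byte(data), &kv); err != nil {
		return nil, fmt.Errorf("invalid v2 PoolID string %q: %w", s, err)
	}
	return kv, nil
}

func main() {
	s := encodeV2(map[string]string{"AddressSpace": "default", "Subnet": "172.27.0.0/16"})
	fmt.Println(s)

	kv, err := decodeV2(s)
	if err != nil {
		panic(err)
	}
	fmt.Println(kv["AddressSpace"], kv["Subnet"])
}
```

encoding/json writes map keys in sorted order, so the encoded string is stable for a given set of pairs.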

Speaking of downgrades, while we can't support downgrades to a version that does not understand the v2 PoolID string format, we could plan ahead a little bit to make the v2 format more amenable to backwards compatibility to go along with the extensibility. The PNG format, for instance, has a neat trick for extensibility: each "chunk" of the file is encoded with a header that signifies the chunk type. The really clever bit is that the chunk type also encodes a "critical/ancillary" flag. PNG parsers encountering an unknown chunk examine the flag to determine whether to skip over the chunk or error out. (PNG chunk tags are 4-char strings. Ones starting with an uppercase char are critical, lowercase ancillary.) I think we should come up with some scheme to signify whether a particular PoolID KV pair is critical or ancillary, and have the parser fail to unmarshal unknown critical KVs and discard ancillary ones.

thaJeztah · 2024-05-16T19:49:55Z

Silly question (sorry didn't go through all the comments); is this parsing in a "hot-path", or would using JSON work for this? (at least with JSON we'd have a format that we know works, and we wouldn't have to come up with our own; it would also be extensible (adding more fields)).

thaJeztah · 2024-05-16T19:51:05Z

Oh, LOL, looks like Cory and I commented at the same time. 😂 (his post is definitely more in-depth than mine)

robmry · 2024-05-16T21:57:56Z

We could backport this change to 26.1.x - then at least it'll be possible to roll back to 26.1.latest, and we can ditch the old format now. We could also make sure 26.1 does something sensible if it finds an id it doesn't understand.

(There's quite a lot of networking change going into 27.0, and our track record isn't great. I think it's worth planning for problems.)

akerouanton · 2024-05-17T12:21:22Z

We talked about downgrades with @robmry during a 1:1 and his idea is pretty neat: backport the deserializer to v26.1 but keep the v1 serializer.

As he mentioned, v27 is going to be quite heavy in terms of networking changes, so it carries some risks. This would help offset those risks by giving users an escape hatch if things go wrong. And I guess we'd need to backport the deserializer to v25 too for MCR.

Why do we need to roll a fully custom microformat when there are so many codecs — in stdlib, even! — which are fit for purpose?

I excluded JSON marshalling specifically because json.Marshal() returns an error. I thought it'd be preferable to write error-free code than to ignore a marshalling error. A handwritten deserializer also gives us more latitude if we want to bake some custom logic into it. For instance, the 'critical/ancillary' flag you're proposing -- I fail to see how you'd implement that with just json.Marshal().

Speaking of downgrades, while we can't support downgrades to a version that does not understand the v2 PoolID string format, we could plan ahead a little bit to make the v2 format more amenable to backwards compatibility to go along with the extensibility. [...]

That seems to be quite involved / over-engineered. At this point we don't plan to make any changes to PoolID beyond what's needed to implement proper support for duplicate static allocations. I'd prefer to stick with Rob's idea in the future: backport PoolID deserializer to the current latest if there are incompatible changes made.

thaJeztah · 2024-05-17T13:57:52Z

I excluded JSON marshalling specifically because json.Marshal() returns an error. I thought it'd be preferable to write error-free code than to ignore a marshalling error.

I'm not exactly sure what scenario you had in mind here; wouldn't the current code already produce errors in many scenarios?

It may (currently) ignore some cases, such as a string starting with PoolID{ but missing a closing }. Currently it would also accept repeated keys (multiple AddressSpace, Subnet or ChildSubnet keys); the question is, do we want to ignore all of those cases?

akerouanton · 2024-05-17T15:29:44Z

I'm not exactly sure what scenario you had in mind here; wouldn't the current code already produce errors in many scenarios?

It may (currently) ignore some cases, such as a string starting with PoolID{ but missing a closing }. Currently it would also accept repeated keys (multiple AddressSpace, Subnet or ChildSubnet keys); the question is, do we want to ignore all of those cases?

Serialization is done in String() string -- we can't return an error. AFAIK, all stdlib codecs' marshalers return an error. If we use one of them, we'll end up silencing that error. Maybe that's not that important in the end.

Another, and maybe more important downside of stdlib's codecs: if we want something custom we have to implement a specific marshaler interface. One concrete example: to add proper support for duplicate static allocations, I need to add a new field 'AllocID' to 'PoolID', but I need to distinguish between that field being not set and being the zero value. If I use json marshaling, I'll end up writing my own marshaler.

FWIW, I'm ruling out url encoding as the stringified PoolID might end up in logs and error messages. I'd prefer to keep it easily readable both for us and for end-users.

corhere · 2024-05-17T20:29:02Z

I excluded JSON marshalling specifically because json.Marshal() returns an error. I thought it'd be preferable to write error-free code than to ignore a marshalling error.

Don't ignore the error, then!

if err != nil {
        panic(err)
}

json.Marshal() returns errors in specific circumstances, which are all documented. It will never return an error if you only pass it marshal-able data. E.g. json.Marshal(map[string]string{...}) will never return an error.

A handwritten deserializer also gives us more latitude if we want to bake some custom logic into it. For instance, the 'critical/ancillary' flag you're proposing -- I fail to see how you'd implement that with just json.Marshal().

Well, you wouldn't. json.Marshal and json.Unmarshal are only concerned with the syntax and structure of arbitrary JSON documents. The criticalness of a field is a semantic property of the data, independent of any "on-the-wire" representation.

The PoolID message is an unordered mapping of string -> string (a.k.a. map[string]string). Suppose we use the convention that a critical key is signified by the first char being uppercase ASCII (like PNG). Validating whether the document is comprehensible by the consumer is part of the semantic analysis of the message, which can only be done once the message has been parsed: Validate(map[string]string) error. Any codec could be used which is able to marshal a map[string]string to a string and unmarshal back to a map[string]string. And it could be substituted with any other codec to yield a functionally equivalent — albeit incompatible — implementation. The same validation logic could be used, independently of the codec.
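To make that concrete, here is one possible shape for such a check. It is only a sketch under the convention described above (first character uppercase ASCII means critical); the set of known keys and the helper isCritical are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// knownKeys is the (assumed) set of keys this consumer understands.
var knownKeys = map[string]bool{
	"AddressSpace": true,
	"Subnet":       true,
	"ChildSubnet":  true,
}

// isCritical applies the PNG-like convention: a key whose first byte is
// an uppercase ASCII letter is critical, anything else is ancillary.
func isCritical(key string) bool {
	return key != "" && key[0] >= 'A' && key[0] <= 'Z'
}

// Validate returns an error if msg contains a critical key that this
// consumer does not know. Unknown ancillary keys are tolerated.
func Validate(msg map[string]string) error {
	var unknown []string
	for k := range msg {
		if !knownKeys[k] && isCritical(k) {
			unknown = append(unknown, k)
		}
	}
	if len(unknown) > 0 {
		sort.Strings(unknown) // keep the error text deterministic
		return fmt.Errorf("unknown critical keys: %v", unknown)
	}
	return nil
}

func main() {
	// Unknown ancillary key: accepted.
	fmt.Println(Validate(map[string]string{"AddressSpace": "default", "note": "debug"}))
	// Unknown critical key: rejected.
	fmt.Println(Validate(map[string]string{"AddressSpace": "default", "AllocID": "1"}))
}
```

Nothing in Validate depends on how the map was decoded, so the codec could be swapped without touching it.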

Designing and implementing an extensible serialization format for structured data, even flat K-V pairs, is not a trivial endeavour. The grammar needs to be unambiguous, invalid syntax needs to be rejected, and some scheme is necessary for escaping syntactically-significant productions embedded in user data. Case in point: with the PoolID{...} v2 serialization format, how could you round-trip the struct literal PoolID{AddressSpace: "foo;bar"}? I'm sure you could come up with some scheme for escaping semicolons, and escaping the escape character, but now the parser is more complex and needs more extensive testing. Using an off-the-shelf codec lets us leave all those fiddly details to an already tested and proven solution that we don't need to test or review ourselves. That frees us to focus on the interesting bits: the semantics of the messages.

Marshaling to a JSON object is trivial. Just marshal a map or a struct with json tags. On the unmarshal side, simply apply the semantic analysis to successfully-unmarshaled messages.
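The "foo;bar" case can be illustrated with a small comparison. This is a sketch only: naiveFields copies the field-splitting call from the handwritten parser quoted earlier in the review, and jsonRoundTrip is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// naiveFields splits the input on every ';' with no escaping, the same
// way the handwritten v2 parser split its fields.
func naiveFields(s string) []string {
	return strings.FieldsFunc(s, func(c rune) bool { return c == ';' })
}

// jsonRoundTrip marshals kv to JSON and unmarshals it back.
func jsonRoundTrip(kv map[string]string) (map[string]string, error) {
	b, err := json.Marshal(kv)
	if err != nil {
		return nil, err
	}
	var out map[string]string
	if err := json.Unmarshal(b, &out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	// The value contains the separator, so the naive split yields two
	// fields and the second one has lost its key.
	fmt.Printf("%q\n", naiveFields("AddressSpace=foo;bar"))

	// The JSON codec quotes the value, so it survives unchanged.
	out, err := jsonRoundTrip(map[string]string{"AddressSpace": "foo;bar"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", out["AddressSpace"])
}
```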

robmry · 2024-05-21T10:01:14Z

libnetwork/ipams/defaultipam/structures.go

+	data := strings.TrimPrefix(str, poolIDV2Prefix)
+
+	if err := json.Unmarshal([]byte(data), &pID); err != nil {
+		return PoolID{}, err


This should probably get the types.InvalidParameterErrorf() treatment too? (Although they might all be internal errors really, as this isn't user input?)

I fixed that in the original commit and added a new one to convert all InvalidParameter into InternalErrors.

akerouanton · 2024-05-22T07:17:49Z

I've updated this PR to use a JSON codec instead of my handwritten (un)marshaler. I'm using a map of strings as I'll need to analyze which fields are set in a future PR.
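As a sketch of why a map helps there: a map lookup reports whether a key was present, so "not set" and "set to the zero value" stay distinguishable after decoding. The helper lookup is hypothetical, and a string-valued AllocID is assumed only for the example (the thread doesn't state its type).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lookup decodes a JSON object of string pairs and reports the value
// stored under key, plus whether the key was present at all.
func lookup(data, key string) (value string, set bool, err error) {
	var kv map[string]string
	if err := json.Unmarshal([]byte(data), &kv); err != nil {
		return "", false, err
	}
	value, set = kv[key]
	return value, set, nil
}

func main() {
	for _, data := range []string{
		`{"AddressSpace":"default"}`,              // AllocID absent
		`{"AddressSpace":"default","AllocID":""}`, // AllocID present but empty
	} {
		v, set, err := lookup(data, "AllocID")
		if err != nil {
			panic(err)
		}
		fmt.Printf("AllocID=%q set=%t\n", v, set)
	}
}
```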

--

I'm still quite skeptical about the 'critical / ancillary' fields thingy.

First, because I fail to see an example where that would be useful. I'm planning to add a new 'AllocID' field and we might consider making the allocator VNI-aware. But that's about it, there are no other fields we plan to add in the foreseeable future.

Second, because we really want downgrades to be possible without any errors -- 'critical' fields are going to make networks unusable. It seems the backporting strategy suggested by @robmry is better suited for that. It'd let us decide case by case how downgrade scenarios should be handled for each new field (at least if older versions shouldn't ignore those fields). This gives us the ability to handle them gracefully.

Prior to this change PoolID microformat was using slashes to separate fields. Those fields include subnet prefixes in CIDR notation, which also include a slash. This makes future evolution harder than it should be.

This change introduces a 'v2' microformat based on JSON. This has two advantages:

1. Fields are clearly named to ensure each value is associated to the right field.
2. Field values and separators are clearly distinguished to remove any ambiguity.

The 'v1' encoding will be kept until the next major MCR LTS is released.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>

…rrof

InvalidParameterErrorf was used whenever an invalid value was found during PoolID unmarshaling. This error is converted to a 400 HTTP code by the HTTP server.

However, users never provide PoolIDs directly -- these are constructed from user-supplied values which are already validated when the PoolID is marshaled. Hence, if such erroneous value is found, it's an internal error and should be converted to a 500.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>

corhere · 2024-05-22T19:24:38Z

First, because I fail to see an example where that would be useful. I'm planning to add a new 'AllocID' field and we might consider making the allocator VNI-aware. But that's about it, there are no other fields we plan to add in the foreseeable future.

It's for the unforeseen circumstances.

Second, because we really want downgrades to be possible without any errors -- 'critical' fields are going to make networks unusable. It seems the backporting strategy suggested by @robmry is more suited for that.

The two are not mutually exclusive. I would say that they go hand in hand. Flagging a field as critical affords fail fast behaviour when downgrading too far, alerting the operator that they may need to look for a newer patch version of the daemon with the backport. Otherwise, if the field was truly critical, the network may be silently (or subtly) broken on a downgraded engine that does not understand the field. If a downgraded engine does not need to understand the field to behave correctly, the field is ancillary by definition.

corhere · 2024-05-22T19:27:43Z

libnetwork/ipams/defaultipam/structures.go

+	if strings.HasPrefix(str, poolIDV2Prefix) {
+		return parsePoolIDV2(str)
+	}


May as well test and trim in the same operation.

Suggested change

-	if strings.HasPrefix(str, poolIDV2Prefix) {
-		return parsePoolIDV2(str)
-	}
+	if v, ok := strings.CutPrefix(str, poolIDV2Prefix); ok {
+		return parsePoolIDV2(v)
+	}

Yeah I considered that but decided to pass the unaltered string instead to have it in error messages.

akerouanton added the status/2-code-review, area/networking, kind/refactor, and area/networking/ipam labels May 16, 2024

akerouanton added this to the 27.0.0 milestone May 16, 2024

akerouanton requested review from corhere and robmry May 16, 2024 15:22

akerouanton self-assigned this May 16, 2024

robmry reviewed May 16, 2024

akerouanton force-pushed the libnet-ipam-disambiguate-PoolID branch 2 times, most recently from b3f1071 to d3584ae Compare May 21, 2024 09:34

robmry reviewed May 21, 2024

akerouanton force-pushed the libnet-ipam-disambiguate-PoolID branch 2 times, most recently from 3e6645a to eab09b1 Compare May 21, 2024 11:48

robmry approved these changes May 21, 2024

akerouanton force-pushed the libnet-ipam-disambiguate-PoolID branch from eab09b1 to cd9ccba Compare May 22, 2024 06:58

akerouanton requested a review from robmry May 22, 2024 07:01

akerouanton added 2 commits May 22, 2024 10:02

akerouanton force-pushed the libnet-ipam-disambiguate-PoolID branch from cd9ccba to 5a2fa59 Compare May 22, 2024 08:03

robmry approved these changes May 22, 2024

corhere approved these changes May 22, 2024

akerouanton merged commit 5f183b9 into moby:master May 22, 2024
126 checks passed

akerouanton deleted the libnet-ipam-disambiguate-PoolID branch May 22, 2024 20:52

akerouanton mentioned this pull request May 23, 2024

[26.1 backport] libnet/ipam: Decode PoolID v2 format #47852

Closed

akerouanton added process/cherry-picked and removed status/2-code-review labels May 23, 2024
