
msgpack: truncate long string prefix refinements #167

Merged — 2 commits merged into zclconf:main on Aug 24, 2023

Conversation


@kmoe (Contributor) commented on Aug 23, 2023

Serialising an unknown value that carries a sufficiently long string prefix refinement results in an error on deserialisation, since cty limits the size of the refinements blob to 1024 bytes.

This PR is an implementation of the approach described in hashicorp/terraform#33464 (comment) for getting around this. Of course we would have to truncate the value while deserialising as well, otherwise the value does not round-trip correctly.

@apparentlymart I'm assuming that we don't want to impose a limit on prefix length when constructing the refinement in the first place (nor does the RefinementBuilder seem to allow for this, as far as I can tell). Silently truncating the prefix during serialisation seems like it suffices to preserve "approximations of refinements" under msgpack as documented, but if we really believe there is no proper use case for string prefix refinements greater than 256 bytes, can we let the consumer know more explicitly?

@apparentlymart (Collaborator):

Thanks for working on this!

The string prefix refinement is primarily intended for short prefixes like https:// or ami- or similar, so I find it a reasonable compromise to truncate longer prefixes during serialization to limit the amount of overhead refinements can create -- not all consumers of values will do anything with the refinements anyway, but every user of the msgpack serialization must still pay the cost of storing/transmitting them.

I considered handling this during serialization to be a pragmatic compromise because it means that applications which don't serialize their values can still benefit from the longer prefixes in the less common cases where they are relevant -- these strings were presumably already in RAM anyway, so that storage cost has already been paid -- and then (as you said) this becomes a reasonable "approximation" for applications that do use serialized values to transmit refinements, as Terraform does with its providers.

It is true that this is a situation where values will not round-trip losslessly. While that's non-ideal I think it's a reasonable compromise. It does mean, I suppose, that the TestRoundTrip test model will not work for this particular scenario, and so this would either need a new test specialized for this situation or a special rule in the TestRoundTrip test loop which considers an unknown string to have "round-tripped" correctly if the expected prefix has been preserved. The former (a new test) is probably the easier to implement, because then we can hand-write the expected value rather than having to describe a general rule for deriving it from arbitrary input.

Prior to this commit, serialising an unknown value with a string prefix refinement with a sufficiently long prefix would result in an error on deserialisation, since cty limits the size of the refinements blob to 1024 bytes.

We choose an arbitrary maximum length of 256 bytes and truncate any
string prefix refinement on serialisation.

This behaviour is an exception to the general case where refined values
will round-trip losslessly.
@kmoe force-pushed the msgpack-truncate-prefix-refinement branch from e702e81 to 55a402b on August 24, 2023 at 13:52
@@ -26,10 +26,6 @@ import (
// application can guarantee that the remainder of the string will not begin
// with combining marks then it is safe to instead just normalize the prefix
// string with [Normalize].
//
// Note that this function only takes into account normalization boundaries
// and does _not_ take into account grapheme cluster boundaries as defined
@kmoe (Contributor, Author):

I assume this is a mistake, unless "takes into account grapheme cluster boundaries" is intended to have a different meaning here than in line 14.

@apparentlymart (Collaborator):


I think what I was intending to say here is that this function trims off zero or more normalization chunks so as to avoid the possibility that the resulting string might end with one that can be the prefix of a grapheme cluster chunk. Or in other words, this function potentially violates the usual rule that cty treats grapheme clusters as indivisible units under substring operations; it in fact intentionally divides them because it's trying to detect the possibility of partial ones.

But I agree that this comment is more confusing than helpful in retrospect, so I'm fine with removing it and if some confusion arises later about this the discussion about it will hopefully suggest a clearer way to explain the tradeoff this function is making.

@kmoe (Contributor, Author) commented on Aug 24, 2023

Thanks @apparentlymart, I've created a new test for this case demonstrating the truncation behaviour.

cty.UnknownVal(cty.String).Refine().StringPrefix(strings.Repeat("c", 255) + "-").NewValue(),
},
{
cty.UnknownVal(cty.String).Refine().StringPrefix(strings.Repeat("d", 255) + "🤷🤷").NewValue(),
@kmoe (Contributor, Author):

Happy to add comments for any of these that need it. For example, in this case the StringPrefix() call trims the final "🤷", then the remaining "🤷" is trimmed during serialisation.

@apparentlymart (Collaborator):

This looks great to me, so I'm going to merge it now. Thanks!

@apparentlymart apparentlymart merged commit 560dd28 into zclconf:main Aug 24, 2023