
Revert "Fix for the SDS update failure (#615)" as no longer needed on top of #559 #657

Merged: 2 commits, Mar 20, 2023

Conversation

valerian-roche (Contributor)

As discussed in this issue, the fix done in #615 is creating problems:

  • it is no longer applicable on top of delta: [#558] Support dynamic wildcard subscription in delta-xds #559, which already ensures that subscription changes are properly reflected without modifying the VersionMap
  • it is invalid when new resources are subscribed to or removed, leading to further issues in this context (e.g. a resource deletion will never be notified if the resource was not originally watched)

#615 did not include any test, so it is hard to know whether we are properly covering the use case, but from the discussions we should already be. If that is not the case (e.g. in SotW), a fix similar to #559 will likely be required.

@@ -119,26 +119,8 @@ func (s *server) processDelta(str stream.DeltaStream, reqCh <-chan *discovery.De

watch := watches.deltaWatches[typ]
watch.nonce = nonce
// As per XDS protocol, for the non wildcard resources, management server should only respond to the resources
Contributor

[Query]
I understand that this fix is not required as we are maintaining a separate list of subscribed resources. However, after removing the change below, the race condition is still applicable and will result in removing from the resource versionMap an entry that was already notified to the client.

Now, during the next setSnapshot, when we create the delta response, the implementation below still relies on GetResourceVersions(), which is invalid due to the race condition (a resource that was successfully notified to the client was overwritten). Since GetResourceVersions() is incorrect, we will unnecessarily send the resource to the client again even though it was already notified and there was no change in the hash value.

I agree that notifying the resource again is not harmful, but it might not be an efficient approach with a large configuration. IMO we should address the race condition so that resources which were already notified remain intact in the resourceVersionMap.

@valerian-roche @haoruolei Please correct me if I am wrong in my findings. Thanks.

nextVersionMap = make(map[string]string, len(state.GetSubscribedResourceNames()))
// state.GetResourceVersions() may include resources no longer subscribed
// In the current code this gets silently cleaned when updating the version map
for name := range state.GetSubscribedResourceNames() {
	prevVersion, found := state.GetResourceVersions()[name]
	if r, ok := resources.resourceMap[name]; ok {
		nextVersion := resources.versionMap[name]
		if prevVersion != nextVersion {
			filtered = append(filtered, r)
		}
		nextVersionMap[name] = nextVersion
	} else if found {
		toRemove = append(toRemove, name)
	}
}
}

valerian-roche (Contributor Author)

Hi

Can you provide a test case for the aforementioned race condition?
It is also not clear to me where you expect this race to occur: is it that a new request is handled prior to the first response, and its response might be created before the previous one is returned?

Contributor

@valerian-roche
Yes, the race condition arises because a new request from the client is handled prior to the first response; as a result the resourceVersionMap did not have the correct subscription list (since the resourceVersionMap was being used to maintain the subscription list).

As you are maintaining a separate map for the subscription list, the above race condition will not overwrite or corrupt the subscription list, but the race condition still exists and will corrupt/overwrite the hash of a resource that was already notified to the client.

valerian-roche (Contributor Author)

Can you provide a test showing the issue?
From my understanding, you have:

  • a first request subscribes to resource A. At this stage the subscribed list is set to [A], and the version map is empty
  • a response for this request is being built and has been queued for replying, but at this stage a second request is processed, adding a subscription for B. Now the subscribed list is [A, B] and the version map is still empty
  • the response for the first request is sent. You get an update for A. The version map is now [A]
  • the response for the second request is sent. You get an update for A and B, and the version map is now [A, B]. The response for the second request was processed prior to the version map being updated for the first request, so A is notified twice when it was not needed

I am not fully clear on what you mean by "corrupted" here, as the version map should be good in the end. The xDS protocol clearly states that the control-plane can send updates for resources which have not changed or have not been explicitly subscribed to; from https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#how-the-client-specifies-what-resources-to-return:

Normally (see below for exceptions), requests must specify the set of resource names that the client is interested in. The management server must supply the requested resources if they exist. The client will silently ignore any supplied resources that were not explicitly requested. When the client sends a new request that changes the set of resources being requested, the server must resend any newly requested resources, even if it previously sent those resources without having been asked for them and the resources have not changed since that time. If the list of resource names becomes empty, that means that the client is no longer interested in any resources of the specified type.

The opposite case, where we would not send an update for a resource that was removed and then re-added, is a big part of why the logic is the way it is, as it eventually converges to the latest version being sent for all requested resources. The change in #615 potentially creates this case, as the version map no longer reflects what the last returned objects were (an object removed and then re-added while still at the same version will not be sent). This is highly problematic for delta EDS and CDS updates, which can create this specific setup.
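
To make the sequence above concrete, here is a minimal, self-contained Go sketch (hypothetical names, not go-control-plane code) mimicking the version-map filtering shown in the diff: both responses are built against the same empty version map, so A is sent twice, but the map still converges to the latest versions once the last response is applied.

package main

import "fmt"

// buildResponse mimics the filtering loop from the diff: resources whose cached
// version differs from the previously sent version are (re)sent, and the next
// version map is rebuilt from the subscription list.
func buildResponse(subscribed []string, prevVersions, cacheVersions map[string]string) ([]string, map[string]string) {
	var sent []string
	next := make(map[string]string, len(subscribed))
	for _, name := range subscribed {
		if v, ok := cacheVersions[name]; ok {
			if prevVersions[name] != v {
				sent = append(sent, name)
			}
			next[name] = v
		}
	}
	return sent, next
}

func main() {
	cache := map[string]string{"A": "v1", "B": "v1"}
	versionMap := map[string]string{} // empty: no response has been applied yet

	// Both requests are processed before either response updates the version map.
	sent1, next1 := buildResponse([]string{"A"}, versionMap, cache)
	sent2, next2 := buildResponse([]string{"A", "B"}, versionMap, cache)

	fmt.Println(sent1, next1) // [A] map[A:v1]
	fmt.Println(sent2, next2) // [A B] map[A:v1 B:v1] -> A is resent once
	versionMap = next2        // applying the last response leaves the map correct
	fmt.Println(versionMap)   // map[A:v1 B:v1]
}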


I understand that the Envoy protocol is eventually consistent. The data will not be corrupted with a separate subscription list and resource version map. The logic above always compares the resource in the cache with the resource map. So even if responses are not in the order of requests, in the end the newest version will be recorded in the version map and sent to the client?

valerian-roche (Contributor Author)

Yes, the current code ensures that the latest response will be run against the latest subscription set and with the version map as last set by a response. In a case where envoy would, in very quick succession, subscribe to something, then remove it, then re-subscribe, there is a chance that we would not send the resources again if all three requests are processed before a response is first sent. This can happen because the queueing at https://github.com/envoyproxy/go-control-plane/blob/main/pkg/server/delta/v3/server.go#L198 does not guarantee how many responses can be pending before the select runs, though realistically I would not expect this to occur with more than one pending response in the general case (as the queueing time would have to be longer than the request processing time).
A known remaining issue is when a request reports a NACK. I have another PR to properly identify what has been returned and what has been acknowledged (which can be very different in some cases), but it is pending on the rework of the Cache interface.
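
As a rough illustration, here is a hypothetical, self-contained sketch (not the actual server.go code) of that ordering window: responses are handed off through a channel, and the select loop does not prioritize pending responses over new requests, so several requests can be processed before a response is sent and the version map updated.

package main

import "fmt"

func main() {
	reqCh := make(chan string, 2)  // incoming delta requests
	respCh := make(chan string, 2) // responses queued by worker goroutines

	reqCh <- "request 1: subscribe A"
	reqCh <- "request 2: unsubscribe / resubscribe A" // arrives before response 1 is drained
	respCh <- "response 1 for A"                      // built and queued, not yet sent

	// The select does not prioritize pending responses over new requests, so
	// several requests can be processed before any response is sent (and
	// before sending it updates the version map).
	for i := 0; i < 3; i++ {
		select {
		case req := <-reqCh:
			fmt.Println("processing", req)
		case resp := <-respCh:
			fmt.Println("sending", resp)
		}
	}
}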


Can you elaborate on the sub, remove, re-sub case above? I'm not sure I follow what could go wrong.

valerian-roche (Contributor Author)

The issue is that the VersionMap used to build a response is set when the last response was sent.

  • A first request subscribing to A and B comes in and gets a response. Subscriptions are [A, B] and the version map is [A, B]
  • A second request unsubscribing from B comes in. Subscriptions are [A], and this builds a response for A. The response for A ends up in the dequeue/requeue goroutine
  • Another request comes in, re-subscribing to B. Subscriptions are [A, B], and the response is built with the version map still set to [A, B]. There is no change, so the watch gets created
  • The response for the second request is sent, and now the version map is [A] and the subscriptions are [A, B], with the watch waiting for any cache change. B is still at the same version as when initially sent (otherwise it would be resent and we'd be fine), but B was not sent again, which violates the xDS protocol. That might be addressable by forcing the version to "" as done in https://github.com/envoyproxy/go-control-plane/blob/main/pkg/server/delta/v3/server.go#L265 outside of the wildcard case, but that will likely require more work

To be clear, this will only happen if envoy sends both requests in a very short timeframe (~ms), compared to the current state after #615 where it will occur every time, which fully breaks delta EDS.
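
A minimal sketch (hypothetical values, not go-control-plane code) of why B is dropped in this sequence: the re-subscribe response is filtered against the stale version map that still contains B at its current version.

package main

import "fmt"

func main() {
	cache := map[string]string{"A": "v1", "B": "v1"}

	// Version map as set after the first response, when A and B were sent.
	staleVersionMap := map[string]string{"A": "v1", "B": "v1"}

	// The re-subscribe response for [A, B] is built before the unsubscribe
	// response (which would shrink the version map to [A]) has been applied.
	subscribed := []string{"A", "B"}
	var resend []string
	for _, name := range subscribed {
		if staleVersionMap[name] != cache[name] {
			resend = append(resend, name)
		}
	}
	fmt.Println(resend) // [] -> B is never sent again, violating the xDS protocol
}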

Overall, I'm not really willing to change too much prior to #584 and #586, which are explicitly designed to fix this case and require far more extensive logic updates.


Thanks for the explanation. This seems like a rare scenario, but I am looking forward to the future fix work! BTW, do you mean setting the version to "" when subscribing to a new resource?

Contributor


@valerian-roche You are correct: with your new changes and the separate subscription list, the version map at the end will be correct. It is just that a client sending requests in a very short span would result in resending the resource, and as you said, it should be OK to resend. In fact, resending the same object again was the behaviour even with #615. I think we are good to remove the changes done as part of #615.

Commits:

…mping in binaries
Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>

This reverts commit 0e0f25d.
Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
@haoruolei

BTW do you happen to know when the next go-control-plane release is planned for? I assume this revert will be included.

@valerian-roche (Contributor Author)

@alecholmez is cutting releases, so I would defer to him on this one

@alecholmez (Contributor) left a comment


I'm good with merging this. Just caught up on the PR talk.

Also, we don't have a release schedule per se, but I hope to accomplish the multi-module release soon.

@haoruolei


Thanks for the info. Hopefully this can be released soon
