-
Notifications
You must be signed in to change notification settings - Fork 19.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
p2p/discover: improved node revalidation #29572
Conversation
p2p/discover/table_reval.go
Outdated
|
||
if !resp.didRespond { | ||
// Revalidation failed. | ||
n.livenessChecks /= 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use / 3
Co-authored-by: Martin HS <martin@swende.se>
This is to better reflect their purpose. The previous naming of 'seen' and 'verified' was kind of arbitrary, especially since 'verified' was the stricter one.
p2p/discover/node.go
Outdated
AddedTable time.Time `json:"addedToTable"` | ||
AddedBucket time.Time `json:"addedToBucket"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The node
has addedToTable
field, and the json tag also contains a to
, so it looks a bit odd that the field name itself is missing the to
, as in AddedToTable
/AddedToBucket
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it's an inconsistency. I will fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, should probably merge and see how it looks live.
In #29572, I assumed the revalidation list that the node is contained in could only ever be changed by the outcome of a revalidation request. But turns out that's not true: if the node gets removed due to FINDNODE failure, it will also be removed from the list it is in. This causes a crash. The invariant is: while node is in table, it is always in exactly one of the two lists. So it seems best to store a pointer to the current list within the node itself.
It seems the semantic differences between addFoundNode and addInboundNode were lost in #29572. My understanding is addFoundNode is for a node you have not contacted directly (and are unsure if is available) whereas addInboundNode is for adding nodes that have contacted the local node and we can verify they are active. handleAddNode seems to be the consolidation of those two methods, yet it bumps the node in the bucket (updating it's IP addr) even if the node was not an inbound. This PR fixes this. It wasn't originally caught in tests like TestTable_addSeenNode because the manipulation of the node object actually modified the node value used by the test. New logic is added to reject non-inbound updates unless the sequence number of the (signed) ENR increases. Inbound updates, which are published by the updated node itself, are always accepted. If an inbound update changes the endpoint, the node will be revalidated on an expedited schedule. Co-authored-by: Felix Lange <fjl@twurst.com>
Node discovery periodically revalidates the nodes in its table by sending PING, checking if they are still alive. I recently noticed some issues with the implementation of this process, which can cause strange results such as nodes dropping unexpectedly, certain nodes not getting revalidated often enough, and bad results being returned to incoming FINDNODE queries.
Let me first describe how the revalidation process worked previously:
livenessChecks
value by one. Since PONG also has the node's ENR sequence number, we request the node's new ENR when it has changed.Now on to issues with the above process:
160 * 5s == 13.3 min
. Note this time applies also to all nodes, even the ones freshly added to the table from a query. It's just too slow to maintain a healthy table.Here is my proposed design for the new revalidation process:
livenessChecks
is incremented by one for successful checks. Unlike the old implementation, we will not drop the node on the first failing check. We instead quickly decay thelivenessChecks
by/ 5
or so to give it another chance.