Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add node health status into config #4758

Merged
merged 1 commit into from
Mar 13, 2025
Merged

Conversation

BorysTheDev
Copy link
Contributor

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
@BorysTheDev BorysTheDev requested a review from adiholden March 12, 2025 15:13
@@ -94,6 +94,12 @@ struct ClusterNodeInfo {
}
};

enum class NodeHealth : std::uint8_t { NONE, FAIL, LOADING, ONLINE };

struct ClusterExtendedNodeInfo : ClusterNodeInfo {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you need this and not just add the new member to ClusterNodeInfo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because ClusterNodeInfo is used for migration

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Ok

@@ -149,21 +149,24 @@ ClusterShardInfo ClusterFamily::GetEmulatedShardInfo(ConnectionContext* cntx) co
? static_cast<uint16_t>(absl::GetFlag(FLAGS_port))
: cluster_announce_port;

info.master = {.id = id_, .ip = preferred_endpoint, .port = preferred_port};
info.master = {{.id = id_, .ip = preferred_endpoint, .port = preferred_port}, NodeHealth::NONE};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not online here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure regarding online for emulated cluster so I decided to make NONE, when I will do the next step I will have better understanding what we need and it will be easy to find this place and refactor

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

today we alway reply with online for all nodes in emulated mode, so unless we want to do any change there you should use online here

info.replicas.push_back({{.id = replica.id,
.ip = replica.address,
.port = static_cast<uint16_t>(replica.listening_port)},
NodeHealth::NONE});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to support this for cluster emulated mode we need here to check what is the status of the replica. i.e if its in loading state.
@romange do we want do this for emulated mode as well?

@adiholden
Copy link
Collaborator

@BorysTheDev LGTM.
Also please add this to the design doc and ping @andydunstall on the changes
Thanks

@BorysTheDev BorysTheDev merged commit 8d6a184 into main Mar 13, 2025
10 checks passed
@BorysTheDev BorysTheDev deleted the feat_node_health_status branch March 13, 2025 07:20
auto health = json.at_or_null("health");
if (!health.is_null()) {
if (!health.is_string()) {
LOG(WARNING) << kInvalidConfigPrefix << "invalid health status for node " << json;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BorysTheDev Why does this only warn, but not return an error? (same below)

From the control plane, we'd rather have an error response to the cluster config command rather than silently fail (with a warning we don't monitor)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We removed the most part of the errors according to your request

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this code doesn't affect the stable work of Dragonfly and isn't critical

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We removed the most part of the errors according to your request

What request?

If the control plane ever sends invalid configuration I think it's better to just fail the request, then we'll be notified and can investigate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants