Enable memory monitoring in CS #391

Merged
merged 3 commits into from Nov 22, 2023

@yawangwang yawangwang commented Nov 16, 2023

Enable memory monitoring in CS (see go/monitoring-enablement-in-hardened-image-1-pager).
This PR includes changes to:

  1. Unmask node-problem-detector.service for hardened image.
  2. Override the default config file system-stats-monitor.json to collect memory/bytes_used metrics only for CS.
  3. Configure preload.sh so that both hardened and debug images pick up this change.
  4. Add a library systemctl to use dbus API to start node-problem-detector.service.
  5. Introduce tee.launch_policy.monitoring_memory_allow to launch policy and tee-monitoring-memory-enable to launchSpec.
  6. Add a library nodeproblemdetector which provides methods to override node-problem-detector configurations. This library is not used yet; it is intended for later, if operators want to opt in to more monitoring options (cpu, disk, etc.).
  7. Add image tests to check if memory monitoring is enabled.
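
The interaction between the launch policy and the launchSpec in item 5 can be sketched roughly as follows. This is a minimal illustration, not the launcher's actual code: the two key names come from the description above, but the boolean value format, the helper names (`parseFlag`, `memoryMonitoringEnabled`), and the choice to return an error when the operator requests monitoring that the policy does not allow are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// Key names are taken from the PR description. How the launcher actually
// parses and stores them is an assumption in this sketch.
const (
	policyKey = "tee.launch_policy.monitoring_memory_allow"
	specKey   = "tee-monitoring-memory-enable"
)

// errNotAllowed is a hypothetical error for a request the policy rejects.
var errNotAllowed = errors.New("memory monitoring requested but not allowed by launch policy")

// parseFlag reads a boolean flag from a key/value map.
// A missing or empty value is treated as false (assumed default).
func parseFlag(values map[string]string, key string) (bool, error) {
	raw, ok := values[key]
	if !ok || raw == "" {
		return false, nil
	}
	v, err := strconv.ParseBool(raw)
	if err != nil {
		return false, fmt.Errorf("invalid value %q for %s: %w", raw, key, err)
	}
	return v, nil
}

// memoryMonitoringEnabled reports whether monitoring should be started:
// the operator must request it and the image's launch policy must allow it.
func memoryMonitoringEnabled(imageLabels, metadata map[string]string) (bool, error) {
	allowed, err := parseFlag(imageLabels, policyKey)
	if err != nil {
		return false, err
	}
	requested, err := parseFlag(metadata, specKey)
	if err != nil {
		return false, err
	}
	if requested && !allowed {
		return false, errNotAllowed
	}
	return requested, nil
}

func main() {
	enabled, err := memoryMonitoringEnabled(
		map[string]string{policyKey: "true"},
		map[string]string{specKey: "true"},
	)
	fmt.Println(enabled, err)
}
```

Failing loudly when the request conflicts with the policy, rather than silently disabling monitoring, is one possible design choice; the real launcher may handle the conflict differently.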

@yawangwang yawangwang force-pushed the memory-monitoring branch 2 times, most recently from 6ca4484 to 8f89e86 Compare November 16, 2023 22:46
@yawangwang yawangwang force-pushed the memory-monitoring branch 4 times, most recently from d9dd1d9 to 798579f Compare November 17, 2023 01:22
@yawangwang yawangwang marked this pull request as ready for review November 17, 2023 02:18
}
r.logger.Println("node-problem-detector.service successfully started.")
} else {
r.logger.Println("node-problem-detector.service disabled.")
Contributor

"MemoryMonitoring is disabled" to be consistent

Collaborator Author

Done.

// a problem daemon in node-problem-detector that collects pre-defined health-related metrics from different system components.
// For now we only consider collecting memory related metrics.
// View the comprehensive configuration details on https://github.com/kubernetes/node-problem-detector/tree/master/pkg/systemstatsmonitor#detailed-configuration-options
type SystemStatsConfig struct {
Contributor

I'm not sure about this configuration here. I thought system-stats-monitor-cs.json already has the needed configuration?

Collaborator Author

Yes, your understanding is right: systemstats_config.go is not used anywhere for now. Its value is future-proofing: if customers want to enable other monitoring options in the future (cpu, disk, os, etc.), this file will let them override system-stats-monitor.json at run time instead of at image build time.
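
As a rough illustration of what such a run-time override could produce, here is a minimal sketch that builds a memory-only system-stats-monitor configuration and renders it as JSON. The type and function names (`statsConfig`, `componentConfig`, `metricConfig`, `memoryOnlyConfig`, `render`) are hypothetical and are not the types in systemstats_config.go; the JSON field names follow the layout described in the node-problem-detector system-stats-monitor documentation linked in the snippet above, and the "60s" interval is an assumed example value.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metricConfig describes one metric entry (hypothetical type name).
type metricConfig struct {
	DisplayName string `json:"displayName"`
}

// componentConfig groups the metrics collected for one system component.
type componentConfig struct {
	MetricsConfigs map[string]metricConfig `json:"metricsConfigs"`
}

// statsConfig mirrors a subset of the system-stats-monitor JSON layout.
// Components left nil are omitted from the rendered file.
type statsConfig struct {
	CPU            *componentConfig `json:"cpu,omitempty"`
	Disk           *componentConfig `json:"disk,omitempty"`
	Memory         *componentConfig `json:"memory,omitempty"`
	InvokeInterval string           `json:"invokeInterval"`
}

// memoryOnlyConfig returns a config that collects only memory/bytes_used.
func memoryOnlyConfig(interval string) statsConfig {
	return statsConfig{
		Memory: &componentConfig{
			MetricsConfigs: map[string]metricConfig{
				"memory/bytes_used": {DisplayName: "memory/bytes_used"},
			},
		},
		InvokeInterval: interval,
	}
}

// render serializes the config as indented JSON, ready to be written to disk.
func render(cfg statsConfig) ([]byte, error) {
	return json.MarshalIndent(cfg, "", "  ")
}

func main() {
	out, err := render(memoryOnlyConfig("60s"))
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

Opting in to cpu or disk metrics later would then mean populating the corresponding field and rewriting the file before node-problem-detector.service starts.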

progress := make(chan string, 1)

// Run systemd command in "replace" mode to start the unit and its dependencies,
// possibly replacing already queued jobs that conflict w∏ith this.
Contributor

extra character "w∏ith"

Collaborator Author

Done.

@yawangwang
Collaborator Author

@jkl73 Added three more config files, in addition to system-stats-monitor.json, to turn off system-log-monitor and custom-plugin-monitor for node-problem-detector. These two daemon processes are intended to monitor and report kernel logs, docker logs, and the boot process, but we decided to turn them off.

return fmt.Errorf("failed to run systemctl [%s] for unit [%s]: %v", cmd, unit, err)
}

if result := <-progress; result != "done" {
Contributor

This will block. Do you know the timeout for those dbus functions?

Also, container_runner.go should probably log something to indicate that it is attempting a systemctl operation.

Collaborator Author

dbus will call the underlying C library. You can check the timeout definitions at https://dbus.freedesktop.org/doc/api/html/group__DBusTimeout.html. The dbus config file is usually located at /usr/share/dbus-1/system.conf, with a default timeout of 25s.

Added systemctl logging to container_runner.go.
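
One way to avoid depending on the dbus-side timeout at all is to bound the channel read in the caller. The following is a minimal, standard-library-only sketch of that idea, not the code in systemctl.go: `waitForJob` is a hypothetical helper, and it assumes the job result arrives as a string on the progress channel with "done" meaning success, as in the snippet under review.

```go
package main

import (
	"fmt"
	"time"
)

// waitForJob waits for a job result on progress and gives up after timeout.
// Any result other than "done" is reported as an error.
func waitForJob(progress <-chan string, timeout time.Duration) error {
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case result := <-progress:
		if result != "done" {
			return fmt.Errorf("systemd job finished with result %q", result)
		}
		return nil
	case <-timer.C:
		return fmt.Errorf("timed out after %s waiting for systemd job", timeout)
	}
}

func main() {
	// Simulate a job that has already completed successfully.
	progress := make(chan string, 1)
	progress <- "done"

	if err := waitForJob(progress, time.Second); err != nil {
		fmt.Println("start failed:", err)
		return
	}
	fmt.Println("unit started")
}
```

Because the progress channel is buffered with capacity 1, a result that arrives after the caller has timed out does not block the sender.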

@yawangwang yawangwang merged commit 38bab91 into google:main Nov 22, 2023
11 checks passed
alexmwu added a commit to alexmwu/go-tpm-tools that referenced this pull request Feb 22, 2024
New Features:
[launcher] Add TEE server IPC implementation google#367
[launcher] Enable memory monitoring in CS google#391
Use TDX quote provider to attest and verify google#405
Integrate nonce verification as part of the TDX quote validation procedure. google#395
Add RISC V support google#407
[launcher] Use resizable integrity-fs with in-memory tags google#412

Bug Fixes:
[launcher] Fix launcher exit code google#384
[launcher] Handle exit code checking during deferral evaluation google#392
[cmd] Skip tests that call setGCEAKTemplate google#402
[launcher] Fix teeserver context reset issue & add container signature cache google#397
Set all unused parameters as _ to fix CI lint failure google#411
[launcher] Make customtoken test sleep to mitigate clock skew google#413

Other Changes:
Add eventlog parse logics for memory monitoring google#404
[launcher]: Add memory monitor measurement logics google#408
Update go-tdx-guest version to v0.3.1 google#414

New Contributors:
@KeithMoyer in google#392
@vbalain in google#405
@aimixsaka in google#407
@alexmwu alexmwu mentioned this pull request Feb 22, 2024
alexmwu added a commit that referenced this pull request Feb 22, 2024
alexmwu added a commit to alexmwu/go-tpm-tools that referenced this pull request Mar 29, 2024