
High CPU usage #1954

Open · Pionerd opened this issue Mar 27, 2024 · 20 comments
Labels: kind/bug, priority/backlog, target/kubernetes

Pionerd commented Mar 27, 2024

What steps did you take and what happened:

Trivy Operator is constantly using close to 100% CPU. In debug mode the logs are flooded with messages like the ones below (I am not sure whether the operator is supposed to check this often); other than that I do not see anything significant happening:

```
2024-03-27T17:46:29Z    DEBUG    reconciler.ttlreport    RequeueAfter    {"report": {"name":"replicaset-external-secrets-webhook-cd4cbb65c","namespace":"external-secrets"}, "durationToTTLExpiration": "23h52m42.140754071s"}
2024-03-27T17:46:29Z    DEBUG    resourcecontroller    Checking whether configuration audit report exists    {"kind": "ReplicaSet", "name": {"name":"oauth2-proxy-5d9f94db5","namespace":"oauth2-proxy"}}
2024-03-27T17:46:30Z    DEBUG    reconciler.ttlreport    RequeueAfter    {"report": {"name":"daemonset-kured","namespace":"kured"}, "durationToTTLExpiration": "23h51m41.812625045s"}
2024-03-27T17:46:30Z    DEBUG    resourcecontroller    Checking whether configuration audit report exists    {"kind": "DaemonSet", "name": {"name":"prometheus-stack-prometheus-node-exporter","namespace":"prometheus-stack"}}
2024-03-27T17:46:31Z    DEBUG    reconciler.ttlreport    RequeueAfter    {"report": {"name":"replicaset-oauth2-proxy-5d9f94db5","namespace":"oauth2-proxy"}, "durationToTTLExpiration": "23h52m42.795478015s"}
2024-03-27T17:46:31Z    DEBUG    resourcecontroller    Checking whether configuration audit report exists    {"kind": "ReplicaSet", "name": {"name":"thanos-query-frontend-7bb9bfdfd","namespace":"prometheus-stack"}}
2024-03-27T17:46:31Z    DEBUG    reconciler.ttlreport    RequeueAfter    {"report": {"name":"daemonset-prometheus-stack-prometheus-node-exporter","namespace":"prometheus-stack"}, "durationToTTLExpiration": "23h51m36.746551987s"}
2024-03-27T17:46:31Z    DEBUG    resourcecontroller    Checking whether configuration audit report exists    {"kind": "DaemonSet", "name": {"name":"grafana-loki-promtail","namespace":"grafana-loki"}}
2024-03-27T17:46:32Z    DEBUG    reconciler.ttlreport    RequeueAfter    {"report": {"name":"daemonset-grafana-loki-promtail","namespace":"grafana-loki"}, "durationToTTLExpiration": "23h51m37.6956737s"}
2024-03-27T17:46:32Z    DEBUG    resourcecontroller    Checking whether configuration audit report exists    {"kind": "DaemonSet", "name": {"name":"kured","namespace":"kured"}}
```

On average, this workload pattern recurs every 3 to 4 seconds.

What did you expect to happen:
No high CPU usage in an idle situation


Environment:

  • Trivy-Operator version (use trivy-operator version): 0.19.1
  • Kubernetes version (use kubectl version): 1.28 AKS
  • OS (macOS 10.15, Windows 10, Ubuntu 19.10 etc): Ubuntu
Pionerd added the kind/bug label on Mar 27, 2024
chen-keinan (Collaborator) commented:

@Pionerd, have you tried setting resource requests and limits for the operator?
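For example, via the Helm chart's values (a minimal sketch; the exact values path for the operator pod's resources may differ between chart versions):

```yaml
# values.yaml (sketch): resource requests/limits for the trivy-operator pod
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: "1"
    memory: 512Mi
```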

chen-keinan added the priority/backlog and target/kubernetes labels on Mar 28, 2024
Pionerd commented Mar 28, 2024

Yes, resources are set. CPU usage is constantly close to whatever limit I set, up to 4 CPUs. Memory does not seem to be a problem.

I don't mind the app using CPU (we usually don't set CPU limits at all), but from the logs it is not clear to me what the operator is doing that requires so much.

Pionerd commented Apr 2, 2024

I do observe all ClusterRbacAssessmentReports being recreated constantly; none of them has an age of more than 5 minutes.
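A quick way to verify the ages (assuming kubectl recognizes the resource name as written here):

```sh
# List ClusterRbacAssessmentReports sorted by creation time; AGE should
# normally grow toward the 24h TTL instead of resetting every few minutes
kubectl get clusterrbacassessmentreports --sort-by=.metadata.creationTimestamp
```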

chen-keinan (Collaborator) commented Apr 2, 2024

@Pionerd, reports are created when a resource is created or updated, when the reportTTL is exceeded, or when a report is deleted. Does any of this apply to your case?

Pionerd commented Apr 2, 2024

No, none of those. It applies to literally all ClusterRbacAssessmentReports. reportTTL is the chart's default of 24h, and I don't see this behaviour for e.g. ConfigAuditReports.

As mentioned above, the logs (in DEBUG mode) are also flooded with messages related to this:

```
2024-04-02T12:42:51Z    DEBUG    resourcecontroller    Checking whether configuration audit report exists    {"kind": "ClusterRole", "name": {"name":"system:node-problem-detector"}}
2024-04-02T12:42:52Z    DEBUG    resourcecontroller    Checking whether configuration audit report exists    {"kind": "ClusterRole", "name": {"name":"system:controller:replication-controller"}}
2024-04-02T12:42:53Z    DEBUG    resourcecontroller    Checking whether configuration audit report exists    {"kind": "ClusterRole", "name": {"name":"system:controller:resourcequota-controller"}}
etc.
```

chen-keinan (Collaborator) commented:

@Pionerd, the report-existence check is part of normal processing. Do you see reports being generated every few minutes?

Pionerd commented Apr 2, 2024

Yes

chen-keinan (Collaborator) commented:

@Pionerd, a few questions:

  • Are the generated reports the same as the existing ones?
  • Are reports being deleted while new ones are generated?

Pionerd commented Apr 2, 2024

Yes, they are exactly the same. The reports are deleted and then it takes up to a minute for them to reappear.

chen-keinan (Collaborator) commented:

> Yes, they are exactly the same. The reports are deleted and then it takes up to a minute for them to reappear.

That makes sense: deleting a ClusterRbacAssessmentReport triggers a new ClusterRole scan. The only question is what is causing the deletion of the report. It can't be the TTL, as the TTL reconciler does not handle ClusterRbacAssessmentReports, only RbacAssessmentReports. Can you think of some outside process that deletes the ClusterRbacAssessmentReport or the ClusterRole?
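One way to catch the deletions as they happen (assuming your kubectl version supports the `--output-watch-events` flag):

```sh
# Stream ADDED/MODIFIED/DELETED events for the reports; a steady stream of
# DELETED lines would confirm something is removing them continuously
kubectl get clusterrbacassessmentreports --watch --output-watch-events
```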

Pionerd commented Apr 2, 2024

If I scale the operator's deployment down to 0 replicas, e.g. with the command below, the ClusterRbacAssessmentReports remain constant (the deletion stops). This makes the operator itself still a suspect to me.
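(Deployment name and namespace here are assumptions based on the chart defaults:)

```sh
# Scale the operator to 0 replicas; the report churn stops while it is down
kubectl scale deployment trivy-operator -n trivy-system --replicas=0
```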

This happens to all ClusterRoles, including all system ones, so I can't imagine how they could be temporarily deleted without causing all kinds of other symptoms in the system.

For your information: this is the same cluster as #1970, and even though InfraAssessments are enabled, they are not generated. So a few weird things seem to be going on.

chen-keinan (Collaborator) commented:

@Pionerd, thanks for this update. I'll have a look at the behaviour with replicas > 0.
Regarding InfraAssessment reports: on AKS they will probably not be supported, as the cloud provider does not allow access to the api-server, controller-manager, scheduler, etc.

Pionerd commented Apr 17, 2024

@chen-keinan I think I found the culprit. Based on #1742 I wrote a pretty extensive exception set. Is it possible that the operator does not handle this in a very CPU-friendly way? Are there any possibilities to improve this (or the exception definitions in general)?

I'd rather not share the whole exception definition here (it reveals our full stack), but I can share it with you if it helps. It contains 26 policy.ksvXXX_exclude_resources.rego blocks, each with one or more exception[rules] blocks, along the lines of the sketch below.
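(A sanitized sketch with made-up names; the package path must match the policy being excluded, and `KSV014` and the workload name are placeholders:)

```rego
# policy.ksv014_exclude_resources.rego (sanitized sketch, names made up)
package builtin.kubernetes.KSV014

# Exclude a specific workload from this check
exception[rules] {
    input.kind == "DaemonSet"
    input.metadata.name == "example-daemonset"
    rules := [""]
}
```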

chen-keinan (Collaborator) commented:

> I'd rather not share the whole exception definition here (it reveals our full stack), but I can share it with you if it helps. It contains 26 policy.ksvXXX_exclude_resources.rego blocks, each with one or more exception[rules] blocks.

Sure, but you'll have to put some real example here so I can try to reproduce it.

Pionerd commented Apr 17, 2024

Please see your Gmail.

chen-keinan (Collaborator) commented:

> Please see your Gmail.

OK, got it now. I will take a look at it later.

Pionerd commented Apr 19, 2024

Unfortunately, the issue reappeared in an environment without those exclusions. It would be really nice if we could make some progress on this issue; please let me know if and how I can assist.

chen-keinan (Collaborator) commented Apr 21, 2024

@Pionerd, can you try disabling the scanners one by one to isolate the problem and see which one is causing the performance issue?
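(A sketch of the relevant Helm values; the exact keys are assumed to match the chart version in use:)

```yaml
# values.yaml (sketch): toggle scanners one at a time to isolate the load
operator:
  vulnerabilityScannerEnabled: true
  configAuditScannerEnabled: false     # the scanner under suspicion here
  rbacAssessmentScannerEnabled: true
  infraAssessmentScannerEnabled: false
  exposedSecretScannerEnabled: true
```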

Pionerd commented Apr 22, 2024

`configAuditScannerEnabled: false` removes the symptoms.

Based on the above, I did an additional test with the configAudit scanner enabled but with the exceptions I sent you earlier disabled. Although CPU usage is also high for the first few minutes, after that it seems to return to normal. Maybe you can check whether you can reproduce the issue using those exceptions.

chen-keinan (Collaborator) commented:

@Pionerd, thanks for the input. I'll have a look at it.
