Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

system.processes.count wildly inaccurate #31565

Closed
eric1234 opened this issue Mar 5, 2024 · 5 comments
Closed

system.processes.count wildly inaccurate #31565

eric1234 opened this issue Mar 5, 2024 · 5 comments
Labels
bug Something isn't working receiver/hostmetrics

Comments

@eric1234
Copy link

eric1234 commented Mar 5, 2024

Component(s)

receiver/hostmetrics

What happened?

Description

The total process returned by the processes scraper in hostmetrics is wildly inaccurate.

This is due to gopsutil returning both process count and thread count when otel is just expecting process counts. I have filed a bug report with gopsutil. Depending on how gopsutil fixes this struct will determine how otel should proceed:

  • Maybe they find a way to get the process count via a single syscall from the /proc filesystem and otel remains unchanged. AFAIK there is no way so I don't think this is likely.
  • They might employ the same initial algorithm that otel uses (count directories that are numeric under /proc). This makes it no longer a single syscall and no better than what otel is already doing. We could simplify the otel code by just delegating it all or keep our version and just remove the call to get MiscStat.
  • They may choose to remove that item from the struct so we will just need to remove our use of it and accept the problems with our existing way of counting processes.

We might also prefer to for now take that last option so the metric isn't so wildly inaccurate. I.E. it might be a little short but not several times larger. Then once gopsutil decided we can reconsider.

Steps to Reproduce

  1. Run the hostmetrics receiver with the processes scraper enabled.
  2. Compare the number returned (especially in the Unknown category) with ps -A | wc -l

Expected Result

I expect the number returned by hostmetrics to be similar to the number returned by the ps command.

Actual Result

The process count I see from hostmetrics is often 2-3 times what ps reports.

Collector version

0.94.1

Environment information

Pop!_OS 22.04 - Tried running in docker container and on host

OpenTelemetry Collector configuration

receivers:
  hostmetrics:
    scrapers:
      processes:

processors:
  batch:

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers:
      - hostmetrics
      processors:
      - batch
      exporters:
      - debug

Log output

From `debug` exporter:

...
NumberDataPoints #2
Data point attributes:
     -> status: Str(unknown)
StartTimestamp: 2024-03-05 01:40:08 +0000 UTC
Timestamp: 2024-03-05 03:08:58.186972935 +0000 UTC
Value: 725
...

From `ps`:

$ ps -A | wc -l
307

Additional context

No response

@eric1234 eric1234 added bug Something isn't working needs triage New item requiring triage labels Mar 5, 2024
Copy link
Contributor

github-actions bot commented Mar 5, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@atoulme
Copy link
Contributor

atoulme commented Mar 6, 2024

This report might also join with the work on deprecating the processesscraper: #30895

@eric1234
Copy link
Author

Some update on this. Discussed it with the gopsutil maintainer and we both think counting the numeric directories in /proc is the really only way which is what this project already does internally.

I had initially dismissed this as a strategy for the fix in gopsutil because this project's code says that's not accurate due to being multiple system calls. But for just a simple count (not status) it is still one syscall.

Regarding what this project should do is an open question. It could just update to the latest version of gopsutil (once my fix is merged) and that library call will now have accurate numbers. But it seems weird to me that we are counting the directories twice. Once in this project and once by that library call. I'm not sure there is value.

The above warning indicates it is necessary to get an accurate count. But the count could still be inaccurate. If I had a running process while the status was being collected but it exited before the metric was finalized then the number would be too high. I think it's better without the extra count. That way it's returning the processes at the time we started the metric. We may not know the status of some of those since they exited and the count may be off from what it is by the time the metric is actually emitted but right now it's this weird mix of the count at two different times.

@eric1234
Copy link
Author

Just to follow up. My fix did get merged.

I did a test build of the otel collector with my fix to gopsutil and it now returns much more reasonable number. Once the next release of gopsutil is made we could potentially just update and have much better numbers.

It's still an open question if counting the numeric directories in /proc twice is the right approach.

@atoulme atoulme removed the needs triage New item requiring triage label Mar 30, 2024
@eric1234
Copy link
Author

I'm going to go ahead and close this. Earlier this month a bot updated the contrib repo to the latest version of gopsutil which contains my fix in that library. I still think it's not right that we are basically iterating the /proc pids twice and sampling data essentially at two different times to produce one metric. But at this point at least it's not wildly inaccurate and it's a bit of a judgement call if the current code is correct or not so as far as I am concerned it's resolved well enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working receiver/hostmetrics
Projects
None yet
Development

No branches or pull requests

2 participants