Metrics for System.Runtime #85372

Open
JamesNK opened this issue Apr 26, 2023 · 40 comments
Labels: area-System.Diagnostics.Metric, enhancement, feature-request

@JamesNK (Member) commented Apr 26, 2023

Today there are event counters for System.Runtime: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/available-counters#systemruntime-counters

Metrics should be added to this area. Advantages:

  1. Metrics have new features such as tags (which allow for dimensions) and histograms.
  2. Easier for tests and libraries to listen to and consume with MeterListener.
  3. Some libraries only support collecting custom counters that use System.Diagnostics.Metrics. For example, opentelemetry-dotnet.

What instruments should we have?

  • Existing System.Runtime event counters provide a good starting place.
    • Some of these counters could be combined by using tags.
    • Or tags could provide more information. For example, should the exception-count counter include the exception type name as a tag? Then tooling can provide a breakdown of not just the total exception count but the exception count grouped by type (see the sketch after this list).
  • OpenTelemetry has specs with conventions for counters that a system provides. We should try to provide the same data. Note that counter and tag names don't need to match. We should use .NET conventions.
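
A minimal sketch of how the tagged counter and a test-side MeterListener could look with System.Diagnostics.Metrics, purely as an illustration (the meter name, instrument name, and tag key below are made up, not a proposed design):

using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Producer side: one counter with an exception type tag instead of a separate counter per type.
var meter = new Meter("System.Runtime.Example");
var exceptionCounter = meter.CreateCounter<long>("exception-count");

AppDomain.CurrentDomain.FirstChanceException += (_, e) =>
    exceptionCounter.Add(1, new KeyValuePair<string, object?>("exception-type", e.Exception.GetType().Name));

// Consumer side (e.g. a test): MeterListener sees each measurement along with its tags.
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "System.Runtime.Example")
        l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
{
    foreach (var tag in tags)
        Console.WriteLine($"{instrument.Name} +{value} [{tag.Key}={tag.Value}]");
});
listener.Start();

The same MeterListener pattern is what makes the counters easy for tests and libraries to consume, as noted in the advantages above.
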
@ghost added the untriaged label Apr 26, 2023
@JamesNK (Member, Author) commented Apr 26, 2023

@tarekgh (Member) commented Apr 26, 2023

CC @reyang

@reyang commented Apr 26, 2023

@JamesNK FYI in OpenTelemetry .NET we've implemented these:

Having these in future versions of the runtime would be awesome; the existing OpenTelemetry instrumentation libraries can do runtime detection and leverage the runtime instrumentation if it is there. Eventually, as old versions of the runtime get deprecated, we'll land in a better situation where we don't need a separate instrumentation library as things are "baked in".

@JamesNK (Member, Author) commented Apr 26, 2023

I think OTel would still have something because .NET counters won't follow the OTel naming standard. However, the implementation should be very simple because built-in counters will provide all the information needed.

@noahfalk (Member) commented
@JamesNK is this work you were planning to pursue yourself or just recording the request?

@tommcdon added this to the 8.0.0 milestone Apr 26, 2023
@ghost removed the untriaged label Apr 26, 2023
@tommcdon added the enhancement and untriaged labels and removed the untriaged label Apr 26, 2023
@JamesNK (Member, Author) commented Apr 26, 2023

Recording the request.

@tarekgh modified the milestones: 8.0.0, Future Apr 26, 2023
@tarekgh (Member) commented Apr 26, 2023

@JamesNK I moved this to future milestone. Please let me know if there is strong demand to have this in .NET 8.0.

@samsp-msft (Member) commented
The main reason to implement these as metrics in 8 is so that we can wean people off EventCounters and onto metrics instead. As these are the main process-wide counters, getting them converted will be a major signal towards that goal.

There are likely few counters that need many dimensions here, as most are process wide. We should evaluate the work in comparison to the infrastructure needed to implement it.

@omajid (Member) commented May 3, 2023

Hey folks! I am interested in trying to help implement this.

@noahfalk (Member) commented May 5, 2023

Hi @omajid, glad to have help! I'm guessing that most of the work on this feature will be investigating design options and trying to get a consensus on the best design rather than writing the implementation code. If that is something you are interested in taking a stab at, that's great. If you are interested in having someone else work through the design first, that's fine too, but I don't necessarily know when that would occur.

If you did want to pursue the design part, these are the major questions that come to mind right now:

  1. Do we exactly duplicate the System.Runtime EventCounters for maximal compatibility, or are we going to intentionally make some changes?
  2. If we are making changes, what kind of changes?
    • Making use of new instrument types that weren't available before like Histogram?
    • Making use of tags which weren't available before?
    • Re-naming anything we believe was poorly named before?
    • Removing metrics that were potentially confusing?
    • Reorganizing what metrics are in which Meters?
  3. Is the Meter a static singleton, or are we somehow using the new DI Meter work to create per-DI-container instantiations?

My hunch is that, yes, some kinds of changes are going to be appealing, but we need to figure out the impact of different kinds of changes, whether there is anything we can do to make migration easier, and then decide which changes seem worthwhile. For (3) my guess is that we would make it a static singleton, but we need to figure out how that intersects with the DI Meter work and the new Meter config work, so there might be stalls where one design needs to wait for things to resolve in the other, or they have to be resolved simultaneously.

I think there is design inspiration we could take from https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Process. For changes/removals to existing counters there is also some past discussion here: #77530

So if all this sounds like something you still want to dive into I think a first step would be to create an initial proposal (in a gist or a PR'ed markdown doc) describing what instruments would be exposed. Thanks!

@omajid (Member) commented May 5, 2023

Hey, @noahfalk! Thanks for the various links. These are great questions.

I have been prototyping an implementation and I came up with some similar questions (and some possible answers to what you asked). I wouldn't mind helping with the design, though I am not a runtime or OpenTelemetry expert. Advice from anyone more familiar with this is more than welcome.

I have been looking at OpenTelemetry's docs as a great starting point from which to evaluate design ideas.

Do we exactly duplicate the System.Runtime EventCounters for maximal compatibility, or are we going to intentionally make some changes?

I think if we are creating a Metrics based implementation for first-class support for OpenTelemetry, we should take advantage of that and provide similar (or additional) information, but in a way that is easier to consume and/or feels more natural for anyone looking to consume it via an OpenTelemetry-compatible tool.

The opentelemetry-dotnet-contrib docs almost match the existing EventCounters of System.Runtime, with a few differences.

If we are making changes, what kind of changes?

  • Making use of new instrument types that weren't available before like Histogram?

This isn't currently listed as something used in the OpenTelemetry docs, and isn't done in the EventCounter implementation either. So I think we can pass on this for a first stab? If we find some good use cases, we should consider adding Histograms for those.

  • Making use of tags which weren't available before?
  • Re-naming anything we believe was poorly named before?

Yes. In fact, I think we have to. Otherwise we provide OpenTelemetry-compatible metrics but violate all assumptions in the ecosystem, making things harder to parse and use. For example, all our metrics via EventCounters have a single name, but OpenTelemetry expects metrics to be namespaced via dots:

Associated metrics SHOULD be nested together in a hierarchy based on their usage. Define a top-level hierarchy for common metric categories: for OS metrics, like CPU and network; for app runtimes, like GC internals. Libraries and frameworks should nest their metrics into a hierarchy as well. This aids in discovery and adhoc comparison. This allows a user to find similar metrics given a certain metric.

There's also prior art in the form of https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Runtime which does a great job creating a hierarchy, in the form of process.runtime.dotnet.{gc,jit,memory,etc}.

Though I see that https://github.com/dotnet/runtime/pull/85447/files does things differently and I am not sure why.

Removing metrics that were potentially confusing?

I think we should? We aren't putting our users on a path to success by including things that are confusing or likely to be misinterpreted, especially in a new/fresh design.

I also think we should leave out various -rate (or -per-second) metrics. That's something the OpenTelemetry tooling should compute/handle from the raw data.

Reorganizing what metrics are in which Meters?

I hadn't really thought about this.

The current design (e.g., looking at the output of dotnet-counters) uses the Meter name for namespacing instead of the name of the Instrument. From skimming the OpenTelemetry docs, the convention appears to be to use Meter names for scope. As a non-expert, it's hard for me to understand the scope of, for example, System.Net.Http vs Microsoft-AspNetCore-Server-Kestrel vs System.Net.Sockets. There's a lot of overlap in terms of bytes/handshakes/connections/requests that all three seem to track to some extent.

Is the Meter a static singleton, or are we somehow using the new #77514 to create per-DI-container instantiations?

This shouldn't matter from a usage point of view, right? Could we make a static singleton for now and later switch to DI without breaking users?

@noahfalk (Member) commented May 6, 2023

I wouldn't mind helping with the design, though I am not a runtime or OpenTelemetry expert. Advice from anyone more familiar with this is more than welcome.

No worries. I think it's fine to toss out ideas and then get feedback on them. If we need folks with certain areas of expertise we'll try to find them. Ultimately, if no consensus forms and a contentious decision needs to be made, I can make it.

[Referring to histograms]
This isn't currently listed as something used in the OpenTelemetry docs, and isn't done in the EventCounter implementation either. So I think we can pass on this for a first stab?

If we add a histogram in the future where the Meter previously had no similar instrument defined, that seems straightforward and easy to postpone. What feels less straightforward would be adding a counter or gauge now, then later deciding it would have been better to define that instrument as a histogram. For example, we might propose an ObservableGauge that was an average GC pause duration, and later think oops, maybe that should have been a GC pause duration histogram instead.
So, yes, we could pass on it if we need to, as long as we are careful not to box ourselves into a corner.
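
To make the trade-off concrete, here is a rough sketch of the two shapes being compared; the instrument names and the pause-duration source are hypothetical placeholders, not proposed names:

using System;
using System.Diagnostics.Metrics;

var meter = new Meter("System.Runtime.Example");

// Option A: an observable gauge that reports a computed average pause duration when observed.
// GetAverageGcPauseMs() is a stand-in for however that average would actually be computed.
meter.CreateObservableGauge("gc-pause-average-ms", () => GetAverageGcPauseMs());

// Option B: a histogram that records every individual pause; back-ends can then derive averages,
// percentiles, and distributions, which is hard to retrofit onto a gauge after it has shipped.
Histogram<double> pauseHistogram = meter.CreateHistogram<double>("gc-pause-duration-ms");
// pauseHistogram.Record(pauseMilliseconds); // called once per observed GC pause

static double GetAverageGcPauseMs() => 0.0; // placeholder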

Otherwise we provide OpenTelemetry-compatible metrics but violate all assumptions in the ecosystem, making things harder to parse and use. For example, all our metrics via EventCounters have a single name but OpenTelemetry expects metrics to be namespaced via dots:

A common pattern that has arisen with the OTel work is that .NET will have some pre-existing convention or naming scheme, then OTel defines a new scheme that isn't consistent with it. No matter which one we choose, it will always be inconsistent with something: either with OTel recommendations or with a .NET developer's past experience of the platform. When this happens we try to make a judgement call about which behavior more .NET developers are going to prefer in the long run, and often we do wind up favoring .NET self-consistency over OTel consistency. The pattern we've landed on in other places (example) is that we are staying consistent with .NET metric naming conventions rather than switching to OTel naming conventions. I'm expecting we'd do the same here. For folks who want something that conforms tightly to OTel naming and semantic conventions, the instrumentation packages from OTel such as https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Runtime better fill that role right now. I expect what we'll want to build fairly soon (but not as part of this PR) are mechanisms that make schema conversion very easy so that users can get the data into whatever shape they need.

There's also prior art in the form of https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Runtime

Whoops, that is actually the one I meant to link to above rather than Process. Glad you found it anyways :)

Though I see that https://github.com/dotnet/runtime/pull/85447/files does things differently and I am not sure why.

Above I mentioned how we can't be consistent with both past-precedent and with OTel so we had to make a choice. As to why we choose this way, a few reasons:

  • For many .NET users, Meters are simply an updated API used for metrics. Although we designed the API in coordination with OpenTelemetry, that doesn't mean OpenTelemetry is the only scenario that makes use of them and users may not even be aware of OpenTelemetry. For example Meters can also be used with Prometheus.NET, dotnet-counters, and dotnet-monitor. So it isn't automatically going to follow for many users that changing an API also means all the past conventions need to change too.
  • The OTel semantic conventions are currently all marked Experimental, and they have been that way a long time. Choosing to use an experimental standard in a platform that tries to maintain strong back-compat is risky because the specification could change after we've already shipped the experimental spec. If our built-in metrics follow a noticeably different convention then users don't expect the data to use the OTel schema, and they know to look around for an OTel-convention instrumentation or a conversion option if that is what they need. However, if the built-in metrics look pretty close, users probably assume they exactly implement the convention and get aggravated when more subtle discrepancies emerge.
  • As a platform we need to be able to add metrics without waiting for OTel to define a convention for a given scenario first. It's awkward if OTel adds a convention later, chooses different semantics than what we picked, and then developers are retroactively aggravated that the platform data matches OTel conventions in some places but not others. I can imagine versioning schemes might help resolve this scoping issue in the future, but as far as I am aware the OTel schema versioning mechanisms are also experimental right now.
  • Even though OTel is gaining a good deal of mind-share, there are a variety of other standards and conventions out there, including private ones used within a particular company or project. We expect that no matter what we start with there will be a desire for conversions for a long time to come, so we are better off leaving our default conventions as-is and investing in conversion mechanisms.
  • The OTel.NET project is providing instrumentation that follows OTel conventions closely so .NET devs that want that option do have it.

I also think we should leave out various -rate (or -per-second) metrics. That's something the OpenTelemetry tooling should compute/handle from the raw data.

Yeah, that sounds pretty reasonable to me as well.

Reorganizing what metrics are in which Meters?
I hadn't really thought about this.

I think the one place this came up in the past was in the discussion of GC-related metrics. One set of customers will ask for fairly detailed metrics in a specific area, but I worried that if we add too much to the runtime Meter it will be confusing for users who have simple needs. I think the direction things were going, though, is that we shouldn't worry too much about adding more detailed metrics as long as most users are seeing the metrics via dashboards and docs that can guide them in slowly, rather than seeing raw dumps of every available instrument.

This shouldn't matter from a usage point of view, right? Could we make a static singleton for now and later switch to DI without breaking users?

One area it might matter is with the Meter config work. For example, in logging there is no concept of a static singleton such as static Logger s_logger = new Logger("System.Runtime"). This means that APIs like LoggingBuilder.AddFilter("System.Runtime", ...) only apply to ILoggers that were created from the associated ILoggerFactory, not to singletons that have no connection to the DI container. If there were an equivalent MetricsBuilder.AddFilter("System.Runtime", ...), it raises the question: would it recognize the static singleton, would it be ignored, or would a special overload or different API call be required for static things?
I'm guessing static singleton is the likely outcome and this is probably orthogonal to most of the other design choices so we could just assume that is the case for now. But at some point we'll need to nail it down.

@samsp-msft (Member) commented
A couple of questions:

  • Do we want to take a dependency on DI for basic counters? ASP.NET Core is all-in on DI, so it's not a concern for that workload, but is it something that all processes will be using?
  • About half of the existing "runtime" counters are for the GC. Should they be namespaced into something more specific for the GC, and maybe have one more general counter for the overall memory usage (the number that, once exceeded, will cause Kubernetes etc. to kill the process) in the main bucket? Is that the working set?
    • I'm thinking of when you go to something like App Insights and a hierarchy of metrics is shown: how do customers know which are the most important metrics to monitor to understand process health, and to base their load balancing etc. on?
    • It feels like some of these counters are more for diagnostics purposes than for proactive health monitoring. Should we be naming/namespacing them to make it clearer to users how counters are intended to be used?
  • Which dimensions should be added to counters is a lot less obvious than for http/networking.
    • GC counters are global, and I don't think we want to start tracking the types for each allocation.
    • Exception count could be bucketed by the type of the exception, but I'd be worried about an explosion in dimension values.
    • Deeper diagnostics on what is on the threadpool queue feels like something that should be exposed through the debugger, rather than as dimensions.
    • JIT counters such as method-jitted-count could be dimensioned based on the assembly or namespace. I am wondering if that would make cases where lambdas cause constant re-JIT easier to understand?

@noahfalk (Member) commented
Do we want to take a dependency on DI for basic counters?

Recently I've been assuming that the runtime counters will be a static singleton Meter defined in System.Diagnostics.DiagnosticSource, so no dependency on DI there. However, if you want to listen to it via the Meter config work, that would take a DI dependency. Other ways such as MeterListener, OpenTelemetry, or external tools do not require DI.

About half of the existing "runtime" counters are for the GC. Should they be namespaced into something more specific for the GC, and maybe have one more general counter for the overall memory usage

I think a good portion of the counters are the result of EventCounters compensating for not having dimensions. For example, there are 5 different heap size metrics (gen0, gen1, gen2, LOH, POH) that could probably be a single metric with a dimension. I'd suggest we start with a design that doesn't explicitly namespace them, and if it still feels overwhelming then think about how we'd split them.
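
As a rough illustration of that tag-based collapse (the meter and instrument names are made up, and GC.GetGCMemoryInfo().GenerationInfo is just one possible data source):

using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

var meter = new Meter("System.Runtime.Example");

// One instrument instead of five: each observation carries a "generation" tag.
meter.CreateObservableGauge("gc-heap-size-bytes", GetHeapSizes);

static IEnumerable<Measurement<long>> GetHeapSizes()
{
    string[] genNames = { "gen0", "gen1", "gen2", "loh", "poh" };
    GCMemoryInfo info = GC.GetGCMemoryInfo();
    var measurements = new List<Measurement<long>>();
    for (int i = 0; i < info.GenerationInfo.Length && i < genNames.Length; i++)
    {
        measurements.Add(new(info.GenerationInfo[i].SizeAfterBytes,
            new KeyValuePair<string, object?>("generation", genNames[i])));
    }
    return measurements;
}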

is that the working set?

Probably. I like OpenTelemetry's approach of separating these metrics as 'Process' metrics rather than 'Runtime'. These high-level stats like CPU-usage and VM-usage are measurements the OS tracks for all processes rather than anything specific to a language runtime like Java, Python, or .NET.

I'm thinking when you go to something like App Insights and a hierarchy of metrics is shown, how do customers know which are the most important metrics to monitor to understand the process health, and base their load balancing etc on.

I'm hoping we don't have a huge number of them + we can provide better guidance than we currently do. Today our docs mostly say "Here is what each counter measures". I think we should get to the point where the docs say "These are the counters we think are most useful for health monitoring, here is a default dashboard you can use, here is how you might use this data to start an investigation of different common problems..."

should we be naming/namespacing then to make it clearer to users on how counters are intended to be used?

I'm hoping we aren't going to have so many that more elaborate naming schemes are needed, but I certainly don't rule it out. I'd propose starting with something that looks like OTel's runtime metrics but using .NET's traditional naming conventions.

Which dimensions should be added to counters is a lot less obvious than for http/networking...

I think looking at what OpenTelemetry did with runtime metrics is a good starting point. I'm guessing we'd land somewhere quite similar.

@MihaZupan (Member) commented
From #79459 (comment):

Hey folks, we had another ask from the same team for an additional memory metric.

Ask: The ask is for visibility into the free space in each generation of the heap.

Context: We sometimes see large heap footprints and are unable to determine whether they are mostly free but for whatever reason uncompacted (transient allocation activity leaving large uncompacted holes), or whether they are actually rooted and we need to intervene, reboot the node, and investigate a likely leak.
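
For context on what is already reachable from managed code: GC.GetGCMemoryInfo() exposes per-generation size and fragmentation (free space) after the last GC, which is roughly the breakdown being asked for. A quick sketch of reading it, shown only to illustrate the underlying API rather than as a proposed metric:

using System;

// Per-generation size and free (fragmented) space observed after the most recent GC.
GCMemoryInfo info = GC.GetGCMemoryInfo();
string[] genNames = { "gen0", "gen1", "gen2", "loh", "poh" };

for (int i = 0; i < info.GenerationInfo.Length && i < genNames.Length; i++)
{
    GCGenerationInfo gen = info.GenerationInfo[i];
    Console.WriteLine($"{genNames[i]}: size={gen.SizeAfterBytes:N0} bytes, free (fragmentation)={gen.FragmentationAfterBytes:N0} bytes");
}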

@davidfowl (Member) commented
cc @Maoni0 for the last metric

@stevejgordon (Contributor) commented
@noahfalk There is an ongoing discussion at open-telemetry/semantic-conventions#956 about introducing a semantic convention for .NET CLR runtime metrics to align with the existing Java conventions. I'm looking at whether this is something I (and Elastic) can contribute to. Would this be a useful exercise as the groundwork for this issue in terms of planning the instrumentation types, attributes etc?

Regarding naming, it sounds like the preference will be to use .NET naming conventions for any new metrics. Is an option to choose the naming based on an "opt-in to semantic conventions" environment variable practical? Has anything like that been done elsewhere in the runtime?

I've got some cycles coming up to spend on this effort, so looking at places I can help get stuck in and try to add some value.

It would certainly be nice to get to a place where we have built-in metrics and additional contrib libraries are not needed.

@noahfalk (Member) commented May 4, 2024

Hey Steve, thanks for reaching out and volunteering! I very much agree it would be nice to get these in the runtime by default.

I think this item was in a list of items that @tarekgh was tracking. @tarekgh - any concerns if Steve were to run with this? Just so you are aware: I know .NET 9 probably feels far off, but we've got about ~2 months where it's not too hard to get PRs merged. Once we hit July the bar starts going up, and it's probably not long until feature-sized work that isn't yet merged gets automatically postponed until .NET 10.

Regarding naming, it sounds like the preference

The naming discussion above is out-of-date now. I think somewhere around July last year we made a shift in strategy, renamed all the metrics using OTel conventions, and pushed to get all conventions we were depending on marked stable quickly before .NET 8 shipped. So far I'm feeling pretty happy we did that because it eliminated the bifurcation between .NET naming conventions and OTel naming conventions which was causing confusion. As for the runtime metrics I think it gives us a clear path - we'd use OTel naming conventions only.

I know above there was also some discussion about which metrics should we have and how to organize them. I think I was leaving it too open-ended before. Now I'd suggest we should adopt the conventions already implemented by OTel's runtime instrumentation as the presumptive design. If folks have feedback or want to propose changes to that design we can certainly do so.

How does that sound?

@tarekgh (Member) commented May 4, 2024

I am fine with having Steve start it. Please keep me involved, as I guess I still need to handle the design review.

@stevejgordon (Contributor) commented
@noahfalk That sounds great, and much of the complexity is resolved now. I'd be happy to specify the planned metrics and attributes in the semantic conventions based on what the OTel SDK already generates today. In terms of my time, I'd be happy to contribute to the implementation too. I wasn't expecting agreement to move forward to come quite this fast! I'm OOO this week and the last two weeks in May, but I'll try to focus on this in the week between and lay the groundwork at least.

@stevejgordon (Contributor) commented
@tarekgh Do you want me to create a new API review issue for the proposed new public types needed to implement this?

@tarekgh (Member) commented May 7, 2024

Do you want me to create a new API review issue for the proposed new public types needed to implement this?

No, please use this issue and add the proposal at the top. If you cannot access that, please paste it as a comment and I'll move it to the top when we all agree on the shape of the proposal. We need to have one place to look. Thanks!

@stevejgordon (Contributor) commented
@noahfalk / @tarekgh - I've opened an initial semantic convention PR to propose adding experimental runtime metrics.

@stevejgordon (Contributor) commented
@noahfalk, had you envisioned a plan for the static Meter that would be used here? As far as I can see, this would be the first static Meter without some direct instrumentation code that could trigger its initialisation. Ideally, an app owner could observe the meter just by knowing the name; however, making the type static means it won't be initialised until one of the static fields is accessed by something.

We could port almost as-is from the contrib code, but the contrib package has the advantage that the extension method used to add the instrumentation can also trigger the static initialisation. For a likely use case with OTel, if we did it like the code below, nothing would happen because the Meter isn't actually created.

using System.Diagnostics.Metrics;
using OpenTelemetry;
using OpenTelemetry.Metrics;

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("Microsoft.Diagnostics.Runtime")
    .AddConsoleExporter((_, m) => m.PeriodicExportingMetricReaderOptions.ExportIntervalMilliseconds = 10000)
    .Build();
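// Note: .AddMeter above only registers interest in the name "Microsoft.Diagnostics.Runtime";
// nothing in this program touches the RuntimeMetrics type, so its static constructor never runs,
// the Meter is never created, and the exporter has nothing to report.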

GC.Collect(0);
Console.ReadKey();

internal static class RuntimeMetrics
{
    public const string MeterName = "Microsoft.Diagnostics.Runtime";

    private static readonly Meter s_meter = new(MeterName);

    private static readonly string[] GenNames = ["gen0", "gen1", "gen2", "loh", "poh"];

    static RuntimeMetrics() 
    {
        _ = s_meter.CreateObservableCounter(
            "process.runtime.dotnet.gc.collections.count",
            GetGarbageCollectionCounts,
            description: "Number of garbage collections that have occurred since the process started.");
    }

    private static IEnumerable<Measurement<long>> GetGarbageCollectionCounts()
    {
        long collectionsFromHigherGeneration = 0;

        for (int gen = 2; gen >= 0; --gen)
        {
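            // GC.CollectionCount(gen) also counts collections of higher generations (a gen2 GC collects gen0-2),
            // so subtracting the higher generation's count leaves the collections whose deepest generation was 'gen'.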
            long collectionsFromThisGeneration = GC.CollectionCount(gen);
            yield return new(collectionsFromThisGeneration - collectionsFromHigherGeneration, new KeyValuePair<string, object?>("generation", GenNames[gen]));
            collectionsFromHigherGeneration = collectionsFromThisGeneration;
        }
    }
}

@tarekgh (Member) commented May 14, 2024

I've opened an initial open-telemetry/semantic-conventions#1035 to propose adding experimental runtime metrics.

@stevejgordon I am wondering why we need to have this in OpenTelemetry? Why can't we list the detailed proposal here and then proceed with that?

@stevejgordon (Contributor) commented
@tarekgh I'm trying to drive both the implementation here and also getting CLR metrics into the conventions as a first-class citizen, as JVM recently did. I'm happy to summarise the metrics names/types, etc., here also as we work on the implementation.

@tarekgh (Member) commented May 14, 2024

@stevejgordon thanks. Just to make sure I understand: is it planned to have anything on the OpenTelemetry side? And why? Sorry if I am missing something here.

Never mind. I looked at the PR and I see it is just a doc.

@noahfalk (Member) commented
Ideally, an app owner could observe the meter just by knowing the name; however, making the type static means it won't be initialised until one of the static fields is accessed by something.

What if we force anyone who creates a MeterListener to implicitly initialize it? Something like this:

public MeterListener()
{
    EnsureBuiltinMetersInitialized();
}

static void EnsureBuiltinMetersInitialized()
{
    RuntimeMeter.Initialize();
}

@tarekgh (Member) commented May 15, 2024

@noahfalk would your suggestion make MetricsEventSource work if no one created a listener? I mean, can tools like dotnet-counters report the runtime metrics for apps that didn't initialize a listener?

@JamesNK (Member, Author) commented May 15, 2024

I've opened an initial open-telemetry/semantic-conventions#1035 to propose adding experimental runtime metrics.

@stevejgordon I am wondering why we need to have this in OpenTelemetry? Why can't we list the detailed proposal here and then proceed with that?

I highly recommend getting input from OTEL experts, e.g. @lmolkova, on counters, tags and naming. It was a great help when putting together aspnetcore metrics.

Also, we should document the metrics on learn.microsoft.com and in the OTEL semantic conventions docs. With aspnetcore metrics there are lightweight docs on learn.microsoft.com, with links to detailed docs in the OTEL semantic conventions for people who want more detail.

@noahfalk (Member) commented
I mean, can tools like dotnet-counters report the runtime metrics for apps that didn't initialize a listener?

dotnet-counters connects to the MetricsEventSource which uses a MeterListener internally to obtain the data. There shouldn't be any alternative path to get the data from arbitrary Meters (excluding truly shady approaches like private reflection).

I highly recommend getting input from OTEL experts

+1. I think that is the path we are already on by virtue of posting the sem-conv proposal in the OTel repo.

@stevejgordon (Contributor) commented
@noahfalk Yeah, I was heading in that direction myself, although I was hoping to avoid it if there was some clever way. One thing I did consider this morning was whether AppContext.Setup could also be used to trigger the initialisation. Using the MeterListener should work, though, as someone, whether that be an end user or the OTel SDK, will end up creating a listener for this if they care about observing the metrics.

@noahfalk (Member) commented
Btw @lmolkova is currently out, but she is scheduled to be back in a week. I'm glad to get other feedback but I do want to get her feedback specifically on this one :)

@stevejgordon (Contributor) commented
I'm out for two weeks starting on Monday, but I will keep an eye on these discussions. I'll continue playing with a POC to implement this, and once we have a design, we can determine what (if anything) needs to be prepared for API review, etc.

@tarekgh (Member) commented May 15, 2024

@noahfalk @stevejgordon I am looking at the PR open-telemetry/semantic-conventions#1035, and I am seeing that the proposal is missing at least three metrics compared to what we expose in https://learn.microsoft.com/en-us/dotnet/core/diagnostics/available-counters#systemruntime-counters.

  • Working Set (working-set) The number of megabytes of physical memory mapped to the process context at a point in time based on Environment.WorkingSet.
  • Gen 0 GC Budget based on GC.GetGenerationBudget(0)
  • % Time in GC since last GC based on GC.GetLastGCPercentTimeInGC().

https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/RuntimeEventSource.cs

Is it intentional we don't want to include these?

@stevejgordon (Contributor) commented
Is it intentional we don't want to include these?

@tarekgh, not really. @noahfalk suggested starting with the metrics exposed via the existing OTel contrib library, so I didn't review the runtime event source. We can consider proposing those, too, or they could be added later.

@noahfalk An alternative implementation I've been thinking about since last night is whether we should consider adding an EventListener for the GC events, which could be used to update most of the GC instruments. One advantage of this is that instead of observable instruments, we can switch to non-observable forms, with their values updated only after GC events. For the GC metrics, at least, I think most (maybe all) of them will only change after a GC occurs. Perhaps preferring events is slightly more efficient since we don't poll the various GC.xyz() methods. I've not dug deeply into this yet, but I'm throwing it out for discussion. Perhaps @Maoni0 has a view of the "best" way to collect and update GC metrics?
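
For reference, a rough sketch of that idea (not a committed design): an EventListener that enables the GC keyword (0x1) on the Microsoft-Windows-DotNETRuntime provider and bumps a non-observable counter when a GC finishes. The meter and counter names here are illustrative only.

using System;
using System.Diagnostics.Metrics;
using System.Diagnostics.Tracing;

internal sealed class GcEventCounters : EventListener
{
    private static readonly Meter s_meter = new("System.Runtime.Example");
    private static readonly Counter<long> s_gcCount = s_meter.CreateCounter<long>("gc-collection-count");

    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // The runtime provider; keyword 0x1 enables the GC events.
        if (eventSource.Name == "Microsoft-Windows-DotNETRuntime")
        {
            EnableEvents(eventSource, EventLevel.Informational, (EventKeywords)0x1);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        // Update a non-observable counter only when a GC actually finishes,
        // instead of polling GC.CollectionCount from an observable instrument.
        if (eventData.EventName is not null && eventData.EventName.StartsWith("GCEnd", StringComparison.Ordinal))
        {
            s_gcCount.Add(1);
        }
    }
}

An app, or the runtime itself, would only need to construct one instance of this listener and keep it alive.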

@KalleOlaviNiemitalo commented
Can those GC events be received by an EventListener? I'd imagine that the events are fired to ETW from the unmanaged part of the runtime, rather than by EventSource, and that makes them invisible to EventListener.

@stevejgordon (Contributor) commented
Can those GC events be received by an EventListener? I'd imagine that the events are fired to ETW from the unmanaged part of the runtime, rather than by EventSource, and that makes them invisible to EventListener.

@KalleOlaviNiemitalo, I believe they are piped through and can be observed as per this post from @Maoni0.

The reason I am considering this as an option is it opens the door to collecting GC duration and perhaps some other useful metrics if we base at least some of them on these richer events.

@KalleOlaviNiemitalo commented
I see; the events are buffered in unmanaged EventPipe code, and a thread in managed code pulls them via EventPipeInternal.GetNextEvent, so the runtime doesn't need to call managed code in the middle of garbage collection.

@noahfalk (Member) commented

Working Set (working-set) The number of megabytes of physical memory mapped to the process context at a point in time based on Environment.WorkingSet.

I would deliberately not include this one. OpenTelemetry includes WorkingSet, CPU, and other OS-level metrics in a separate group of process metrics. I think it's fine if we had a built-in implementation of process metrics too; I just wouldn't lump them in the same Meter with the runtime metrics.

Gen 0 GC Budget based on GC.GetGenerationBudget(0)

I'd be fine with it as long as @Maoni0 is. It also raises the question of whether we only want the gen0 value of this or whether we want higher generation budgets too.

% Time in GC since last GC based on GC.GetLastGCPercentTimeInGC().

This metric has a history of being confusing, and I think folks would be better off observing the rate of change of the clr.gc.pause.time metric. We did look at adding this to the OTel metrics in the past and decided against it. Some past discussion.

An alternative implementation I've been thinking about last night is whether we should consider adding an EventListener for the GC events which could be used to set most of the GC instruments

Although it works functionally, I'd worry you are going to incur higher perf overhead for no clear benefit. Creating the first EventListener in the process requires a thread to pump the events for callbacks, plus blocks of virtual memory are allocated to store the buffered events prior to dispatching them.
