
Reduce CpuStat.getSystemCpuLoadTicks memory allocation pressure #2605

Merged (2 commits into oshi:master) on Mar 20, 2024

Conversation

@chrisribble (Contributor) commented Mar 19, 2024

Add a new method, FileUtil.readLines, which accepts the number of lines to read and allocates an ArrayList sized to that number of lines. If the file is not readable, return Collections.emptyList() to avoid allocating a list.

Use this new method in CpuStat.getSystemCpuLoadTicks to avoid allocating a String for each line in /proc/stat; instead, allocate a single String and an ArrayList with a capacity of one.

This reduces the char[] allocations backing the per-line Strings by two orders of magnitude (from a few kB to a few tens of bytes) on each call.
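
A minimal sketch of what such a method could look like (the class wrapper, exact signature, and error handling here are illustrative assumptions, not the actual patch):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public final class ReadLinesSketch {
        // Read at most `count` lines; the list is pre-sized so it never grows.
        public static List<String> readLines(String filename, int count) {
            Path file = Paths.get(filename);
            if (!Files.isReadable(file)) {
                // Skip allocating a list for unreadable files
                return Collections.emptyList();
            }
            List<String> lines = new ArrayList<>(count);
            try (BufferedReader reader = Files.newBufferedReader(file)) {
                String line;
                while (lines.size() < count && (line = reader.readLine()) != null) {
                    lines.add(line);
                }
            } catch (IOException e) {
                // The real utility logs read errors; omitted in this sketch
            }
            return lines;
        }
    }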

@chrisribble (Contributor, Author) commented Mar 19, 2024

On my system, /proc/stat currently contains 5927 bytes of data:

$ cat /proc/stat | wc --bytes
5927

However, the first line of /proc/stat contains a mere 55 bytes:

$ cat /proc/stat | head -n 1 | wc --bytes
55

I noticed while profiling one of our applications with Java Flight Recorder and Java Mission Control that CpuStat.getSystemCpuLoadTicks() - which we invoke twice a second so that we have accurate data about recent CPU usage - was responsible for about 10% of all char[] allocations under light load. This surprised me and motivated this PR.

@dbwiddis I tried to mirror the existing code style while keeping the size of this change as minimal as possible, but let me know if you'd like to see a different approach.

codecov bot commented Mar 20, 2024

Codecov Report

Attention: Patch coverage is 84.61538%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 86.88%. Comparing base (e3ef48f) to head (12d0a62).

❗ Current head 12d0a62 differs from the pull request's most recent head 6c19940. Consider uploading reports for commit 6c19940 to get more accurate results.

Files                                            Patch %   Lines
oshi-core/src/main/java/oshi/util/FileUtil.java  84.61%    2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2605      +/-   ##
==========================================
- Coverage   87.05%   86.88%   -0.18%     
==========================================
  Files          30       30              
  Lines        1313     1334      +21     
  Branches      183      189       +6     
==========================================
+ Hits         1143     1159      +16     
- Misses         98      102       +4     
- Partials       72       73       +1     


@dbwiddis (Member) commented:

Hey @chrisribble thanks a ton for this PR! Given that CPU ticks are a "hot path" type operation I'm more than happy to try to optimize here.

I tried to mirror the existing code style

A lot of this file was written in ~2015 with JDK 6 compatibility in mind. We've been JDK8 minimum for a while, and I wonder if you can't do even better here using Files.lines() in a stream.

Read all lines from a file as a Stream. Unlike readAllLines, this method does not read all lines into a List, but instead populates lazily as the stream is consumed.

It may be no better than what you've done, but as long as we're making an improvement, would be nice to compare these options.
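
For illustration, a minimal sketch of that Files.lines() approach, limited to the first count lines (the class and method names are assumptions):

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Collections;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    final class StreamReadSketch {
        // Lazily reads at most `count` lines; the stream must be closed to release the file handle.
        static List<String> firstLines(String filename, int count) {
            try (Stream<String> lines = Files.lines(Paths.get(filename))) {
                return lines.limit(count).collect(Collectors.toList());
            } catch (IOException | UncheckedIOException e) {
                return Collections.emptyList();
            }
        }
    }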

@chrisribble (Contributor, Author) commented Mar 20, 2024

Hey @chrisribble thanks a ton for this PR! Given that CPU ticks are a "hot path" type operation I'm more than happy to try to optimize here.

I tried to mirror the existing code style

A lot of this file was written in ~2015 with JDK 6 compatibility in mind. We've been JDK8 minimum for a while, and I wonder if you can't do even better here using Files.lines() in a stream.

Read all lines from a file as a Stream. Unlike readAllLines, this method does not read all lines into a List, but instead populates lazily as the stream is consumed.

It may be no better than what you've done, but as long as we're making an improvement, would be nice to compare these options.

@dbwiddis do you just want to explore updating the implementation here to use Files.lines(file) to implement FileUtil.readLines instead of the imperative implementation currently in the PR? Or did you have something more ambitious in mind WRT the API exposed by FileUtil?

I wrote a simple JMH benchmark (with the gc profiler enabled) for the following variants; a sketch of the harness is shown after the list:

  1. The old implementation (Files.readAllLines)
  2. The implementation in this PR (sized ArrayList and imperative iteration)
  3. Files.lines(file).limit(count).collect(Collectors.toList())
  4. Files.lines(file).limit(count).collect(Collectors.toCollection(() -> new ArrayList<>(count)))
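
A minimal sketch of what such a harness could look like under JMH (variant 4 differs from variant 3 only in the collector; the benchmark bodies and the COUNT constant are assumptions, not the actual benchmark code):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.TimeUnit;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    // Run with the gc profiler, e.g. `java -jar benchmarks.jar OshiBenchmark -prof gc`
    @State(Scope.Benchmark)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public class OshiBenchmark {

        private static final String PROC_STAT = "/proc/stat";
        private static final int COUNT = 1;

        // 1. Old implementation: read every line of the file
        @Benchmark
        public List<String> readFile() throws IOException {
            return Files.readAllLines(Paths.get(PROC_STAT));
        }

        // 2. PR implementation: sized ArrayList, imperative iteration
        @Benchmark
        public List<String> readLinesImperative() throws IOException {
            List<String> lines = new ArrayList<>(COUNT);
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(PROC_STAT))) {
                String line;
                while (lines.size() < COUNT && (line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
            return lines;
        }

        // 3. Stream limited to COUNT lines, collected to a List
        @Benchmark
        public List<String> readLinesStreamToList() throws IOException {
            try (Stream<String> stream = Files.lines(Paths.get(PROC_STAT))) {
                return stream.limit(COUNT).collect(Collectors.toList());
            }
        }
    }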

Benchmark results:

Benchmark                                                     Mode  Cnt      Score     Error   Units
OshiBenchmark.readFile                                        avgt    5     13.793 ±   0.759   us/op
OshiBenchmark.readFile:gc.alloc.rate                          avgt    5   2263.138 ± 123.886  MB/sec
OshiBenchmark.readFile:gc.alloc.rate.norm                     avgt    5  32728.019 ±   0.001    B/op
OshiBenchmark.readFile:gc.count                               avgt    5    157.000            counts
OshiBenchmark.readFile:gc.time                                avgt    5     91.000                ms
OshiBenchmark.readLinesImperative                             avgt    5     11.597 ±   0.350   us/op
OshiBenchmark.readLinesImperative:gc.alloc.rate               avgt    5   2109.155 ±  63.629  MB/sec
OshiBenchmark.readLinesImperative:gc.alloc.rate.norm          avgt    5  25648.016 ±   0.001    B/op
OshiBenchmark.readLinesImperative:gc.count                    avgt    5    147.000            counts
OshiBenchmark.readLinesImperative:gc.time                     avgt    5     91.000                ms
OshiBenchmark.readLinesStreamToCollection                     avgt    5     13.041 ±   0.431   us/op
OshiBenchmark.readLinesStreamToCollection:gc.alloc.rate       avgt    5   1933.009 ±  63.508  MB/sec
OshiBenchmark.readLinesStreamToCollection:gc.alloc.rate.norm  avgt    5  26432.018 ±   0.001    B/op
OshiBenchmark.readLinesStreamToCollection:gc.count            avgt    5    135.000            counts
OshiBenchmark.readLinesStreamToCollection:gc.time             avgt    5     88.000                ms
OshiBenchmark.readLinesStreamToList                           avgt    5     12.887 ±   0.166   us/op
OshiBenchmark.readLinesStreamToList:gc.alloc.rate             avgt    5   1959.552 ±  25.230  MB/sec
OshiBenchmark.readLinesStreamToList:gc.alloc.rate.norm        avgt    5  26480.018 ±   0.001    B/op
OshiBenchmark.readLinesStreamToList:gc.count                  avgt    5    136.000            counts
OshiBenchmark.readLinesStreamToList:gc.time                   avgt    5     84.000                ms

So the implementation in the PR executes in the least time and allocates the least amount of memory (although the fact that 25k is allocated is definitely something I want to attack in another PR ... probably due to the default 8k buffer size used by the JRE). I'm happy to change it to use Files.lines(file).limit(count).collect(Collectors.toList()), since it's a pretty tiny amount of extra memory, but I think I will have to handle UncheckedIOException and it may be difficult to tune the underlying buffer size later.

But perhaps you actually were wanting to explore returning Stream instead of a List so that the caller can control how much data is processed?

One thing to consider if we return a Stream<String> is that the caller will be responsible for closing it (to avoid leaking file handles/channels until the Stream is GC'd). This also means that the caller may have to deal with UncheckedIOException that can be thrown by the underlying FileChannelLinesSpliterator when a terminal operator is invoked on the Stream (and also potentially when the Stream is closed).

For example:

	// Caller has to close this `Stream` (e.g., using the try-with-resources pattern)
	public static Stream<String> readLines(final String filename, final boolean reportError) {
		Path file = Paths.get(filename);
		if (Files.isReadable(file)) {
			if (LOG.isDebugEnabled()) {
				LOG.debug(READING_LOG, filename);
			}
			try {
				return Files.lines(file);
			} catch (IOException | UncheckedIOException e) {
				// Does this exception handling make sense if the caller has to do it too on close/terminal op?
				if (reportError) {
					LOG.error("Error reading file {}. {}", filename, e.getMessage());
				} else {
					LOG.debug("Error reading file {}. {}", filename, e.getMessage());
				}
			}
		} else if (reportError) {
			LOG.warn("File not found or not readable: {}", filename);
		}
		return Stream.empty();
	}
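
And a hypothetical call site for that Stream-returning variant, showing the try-with-resources and UncheckedIOException handling the caller would take on (FileUtil.readLines here refers to the sketch above, not an existing method):

    import java.io.UncheckedIOException;
    import java.util.stream.Stream;

    final class CallerSketch {
        static String firstLine(String filename) {
            // Caller owns the stream: try-with-resources closes the underlying file channel
            try (Stream<String> lines = FileUtil.readLines(filename, true)) {
                // A terminal operation can surface an IOException wrapped in UncheckedIOException
                return lines.findFirst().orElse("");
            } catch (UncheckedIOException e) {
                return "";
            }
        }
    }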

@dbwiddis (Member) commented:

So the implementation in the PR executes in the least time and allocates the least amount of memory

Thanks for doing the benchmarks! I'm totally fine with merging this PR as-is for this benefit. Goes to prove the "write dumb code" theory that the JIT compiler knows just what to do with this "low level" code.

My thought on the stream APIs is to just get rid of this util class altogether; it dates from a time before those APIs existed (pretty sure readAllLines() replaced something else). But rather than Files.lines(file).limit(count).collect(Collectors.toList()), we'd replace individual "just read the first line" implementations with something like Files.lines(file).limit(count).forEach(s -> doTheThingWith(s)), or some util abstraction that we can pass a function to.
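
A minimal sketch of what such a "pass a function" abstraction could look like (forEachLine and the parseCpuTicks usage below are illustrative assumptions):

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.function.Consumer;
    import java.util.stream.Stream;

    final class LineConsumerSketch {
        // The utility owns opening and closing the stream; the caller supplies what to do
        // with each of the first `count` lines.
        static void forEachLine(String filename, long count, Consumer<String> action) {
            try (Stream<String> lines = Files.lines(Paths.get(filename))) {
                lines.limit(count).forEach(action);
            } catch (IOException | UncheckedIOException e) {
                // Log and fall through, mirroring the existing utility's error handling
            }
        }

        // Hypothetical usage: forEachLine("/proc/stat", 1, line -> parseCpuTicks(line));
    }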

Again, not really worth it unless it gains us benefit.

Can you add a change log entry and do a quick mvn spotless:apply? I'll merge this as-is and we can discuss any other improvements. I really do appreciate you helping improve performance here.

@chrisribble (Contributor, Author) commented Mar 20, 2024

So the implementation in the PR executes in the least time and allocates the least amount of memory

Thanks for doing the benchmarks! I'm totally fine with merging this PR as-is for this benefit. Goes to prove the "write dumb code" theory that the JIT compiler knows just what to do with this "low level" code.

My thought on the stream APIs is to just get rid of this util class altogether; it dates from a time before those APIs existed (pretty sure readAllLines() replaced something else). But rather than Files.lines(file).limit(count).collect(Collectors.toList()), we'd replace individual "just read the first line" implementations with something like Files.lines(file).limit(count).forEach(s -> doTheThingWith(s)), or some util abstraction that we can pass a function to.

Again, not really worth it unless it gains us benefit.

Can you add a change log entry and do a quick mvn spotless:apply? I'll merge this as-is and we can discuss any other improvements. I really do appreciate you helping improve performance here.

Sounds good. I agree that eliminating the utility class is probably better in the long run.

I added the changelog entry (hopefully I did it right) and checked in the results of mvn spotless:apply.

Btw, I also experimented with specifying a buffer size of 100 bytes, and that yields another nice speedup and reduces memory allocation rates by quite a lot (down to just over 9 kB per execution); a sketch of that variant follows the results:

Benchmark                                                       Mode  Cnt      Score    Error   Units
OshiBenchmark.readFile                                          avgt    5     13.712 ±  0.205   us/op
OshiBenchmark.readFile:gc.alloc.rate                            avgt    5   2276.259 ± 34.230  MB/sec
OshiBenchmark.readFile:gc.alloc.rate.norm                       avgt    5  32728.019 ±  0.001    B/op
OshiBenchmark.readFile:gc.count                                 avgt    5    158.000           counts
OshiBenchmark.readFile:gc.time                                  avgt    5    105.000               ms
OshiBenchmark.readLinesImperative                               avgt    5     11.581 ±  0.295   us/op
OshiBenchmark.readLinesImperative:gc.alloc.rate                 avgt    5   2108.791 ± 53.969  MB/sec
OshiBenchmark.readLinesImperative:gc.alloc.rate.norm            avgt    5  25608.016 ±  0.001    B/op
OshiBenchmark.readLinesImperative:gc.count                      avgt    5    147.000           counts
OshiBenchmark.readLinesImperative:gc.time                       avgt    5     96.000               ms
OshiBenchmark.readLinesImperativeBufferSize                     avgt    5      9.923 ±  0.482   us/op
OshiBenchmark.readLinesImperativeBufferSize:gc.alloc.rate       avgt    5    905.831 ± 44.118  MB/sec
OshiBenchmark.readLinesImperativeBufferSize:gc.alloc.rate.norm  avgt    5   9424.014 ±  0.001    B/op
OshiBenchmark.readLinesImperativeBufferSize:gc.count            avgt    5     63.000           counts
OshiBenchmark.readLinesImperativeBufferSize:gc.time             avgt    5     44.000               ms
OshiBenchmark.readLinesStreamToCollection                       avgt    5     13.072 ±  0.129   us/op
OshiBenchmark.readLinesStreamToCollection:gc.alloc.rate         avgt    5   1928.262 ± 18.985  MB/sec
OshiBenchmark.readLinesStreamToCollection:gc.alloc.rate.norm    avgt    5  26432.018 ±  0.001    B/op
OshiBenchmark.readLinesStreamToCollection:gc.count              avgt    5    135.000           counts
OshiBenchmark.readLinesStreamToCollection:gc.time               avgt    5     86.000               ms
OshiBenchmark.readLinesStreamToList                             avgt    5     12.868 ±  0.091   us/op
OshiBenchmark.readLinesStreamToList:gc.alloc.rate               avgt    5   1962.367 ± 13.867  MB/sec
OshiBenchmark.readLinesStreamToList:gc.alloc.rate.norm          avgt    5  26480.018 ±  0.001    B/op
OshiBenchmark.readLinesStreamToList:gc.count                    avgt    5    136.000           counts
OshiBenchmark.readLinesStreamToList:gc.time                     avgt    5     91.000               ms
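
A sketch of the buffered variant benchmarked above, assuming the smaller buffer was passed to the BufferedReader constructor (the exact mechanism used in the experiment isn't shown, so this is an illustration):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    final class SmallBufferSketch {
        // Same imperative loop, but with an explicit (small) char buffer instead of the
        // BufferedReader default of 8K; note the InputStreamReader's decoder still keeps
        // its own internal byte buffer.
        static List<String> readLines(String filename, int count, int bufferSize) throws IOException {
            List<String> lines = new ArrayList<>(count);
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(Files.newInputStream(Paths.get(filename)), StandardCharsets.UTF_8),
                    bufferSize)) {
                String line;
                while (lines.size() < count && (line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
            return lines;
        }
    }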

I think that just makes the API more obtuse, though, since the caller (the place where you know roughly how much data you're going to read) would need to specify the buffer size.

I'm happy to push those changes in a separate PR where we can discuss them or maybe think some more about a better way to expose "read M bytes from N lines in file foo".

Or maybe it makes sense to think about picking a default buffer size that aligns better with the average size of the files that Oshi is reading? The JRE default of 8 kB is probably quite large for most things in /proc.

@dbwiddis merged commit 09d3101 into oshi:master on Mar 20, 2024 (14 checks passed)
@dbwiddis (Member) commented:

I'm happy to push those changes in a separate PR where we can discuss them or maybe think some more about a better way to expose "read M bytes from N lines in file foo".

How about opening an issue to discuss, so it's easier to search for later?

Also take a look at the Auxv class (which reads /proc/auxv as binary). Seems there's something similar we could do with "read until we hit n newlines" rather than a fixed number of bytes.
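
A rough, byte-level sketch of the "read until we hit n newlines" idea (ASCII-only, purely illustrative):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    final class ReadUntilNewlinesSketch {
        // Read raw bytes until `n` newline bytes have been seen (or EOF); the byte-to-char
        // cast is fine for the ASCII content of /proc files.
        static String readUntilNewlines(String filename, int n, int bufferSize) throws IOException {
            byte[] buf = new byte[bufferSize];
            StringBuilder sb = new StringBuilder(bufferSize);
            int newlines = 0;
            try (InputStream in = Files.newInputStream(Paths.get(filename))) {
                int read;
                while (newlines < n && (read = in.read(buf)) != -1) {
                    for (int i = 0; i < read && newlines < n; i++) {
                        if (buf[i] == '\n') {
                            newlines++;
                        }
                        sb.append((char) buf[i]);
                    }
                }
            }
            return sb.toString();
        }
    }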

@dbwiddis (Member) commented:

One other idea: why can't we just go old school and use Scanner? It has a 1 KB buffer rather than 8 KB, and we could just read the values into a long[] while we scan.
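
A quick sketch of that Scanner idea (parsing only the first "cpu" line, with the tick count of 8 as an assumption for illustration):

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.util.Scanner;

    final class ScannerSketch {
        // Parse the leading "cpu" line of /proc/stat directly into a long[] while scanning,
        // with no intermediate String splits.
        static long[] readCpuTicks() throws FileNotFoundException {
            long[] ticks = new long[8]; // field count is an assumption for illustration
            try (Scanner scanner = new Scanner(new File("/proc/stat"))) {
                if (scanner.hasNext()) {
                    scanner.next(); // skip the leading "cpu" token
                }
                for (int i = 0; i < ticks.length && scanner.hasNextLong(); i++) {
                    ticks[i] = scanner.nextLong();
                }
            }
            return ticks;
        }
    }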
