
Reduce CpuStat.getSystemCpuLoadTicks memory allocation pressure #2605

Merged (2 commits into oshi:master) on Mar 20, 2024

Conversation

@chrisribble (Contributor) commented Mar 19, 2024

Add a new method, FileUtil.readLines, which accepts the number of lines to read and allocates an ArrayList sized to that number of lines. If the file is not readable, return Collections.emptyList() to avoid allocating a list.

Use this new method in CpuStat.getSystemCpuLoadTicks to avoid allocating a String for each line in /proc/stat; instead, allocate a single String and an ArrayList with a capacity of one.

This reduces the char[] allocations backing the per-line Strings by two orders of magnitude (from a few kB to a few tens of bytes) on each call.
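
A minimal sketch of what such a method could look like (the class wrapper, exact signature, and error handling here are illustrative assumptions, not the actual patch):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public final class ReadLinesSketch {
        // Read at most `count` lines; the list is pre-sized so it never grows.
        public static List<String> readLines(String filename, int count) {
            Path file = Paths.get(filename);
            if (!Files.isReadable(file)) {
                // Skip allocating a list for unreadable files
                return Collections.emptyList();
            }
            List<String> lines = new ArrayList<>(count);
            try (BufferedReader reader = Files.newBufferedReader(file)) {
                String line;
                while (lines.size() < count && (line = reader.readLine()) != null) {
                    lines.add(line);
                }
            } catch (IOException e) {
                // The real utility logs read errors; omitted in this sketch
            }
            return lines;
        }
    }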

@chrisribble (Contributor, Author) commented Mar 19, 2024

On my system, /proc/stat currently contains 5927 bytes of data:

$ cat /proc/stat | wc --bytes
5927

However, the first line of /proc/stat contains a mere 55 bytes:

$ cat /proc/stat | head -n 1 | wc --bytes
55

I noticed while profiling one of our applications with Java Flight Recorder and Java Mission Control that CpuStat.getSystemCpuLoadTicks() - which we invoke twice a second so that we have accurate data about recent CPU usage - was responsible for about 10% of all char[] allocations under light load. This surprised me and motivated this PR.

@dbwiddis I tried to mirror the existing code style while keeping the size of this change as minimal as possible, but let me know if you'd like to see a different approach.

codecov bot commented Mar 20, 2024

Codecov Report

Attention: Patch coverage is 84.61538%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 86.88%. Comparing base (e3ef48f) to head (12d0a62).

❗ Current head 12d0a62 differs from the pull request's most recent head 6c19940. Consider uploading reports for commit 6c19940 to get more accurate results.

Files                                            Patch %   Lines
oshi-core/src/main/java/oshi/util/FileUtil.java  84.61%    2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2605      +/-   ##
==========================================
- Coverage   87.05%   86.88%   -0.18%     
==========================================
  Files          30       30              
  Lines        1313     1334      +21     
  Branches      183      189       +6     
==========================================
+ Hits         1143     1159      +16     
- Misses         98      102       +4     
- Partials       72       73       +1     


@dbwiddis (Member) commented:

Hey @chrisribble thanks a ton for this PR! Given that CPU ticks are a "hot path" type operation I'm more than happy to try to optimize here.

I tried to mirror the existing code style

A lot of this file was written in ~2015 with JDK 6 compatibility in mind. We've been JDK8 minimum for a while, and I wonder if you can't do even better here using Files.lines() in a stream.

Read all lines from a file as a Stream. Unlike readAllLines, this method does not read all lines into a List, but instead populates lazily as the stream is consumed.

It may be no better than what you've done, but as long as we're making an improvement, would be nice to compare these options.
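
For illustration, a minimal sketch of that Files.lines() approach, limited to the first count lines (the class and method names are assumptions):

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Collections;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    final class StreamReadSketch {
        // Lazily reads at most `count` lines; the stream must be closed to release the file handle.
        static List<String> firstLines(String filename, int count) {
            try (Stream<String> lines = Files.lines(Paths.get(filename))) {
                return lines.limit(count).collect(Collectors.toList());
            } catch (IOException | UncheckedIOException e) {
                return Collections.emptyList();
            }
        }
    }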

@chrisribble (Contributor, Author) commented Mar 20, 2024

Hey @chrisribble thanks a ton for this PR! Given that CPU ticks are a "hot path" type operation I'm more than happy to try to optimize here.

I tried to mirror the existing code style

A lot of this file was written in ~2015 with JDK 6 compatibility in mind. We've been JDK8 minimum for a while, and I wonder if you can't do even better here using Files.lines() in a stream.

Read all lines from a file as a Stream. Unlike readAllLines, this method does not read all lines into a List, but instead populates lazily as the stream is consumed.

It may be no better than what you've done, but as long as we're making an improvement, would be nice to compare these options.

@dbwiddis do you just want to explore updating the implementation here to use Files.lines(file) to implement FileUtil.readLines instead of the imperative implementation currently in the PR? Or did you have something more ambitious in mind WRT the API exposed by FileUtil?

I wrote a simple JMH benchmark (with the gc profiler enabled) for the following variants; a sketch of the harness is shown after the list:

  1. The old implementation (Files.readAllLines)
  2. The implementation in this PR (sized ArrayList and imperative iteration)
  3. Files.lines(file).limit(count).collect(Collectors.toList())
  4. Files.lines(file).limit(count).collect(Collectors.toCollection(() -> new ArrayList<>(count)))
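
A minimal sketch of what such a harness could look like under JMH (variant 4 differs from variant 3 only in the collector; the benchmark bodies and the COUNT constant are assumptions, not the actual benchmark code):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.TimeUnit;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    // Run with the gc profiler, e.g. `java -jar benchmarks.jar OshiBenchmark -prof gc`
    @State(Scope.Benchmark)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public class OshiBenchmark {

        private static final String PROC_STAT = "/proc/stat";
        private static final int COUNT = 1;

        // 1. Old implementation: read every line of the file
        @Benchmark
        public List<String> readFile() throws IOException {
            return Files.readAllLines(Paths.get(PROC_STAT));
        }

        // 2. PR implementation: sized ArrayList, imperative iteration
        @Benchmark
        public List<String> readLinesImperative() throws IOException {
            List<String> lines = new ArrayList<>(COUNT);
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(PROC_STAT))) {
                String line;
                while (lines.size() < COUNT && (line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
            return lines;
        }

        // 3. Stream limited to COUNT lines, collected to a List
        @Benchmark
        public List<String> readLinesStreamToList() throws IOException {
            try (Stream<String> stream = Files.lines(Paths.get(PROC_STAT))) {
                return stream.limit(COUNT).collect(Collectors.toList());
            }
        }
    }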

Benchmark results:

Benchmark                                                     Mode  Cnt      Score     Error   Units
OshiBenchmark.readFile                                        avgt    5     13.793 ±   0.759   us/op
OshiBenchmark.readFile:gc.alloc.rate                          avgt    5   2263.138 ± 123.886  MB/sec
OshiBenchmark.readFile:gc.alloc.rate.norm                     avgt    5  32728.019 ±   0.001    B/op
OshiBenchmark.readFile:gc.count                               avgt    5    157.000            counts
OshiBenchmark.readFile:gc.time                                avgt    5     91.000                ms
OshiBenchmark.readLinesImperative                             avgt    5     11.597 ±   0.350   us/op
OshiBenchmark.readLinesImperative:gc.alloc.rate               avgt    5   2109.155 ±  63.629  MB/sec
OshiBenchmark.readLinesImperative:gc.alloc.rate.norm          avgt    5  25648.016 ±   0.001    B/op
OshiBenchmark.readLinesImperative:gc.count                    avgt    5    147.000            counts
OshiBenchmark.readLinesImperative:gc.time                     avgt    5     91.000                ms
OshiBenchmark.readLinesStreamToCollection                     avgt    5     13.041 ±   0.431   us/op
OshiBenchmark.readLinesStreamToCollection:gc.alloc.rate       avgt    5   1933.009 ±  63.508  MB/sec
OshiBenchmark.readLinesStreamToCollection:gc.alloc.rate.norm  avgt    5  26432.018 ±   0.001    B/op
OshiBenchmark.readLinesStreamToCollection:gc.count            avgt    5    135.000            counts
OshiBenchmark.readLinesStreamToCollection:gc.time             avgt    5     88.000                ms
OshiBenchmark.readLinesStreamToList                           avgt    5     12.887 ±   0.166   us/op
OshiBenchmark.readLinesStreamToList:gc.alloc.rate             avgt    5   1959.552 ±  25.230  MB/sec
OshiBenchmark.readLinesStreamToList:gc.alloc.rate.norm        avgt    5  26480.018 ±   0.001    B/op
OshiBenchmark.readLinesStreamToList:gc.count                  avgt    5    136.000            counts
OshiBenchmark.readLinesStreamToList:gc.time                   avgt    5     84.000                ms

So the implementation in the PR executes in the least time and allocates the least amount of memory (although the fact that 25k is allocated is definitely something I want to attack in another PR ... probably due to the default 8k buffer size used by the JRE). I'm happy to change it to use Files.lines(file).limit(count).collect(Collectors.toList()), since it's a pretty tiny amount of extra memory, but I think I will have to handle UncheckedIOException and it may be difficult to tune the underlying buffer size later.

But perhaps you actually were wanting to explore returning Stream instead of a List so that the caller can control how much data is processed?

One thing to consider if we return a Stream<String> is that the caller will be responsible for closing it (to avoid leaking file handles/channels until the Stream is GC'd). This also means that the caller may have to deal with UncheckedIOException that can be thrown by the underlying FileChannelLinesSpliterator when a terminal operator is invoked on the Stream (and also potentially when the Stream is closed).

For example:

	// Caller has to close this `Stream` (e.g., using the try-with-resources pattern)
	public static Stream<String> readLines(final String filename, final boolean reportError) {
		Path file = Paths.get(filename);
		if (Files.isReadable(file)) {
			if (LOG.isDebugEnabled()) {
				LOG.debug(READING_LOG, filename);
			}
			try {
				return Files.lines(file);
			} catch (IOException | UncheckedIOException e) {
				// Does this exception handling make sense if the caller has to do it too on close/terminal op?
				if (reportError) {
					LOG.error("Error reading file {}. {}", filename, e.getMessage());
				} else {
					LOG.debug("Error reading file {}. {}", filename, e.getMessage());
				}
			}
		} else if (reportError) {
			LOG.warn("File not found or not readable: {}", filename);
		}
		return Stream.empty();
	}
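
And a hypothetical call site for that Stream-returning variant, showing the try-with-resources and UncheckedIOException handling the caller would take on (FileUtil.readLines here refers to the sketch above, not an existing method):

    import java.io.UncheckedIOException;
    import java.util.stream.Stream;

    final class CallerSketch {
        static String firstLine(String filename) {
            // Caller owns the stream: try-with-resources closes the underlying file channel
            try (Stream<String> lines = FileUtil.readLines(filename, true)) {
                // A terminal operation can surface an IOException wrapped in UncheckedIOException
                return lines.findFirst().orElse("");
            } catch (UncheckedIOException e) {
                return "";
            }
        }
    }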

@dbwiddis (Member) commented:

So the implementation in the PR executes in the least time and allocates the least amount of memory

Thanks for doing the benchmarks! I'm totally fine with merging this PR as-is for this benefit. Goes to prove the "write dumb code" theory that the JIT compiler knows just what to do with this "low level" code.

My thought on the stream APIs is to just get rid of this util class altogether; it dates from a time before those APIs existed (pretty sure readAllLines() replaced something else). But rather than Files.lines(file).limit(count).collect(Collectors.toList()), we'd replace individual "just read the first line" implementations with something like Files.lines(file).limit(count).forEach(s -> doTheThingWith(s)), or some util abstraction that we can pass a function to.
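
A minimal sketch of what such a "pass a function" abstraction could look like (forEachLine and the parseCpuTicks usage below are illustrative assumptions):

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.function.Consumer;
    import java.util.stream.Stream;

    final class LineConsumerSketch {
        // The utility owns opening and closing the stream; the caller supplies what to do
        // with each of the first `count` lines.
        static void forEachLine(String filename, long count, Consumer<String> action) {
            try (Stream<String> lines = Files.lines(Paths.get(filename))) {
                lines.limit(count).forEach(action);
            } catch (IOException | UncheckedIOException e) {
                // Log and fall through, mirroring the existing utility's error handling
            }
        }

        // Hypothetical usage: forEachLine("/proc/stat", 1, line -> parseCpuTicks(line));
    }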

Again, not really worth it unless it gains us benefit.

Can you add a change log entry and do a quick mvn spotless:apply? I'll merge this as-is and we can discuss any other improvements. I really do appreciate you helping improve performance here.

@chrisribble (Contributor, Author) commented Mar 20, 2024

So the implementation in the PR executes in the least time and allocates the least amount of memory

Thanks for doing the benchmarks! I'm totally fine with merging this PR as-is for this benefit. Goes to prove the "write dumb code" theory that the JIT compiler knows just what to do with this "low level" code.

My thought on the stream APIs is to just get rid of this util class altogether; it dates from a time before those APIs existed (pretty sure readAllLines() replaced something else). But rather than Files.lines(file).limit(count).collect(Collectors.toList()), we'd replace individual "just read the first line" implementations with something like Files.lines(file).limit(count).forEach(s -> doTheThingWith(s)), or some util abstraction that we can pass a function to.

Again, not really worth it unless it gains us benefit.

Can you add a change log entry and do a quick mvn spotless:apply? I'll merge this as-is and we can discuss any other improvements. I really do appreciate you helping improve performance here.

Sounds good. I agree that eliminating the utility class is probably better in the long run.

I added the changelog entry (hopefully I did it right) and checked in the results of mvn spotless:apply.

Btw, I also experimented with specifying a buffer size of 100 bytes, and that yields another nice speedup and reduces memory allocation rates by quite a lot (down to just over 9 kB per execution); a sketch of that variant follows the results:

Benchmark                                                       Mode  Cnt      Score    Error   Units
OshiBenchmark.readFile                                          avgt    5     13.712 ±  0.205   us/op
OshiBenchmark.readFile:gc.alloc.rate                            avgt    5   2276.259 ± 34.230  MB/sec
OshiBenchmark.readFile:gc.alloc.rate.norm                       avgt    5  32728.019 ±  0.001    B/op
OshiBenchmark.readFile:gc.count                                 avgt    5    158.000           counts
OshiBenchmark.readFile:gc.time                                  avgt    5    105.000               ms
OshiBenchmark.readLinesImperative                               avgt    5     11.581 ±  0.295   us/op
OshiBenchmark.readLinesImperative:gc.alloc.rate                 avgt    5   2108.791 ± 53.969  MB/sec
OshiBenchmark.readLinesImperative:gc.alloc.rate.norm            avgt    5  25608.016 ±  0.001    B/op
OshiBenchmark.readLinesImperative:gc.count                      avgt    5    147.000           counts
OshiBenchmark.readLinesImperative:gc.time                       avgt    5     96.000               ms
OshiBenchmark.readLinesImperativeBufferSize                     avgt    5      9.923 ±  0.482   us/op
OshiBenchmark.readLinesImperativeBufferSize:gc.alloc.rate       avgt    5    905.831 ± 44.118  MB/sec
OshiBenchmark.readLinesImperativeBufferSize:gc.alloc.rate.norm  avgt    5   9424.014 ±  0.001    B/op
OshiBenchmark.readLinesImperativeBufferSize:gc.count            avgt    5     63.000           counts
OshiBenchmark.readLinesImperativeBufferSize:gc.time             avgt    5     44.000               ms
OshiBenchmark.readLinesStreamToCollection                       avgt    5     13.072 ±  0.129   us/op
OshiBenchmark.readLinesStreamToCollection:gc.alloc.rate         avgt    5   1928.262 ± 18.985  MB/sec
OshiBenchmark.readLinesStreamToCollection:gc.alloc.rate.norm    avgt    5  26432.018 ±  0.001    B/op
OshiBenchmark.readLinesStreamToCollection:gc.count              avgt    5    135.000           counts
OshiBenchmark.readLinesStreamToCollection:gc.time               avgt    5     86.000               ms
OshiBenchmark.readLinesStreamToList                             avgt    5     12.868 ±  0.091   us/op
OshiBenchmark.readLinesStreamToList:gc.alloc.rate               avgt    5   1962.367 ± 13.867  MB/sec
OshiBenchmark.readLinesStreamToList:gc.alloc.rate.norm          avgt    5  26480.018 ±  0.001    B/op
OshiBenchmark.readLinesStreamToList:gc.count                    avgt    5    136.000           counts
OshiBenchmark.readLinesStreamToList:gc.time                     avgt    5     91.000               ms
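
A sketch of the buffered variant benchmarked above, assuming the smaller buffer was passed to the BufferedReader constructor (the exact mechanism used in the experiment isn't shown, so this is an illustration):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    final class SmallBufferSketch {
        // Same imperative loop, but with an explicit (small) char buffer instead of the
        // BufferedReader default of 8K; note the InputStreamReader's decoder still keeps
        // its own internal byte buffer.
        static List<String> readLines(String filename, int count, int bufferSize) throws IOException {
            List<String> lines = new ArrayList<>(count);
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(Files.newInputStream(Paths.get(filename)), StandardCharsets.UTF_8),
                    bufferSize)) {
                String line;
                while (lines.size() < count && (line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
            return lines;
        }
    }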

I think that just makes the API more obtuse, though, since the caller (the place where you know roughly how much data you're going to read) would need to specify the buffer size.

I'm happy to push those changes in a separate PR where we can discuss them or maybe think some more about a better way to expose "read M bytes from N lines in file foo".

Or maybe it makes sense to think about picking a default buffer size that aligns better with the average size of the files that Oshi is reading? The JRE default of 8 kB is probably quite large for most things in /proc.

@dbwiddis merged commit 09d3101 into oshi:master on Mar 20, 2024 (14 checks passed)
@dbwiddis (Member) commented:

I'm happy to push those changes in a separate PR where we can discuss them or maybe think some more about a better way to expose "read M bytes from N lines in file foo".

How about opening an issue to discuss, so it's easier to search for later?

Also take a look at the Auxv class (which reads /proc/auxv as binary). Seems there's something similar we could do with "read until we hit n newlines" rather than a fixed number of bytes.
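
A rough, byte-level sketch of the "read until we hit n newlines" idea (ASCII-only, purely illustrative):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    final class ReadUntilNewlinesSketch {
        // Read raw bytes until `n` newline bytes have been seen (or EOF); the byte-to-char
        // cast is fine for the ASCII content of /proc files.
        static String readUntilNewlines(String filename, int n, int bufferSize) throws IOException {
            byte[] buf = new byte[bufferSize];
            StringBuilder sb = new StringBuilder(bufferSize);
            int newlines = 0;
            try (InputStream in = Files.newInputStream(Paths.get(filename))) {
                int read;
                while (newlines < n && (read = in.read(buf)) != -1) {
                    for (int i = 0; i < read && newlines < n; i++) {
                        if (buf[i] == '\n') {
                            newlines++;
                        }
                        sb.append((char) buf[i]);
                    }
                }
            }
            return sb.toString();
        }
    }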

@dbwiddis (Member) commented:

One other idea: why can't we just go old school and use Scanner? It has a 1 KB buffer rather than 8 KB, and we could just read the values into a long[] while we scan.
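
A quick sketch of that Scanner idea (parsing only the first "cpu" line, with the tick count of 8 as an assumption for illustration):

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.util.Scanner;

    final class ScannerSketch {
        // Parse the leading "cpu" line of /proc/stat directly into a long[] while scanning,
        // with no intermediate String splits.
        static long[] readCpuTicks() throws FileNotFoundException {
            long[] ticks = new long[8]; // field count is an assumption for illustration
            try (Scanner scanner = new Scanner(new File("/proc/stat"))) {
                if (scanner.hasNext()) {
                    scanner.next(); // skip the leading "cpu" token
                }
                for (int i = 0; i < ticks.length && scanner.hasNextLong(); i++) {
                    ticks[i] = scanner.nextLong();
                }
            }
            return ticks;
        }
    }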
