Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: stream NVD data via Jackson to reduce memory footprint #6275

Merged
merged 1 commit into from Dec 14, 2023

Conversation

philippn
Copy link

Fixes Issue

This is a follow-up of #6196. One of the suggestions made in that thread was to switch to using JSON streaming using Jackson. I have implemented this in this Pull Request.

The results are quite impressive, the initial download and initialization now succeeds with Xmx512m, which was previously not possible.

Description of Change

Introduced new CveItemSource abstraction and using it in NvdApiProcessor instead of reading from the JSON directly. The implementations of CveItemSource take care of the JSON streaming internally using Jackson Streaming API.

Have test cases been added to cover the new functionality?

no, as there is not really any new functionalitity. But can provide one if necessary.

@boring-cyborg boring-cyborg bot added the core changes to core label Dec 11, 2023
@phosphoreslight
Copy link

Great work @philippn ! Sounds very promising. Hope we can get this merged :) Thank you!

@jeremylong
Copy link
Owner

@jeremylong
Copy link
Owner

My previous PR, #6270, will drastically reduce memory usage.

@chadlwilson
Copy link
Contributor

This change appears to build upon the previous changes in #6270 more than anything.

No idea whether it's a significant enough improvement to warrant the extra complexity unless it was compared with the unreleased version of the code before this change, however conceptually it still seems like it could have advantages to try and process entries one-by-one as they are read from the files produced by #6270 rather than them being bulk read from files into a big collection and then the collection separately iterated through. But depends how large those collections/files are, and how much garbage is produced in the process.

@philippn
Copy link
Author

@jeremylong Thanks for your feedback, I appreciate it.

This change aims primarily at improving the situation when using DependencyCheck with the nvdDatafeedUrl switch, because from my experience this case seems to be primarily plagued by the huge memory demand. When using DependencyCheck without that cache, the calls to the upstream API are already paged (thanks for that!) and the memory demands are quite modest due to that.

As for #6196 , your fix definitely improves the situation (again, thanks for that). However, due to the way that the JSON is processed, it still loads the whole document into the memory right now. That is roughly 150 megs per year at the moment multiplied by the amount of concurrent threads (which reproducibly produces OOMs with limited memory).

If you believe this to be better suited for the nvd lib, I would be happy to supply a patch there too or you are free to use the code wherever/however you want.

Many thanks for your work and kind regards,
Philipp

@jeremylong
Copy link
Owner

This PR is somehow causing a regression with the maven integration tests.

@philippn philippn force-pushed the feature/stream-json branch 4 times, most recently from 3c34464 to 657d013 Compare December 13, 2023 23:34
@philippn
Copy link
Author

The test failures were caused by a bug in the initialization of the streaming, i.e. the first cve entry was always skipped. That should be fixed now.

@jeremylong jeremylong added this to the 9.0.6 milestone Dec 14, 2023
@jeremylong
Copy link
Owner

Thanks for the PR. I'll likely also look at the number of threads being spun up.

@jeremylong jeremylong merged commit a747de7 into jeremylong:main Dec 14, 2023
6 checks passed
@knalli
Copy link
Contributor

knalli commented Dec 18, 2023

Hi, please say if you may want a new issue for this rather than this follow-up here. I just noticed the update in the changelog specifically the JSON streaming, and had a look for interest's sake.

Although the latest version has already got some improvements, I see here a hidden memory leak option in case of broken input.

Previously, the I/O handling (managing streams, file handling) itself was handled in NvdApiProcessor (IMHO the right place) which also handled the auto-closing on errors. With this change, the stream creation and parsing have moved both into the source implementation classes. If the readItem() methods fail (unexpected), the source constructor will not be succeeded, which means no instance at all, which means no auto-close with try-catch and therefore the created input streams will not be closed anymore. Also the file will not be deleted, however that may be fine.

I would rather go with making a more universal constructor i.e.CveApiJson20CveItemSource(Inputstream), alternatively guarded with an additional try-catch.

@philippn
Copy link
Author

@knalli Your suggestion is reasonable, I would indeed recommend to open a new issue though (and/or perhaps a pull request). I could also have a go at it, during the holidays if necessary 🙂

@knalli
Copy link
Contributor

knalli commented Dec 18, 2023

-> #6326

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core changes to core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants