[Core] Improve hash collision avoidance in prefix caching #12621
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀 |
LGTM
Maybe this is still an issue with other hashers, but is there a reason why we don't use blake3 for hashing in the text case? It is currently what we use in the multimodal case for performance reasons AFAIK https://github.com/vllm-project/vllm/blob/e3f7ff65e7a6c08cd354f7f333bce543a4f0607e/vllm/multimodal/hasher.py |
AFAIK it's just because |
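For reference, a minimal sketch of blake3-based content hashing in the spirit of the linked multimodal hasher; the helper name `content_hash` is hypothetical and the third-party `blake3` package is assumed to be installed, so this is not the actual vLLM API:

```python
# Hypothetical sketch of blake3-based content hashing, in the spirit of
# vllm/multimodal/hasher.py (not the actual vLLM API). Requires the
# third-party blake3 package.
import pickle

import blake3


def content_hash(*items) -> str:
    """Digest arbitrary picklable items into a stable hex string."""
    hasher = blake3.blake3()
    for item in items:
        hasher.update(pickle.dumps(item))
    return hasher.hexdigest()


# Example: hash a parent digest together with a block of token IDs.
parent = content_hash("root")
print(content_hash(parent, (101, 2023, 2003, 1037)))
```

A cryptographic digest like this avoids predictable collisions entirely, at the cost of extra compute and a larger key footprint compared to Python's built-in `hash()`.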
This pull request has merge conflicts that must be resolved before it can be merged. |
Head branch was pushed to by a user without write access
Force-pushed from 3dc5cf0 to f0efd13
This pull request has merge conflicts that must be resolved before it can be merged. |
Force-pushed from 65443ba to 98b121a
Prefix caching makes use of Python's built-in `hash()` function. As of Python 3.12, the behavior of `hash(None)` has changed to be a predictable constant value. This makes it more feasible that someone could try to exploit hash collisions.

The impact of a collision would be using a cache entry that was generated from different content. Given knowledge of the prompts in use and predictable hashing behavior, someone could intentionally populate the cache using a prompt known to collide with another prompt in use. There doesn't seem to be much value to an attacker in doing this, but it's certainly not ideal, and it could interfere with the accuracy of results for another user.

The invasiveness of this fix should be weighed against the severity of the issue to determine whether it is worth fixing.

Using a hashing algorithm that is less prone to collisions (sha256, for example) would be the best way to avoid the possibility of a collision. However, it would have an impact on both performance and memory footprint.

An alternative is to continue to use `hash()`, but make it much more difficult to predict the hash value. What we want is for the starting hash value to be randomized, which is the behavior we got prior to Python 3.12. An easy fix is to use a string. Here we use `'None'` to still make it clear we're starting from nothing, but with a string we get a different hash value each time vLLM runs. Note that within a given run, the value will remain the same. This restores the safer hashing behavior from before.

Thank you very much to @kexinoh for reporting this concern privately so that it could be evaluated for its severity prior to our decision to fix this as a security enhancement.

The commit that changed this behavior for Python 3.12 is here:
- python/cpython@432117c
- python/cpython#99541

Signed-off-by: Russell Bryant <rbryant@redhat.com>
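For illustration, a minimal sketch of the chained block-hashing idea described above; the names `NONE_HASH` and `block_hash` are simplified and hypothetical, not the actual vLLM implementation:

```python
# Minimal sketch of chained prefix-cache block hashing (hypothetical names,
# not the actual vLLM code). The first block has no parent, so it needs a
# seed value; hashing the string 'None' instead of the None object keeps
# that seed randomized per process on Python 3.12+, because string hashing
# is randomized while hash(None) is now a fixed constant.
from typing import Optional, Tuple

NONE_HASH = hash("None")  # changes on every run; stable within a single run


def block_hash(parent_hash: Optional[int], token_ids: Tuple[int, ...]) -> int:
    """Hash one block of token IDs, chained to the parent block's hash."""
    prev = NONE_HASH if parent_hash is None else parent_hash
    return hash((prev, token_ids))


# Example: hash two consecutive blocks of a tokenized prompt.
h0 = block_hash(None, (101, 2023, 2003, 1037))
h1 = block_hash(h0, (7099, 1997, 19204, 2015))
print(h0, h1)
```

The only behavioral change relative to seeding the chain with `hash(None)` is that the seed now differs between vLLM processes, so colliding prompts cannot be precomputed offline.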
Force-pushed from 98b121a to a032ee8
Does this change what's noted in the design doc for v1?
|
No, this PR mainly deals with the None case. |
I think that's still accurate. Collisions are still possible, but what I was trying to avoid here is making it feasible to predict those collisions because of predictable hashing behavior. |
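To make the predictability point concrete, a small demonstration (assuming CPython 3.12+ with hash randomization enabled): run the script below twice as separate processes and compare the output.

```python
# On Python 3.12+, hash(None) is a fixed constant, so it is identical across
# runs; hash('None') goes through randomized string hashing, so it differs
# per process unless PYTHONHASHSEED is pinned.
print("hash(None)   =", hash(None))
print("hash('None') =", hash("None"))
```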
Thanks for this PR @russellb . Perhaps if you have time for a semi-related question... We observed a fairly weird effect during some intense generation, and perhaps it is related to this PR. The task involved generating summaries over an extensive batch, with both prefix caching (the system prompts are fairly extensive) and n-gram speculative decoding enabled. We ended up with "mixed summaries", e.g. a phrase like "London is the capital of" got replaced with "London is my favourite" (both phrases existed in the input batch, but at different batch indices). I first thought that something had gone wrong with the speculative worker, but I now suspect it could have been the prefix cache. Would you think the same? Apologies for the scarce details; unfortunately I am not yet sure I can reproduce the exact circumstances of that generation experiment. |
I am the discoverer of this problem, and we constructed exactly the phenomenon you describe. I should also add that hash(None) was not really a random value before Python 3.12 either, but rather a memory address value (which makes it less random than it appears). Depending on the Python VM, the value for None may even be set to 0.
|
Can you clarify if this was observed prior to this PR going in, or after? I want to make sure I didn't cause a regression. If it was before, it's possible that you experienced a hash collision. Prior to this PR, using Python 3.12, that collision would be easily reproducible. After this PR, it would not. The conditions for a collision should be different every time vllm is run. That doesn't remove the possibility for collisions, though. That would take more work. It's very interesting to hear that you may have observed this without going after it intentionally! |
@russellb the issue happened with a Docker build of vLLM (0.6.5, and I believe all recent images run on Python 3.12), so it was observed before this PR. I have not done the tests with the nightly build/Docker built from source yet. I have never seen such things happen outside Docker in my Python 3.11 venv. I am not sure I can go into significant detail about the setup where the bug was observed, but here are some elements:
I would add that in my experience AWQ/Marlin-powered models have a rare tendency to produce wrong answers that consist of repeating strings, up to max tokens (I hope to find time to document this issue at some point; it is not easy to reproduce). From what you say it should not be related at all, but I thought I'd mention it as a known issue with the setup; the latter is overall prone to some numerical instability even without hash collisions. |
@russellb Do we need a CVE for it and have you requested one already? |
A CVE was assigned and is reflected here: GHSA-rm76-4mrf-v9r8 |