
[Core] Improve hash collision avoidance in prefix caching #12621

Merged
merged 1 commit into vllm-project:main on Feb 4, 2025

Conversation

russellb
Member

Prefix caching makes use of Python's built-in `hash()` function. As of
Python 3.12, the behavior of `hash(None)` has changed to be a
predictable constant value. This makes it more feasible that someone
could try to exploit hash collisions.
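A quick way to see the difference (a standalone snippet, not vLLM code) is to run the following twice and compare the output between runs:

```python
# On CPython 3.12+, hash(None) is a fixed constant, so the first line prints
# the same value on every run. String hashing is randomized per process via
# PYTHONHASHSEED, so the second line typically changes from run to run.
print("hash(None)  :", hash(None))
print("hash('None'):", hash("None"))
```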

The impact of a collision would be reuse of a cache entry that was generated
from different content. Given knowledge of the prompts in use and predictable
hashing behavior, someone could intentionally populate the cache using a prompt
known to collide with another prompt in use. There doesn't seem to be much
value to an attacker in doing this, but it's certainly not ideal, and it could
interfere with the accuracy of results for another user.

The invasiveness of this fix should be weighed against the severity of the
issue to determine whether this is worth fixing.

Using a hashing algorithm that is less prone to collisions (sha256, for
example) would be the best way to avoid the possibility of a collision.
However, it would have an impact on both performance and memory footprint.
An alternative is to continue to use `hash()`, but make it much more difficult
to predict the hash value.
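For a sense of what the sha256 route would look like, here is a minimal sketch (illustrative names only, not vLLM code); the extra CPU work and the 32-byte digests are the overhead referred to above:

```python
import hashlib
import pickle
from typing import Optional, Tuple

def sha256_block_hash(parent_hash: Optional[bytes],
                      token_ids: Tuple[int, ...]) -> bytes:
    # Collision-resistant chaining: each block's digest folds in the parent
    # block's digest plus the block's token ids.
    h = hashlib.sha256()
    h.update(parent_hash if parent_hash is not None else b"None")
    h.update(pickle.dumps(token_ids))
    return h.digest()
```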

What we want is for the starting hash value to be randomized, which is the
behavior we got here prior to Python 3.12. An easy fix is to use a
string. Here we use `'None'` to still make it clear we're starting from
nothing, but with a string we'll get a different hash value each time
vLLM runs. Note that within a given run, the value will remain the same.
This restores the safer hashing behavior from before.
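As a rough illustration of the idea (function and variable names here are simplified placeholders, not vLLM's actual identifiers):

```python
from typing import Optional, Tuple

def hash_block_tokens(parent_hash: Optional[int],
                      token_ids: Tuple[int, ...]) -> int:
    # Seed the chain with the string "None" instead of the None object:
    # string hashes are randomized per process (PYTHONHASHSEED), so the
    # starting point differs on every vLLM run while staying stable within
    # a single run. hash((None, ...)) on Python 3.12+ would be fully
    # predictable.
    seed = parent_hash if parent_hash is not None else "None"
    return hash((seed, token_ids))
```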

Thank you very much to @kexinoh for reporting this concern privately so
that it could be evaluated for its severity prior to our decision to fix
this as a security enhancement.

The commit that changed this behavior for Python 3.12 is here:

- python/cpython@432117c
- python/cpython#99541

Signed-off-by: Russell Bryant <rbryant@redhat.com>


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which starts with a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

Collaborator

@comaniac comaniac left a comment


LGTM

@comaniac comaniac added the ready label Jan 31, 2025
@comaniac comaniac enabled auto-merge (squash) January 31, 2025 17:25
@mgoin
Member

mgoin commented Jan 31, 2025

Maybe this is still an issue with other hashers, but is there a reason why we don't use blake3 for hashing in the text case? It is currently what we use in the multimodal case for performance reasons AFAIK https://github.com/vllm-project/vllm/blob/e3f7ff65e7a6c08cd354f7f333bce543a4f0607e/vllm/multimodal/hasher.py

@comaniac
Collaborator

Maybe this is still an issue with other hashers, but is there a reason why we don't use blake3 for hashing in the text case? It is currently what we use in the multimodal case for performance reasons AFAIK https://github.com/vllm-project/vllm/blob/e3f7ff65e7a6c08cd354f7f333bce543a4f0607e/vllm/multimodal/hasher.py

AFAIK it's just because hash has decent performance for short texts, but yeah we could benchmark blake3 in this scenario and see if we should use it here too.
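If someone wants to try that comparison, a rough micro-benchmark might look like the sketch below (an assumed setup, not an existing vLLM benchmark; blake3 here is the third-party blake3 package):

```python
import pickle
import time

import blake3  # third-party package: pip install blake3

token_ids = tuple(range(256))  # roughly one block's worth of token ids

def with_builtin_hash():
    # Python's built-in hash of a tuple of ints.
    return hash(token_ids)

def with_blake3():
    # blake3 over a serialized form of the same token ids.
    return blake3.blake3(pickle.dumps(token_ids)).digest()

for fn in (with_builtin_hash, with_blake3):
    start = time.perf_counter()
    for _ in range(100_000):
        fn()
    print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s for 100k calls")
```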


mergify bot commented Feb 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @russellb.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

auto-merge was automatically disabled February 3, 2025 15:31

Head branch was pushed to by a user without write access

@mergify mergify bot removed the needs-rebase label Feb 3, 2025

mergify bot commented Feb 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @russellb.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@comaniac comaniac merged commit 73b35cc into vllm-project:main Feb 4, 2025
46 checks passed
@markmc
Collaborator

markmc commented Feb 4, 2025

Does this change what's noted in the design doc for v1 ?

Note 2: The above hash key structure is not 100% collision free. Theoretically it’s still possible for the different prefix tokens to have the same hash value, but this should be nearly impossible to happen. Of course, contributions are welcome if you have an awesome idea to eliminate collision entirely.

@comaniac
Collaborator

comaniac commented Feb 4, 2025

Does this change what's noted in the design doc for v1 ?

No, this PR mainly deals with the None case.

@russellb
Member Author

russellb commented Feb 4, 2025

Does this change what's noted in the design doc for v1 ?

Note 2: The above hash key structure is not 100% collision free. Theoretically it’s still possible for the different prefix tokens to have the same hash value, but this should be nearly impossible to happen. Of course, contributions are welcome if you have an awesome idea to eliminate collision entirely.

I think that's still accurate. Collisions are still possible, but what I was trying to avoid here is making it feasible to predict those collisions because of predictable hashing behavior.

@nFunctor
Contributor

nFunctor commented Feb 4, 2025

Thanks for this PR, @russellb. Perhaps if you have time for a semi-related question...

We observed a fairly weird effect during some intense generation, and perhaps it is related to this PR. The task involved generating summaries over a large batch; both prefix caching (the system prompts are fairly extensive) and n-gram speculative decoding were on. We ended up with "mixed summaries", e.g. a phrase like "London is the capital of" got replaced with "London is my favourite" (both phrases existed in the input batch, but for different batch indices).

I first thought that something went wrong with the speculative worker, but I now start to think that perhaps it could have been due to the prefix cache? Would you think the same? Apologies for the scarce details; unfortunately, I am not yet sure if I can reproduce the exact circumstances of that generation experiment.

@kexinoh

kexinoh commented Feb 4, 2025 via email

@russellb
Member Author

russellb commented Feb 5, 2025

Thanks for this PR, @russellb. Perhaps if you have time for a semi-related question...

We observed a fairly weird effect during some intense generation, and perhaps it is related to this PR. The task involved generating summaries over a large batch; both prefix caching (the system prompts are fairly extensive) and n-gram speculative decoding were on. We ended up with "mixed summaries", e.g. a phrase like "London is the capital of" got replaced with "London is my favourite" (both phrases existed in the input batch, but for different batch indices).

I first thought that something went wrong with the speculative worker, but I now start to think that perhaps it could have been due to the prefix cache? Would you think the same? Apologies for the scarce details; unfortunately, I am not yet sure if I can reproduce the exact circumstances of that generation experiment.

Can you clarify if this was observed prior to this PR going in, or after? I want to make sure I didn't cause a regression.

If it was before, it's possible that you experienced a hash collision. Prior to this PR, using Python 3.12, that collision would be easily reproducible. After this PR, it would not be. The conditions for a collision should be different every time vLLM is run.

That doesn't remove the possibility of collisions, though. That would take more work. It's very interesting to hear that you may have observed this without going after it intentionally!

@nFunctor
Contributor

nFunctor commented Feb 6, 2025

@russellb the issue happened with a Docker build of vLLM (0.6.5; I believe all recent images run on Python 3.12), so it was observed before this PR. I have not done the tests with the nightly build / Docker from source yet. I have never seen such things happen outside Docker in my Python 3.11 venv.

I am not sure I can go into significant detail about the setup where the bug was observed, but here are some elements:

  • AWQ checkpoint of Llama 3.1 70B instruct running lots of requests. The GPU KV cache is often near 100%. The server is queried by a collection of workers whose load can vary.
  • Prefix caching, chunked prefill and n-gram speculative decoding on.
  • The issue involved multiple entries in queue getting confused (content switch) at a common word combo ("London is").

I would add that, in my experience, AWQ/Marlin-powered models have a rare tendency to produce wrong answers consisting of repeating strings, up to max tokens (I hope to find time to document this issue at some point; it is not easy to reproduce). From what you say it should not be related at all, but I thought I'd mention it as a known issue with the setup; the latter is overall prone to some numerical instability even without hash collisions.

@ahanwate

ahanwate commented Feb 7, 2025

@russellb Do we need a CVE for it and have you requested one already?

@russellb
Member Author

russellb commented Feb 7, 2025

@russellb Do we need a CVE for it and have you requested one already?

A CVE was assigned and is reflected here: GHSA-rm76-4mrf-v9r8

Labels: ready, v1
7 participants