
[Core] Improve hash collision avoidance in prefix caching #12621

Merged
merged 1 commit into vllm-project:main on Feb 4, 2025

Conversation

russellb
Member

Prefix caching makes use of Python's built-in `hash()` function. As of
Python 3.12, the behavior of `hash(None)` has changed to be a
predictable constant value. This makes it more feasible that someone
could try to exploit hash collisions.
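A quick way to see the difference (a standalone snippet, not vLLM code) is to run the following twice and compare the output between runs:

```python
# On CPython 3.12+, hash(None) is a fixed constant, so the first line prints
# the same value on every run. String hashing is randomized per process via
# PYTHONHASHSEED, so the second line typically changes from run to run.
print("hash(None)  :", hash(None))
print("hash('None'):", hash("None"))
```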

The impact of a collision would be reuse of a cache entry that was generated
from different content. Given knowledge of the prompts in use and predictable
hashing behavior, someone could intentionally populate the cache using a prompt
known to collide with another prompt in use. There doesn't seem to be much
value to an attacker in doing this, but it's certainly not ideal, and it could
interfere with the accuracy of results for another user.

The invasiveness of this fix should be weighed against the severity of the
issue to determine whether this is worth fixing.

Using a hashing algorithm that is less prone to collisions (sha256, for
example) would be the best way to avoid the possibility of a collision.
However, it would have an impact on both performance and memory footprint.
An alternative is to continue to use `hash()`, but make it much more difficult
to predict the hash value.
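For a sense of what the sha256 route would look like, here is a minimal sketch (illustrative names only, not vLLM code); the extra CPU work and the 32-byte digests are the overhead referred to above:

```python
import hashlib
import pickle
from typing import Optional, Tuple

def sha256_block_hash(parent_hash: Optional[bytes],
                      token_ids: Tuple[int, ...]) -> bytes:
    # Collision-resistant chaining: each block's digest folds in the parent
    # block's digest plus the block's token ids.
    h = hashlib.sha256()
    h.update(parent_hash if parent_hash is not None else b"None")
    h.update(pickle.dumps(token_ids))
    return h.digest()
```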

What we want is for the starting hash value to be randomized, which is the
behavior we got here prior to Python 3.12. An easy fix is to use a
string. Here we use `'None'` to still make it clear we're starting from
nothing, but with a string we'll get a different hash value each time
vLLM runs. Note that within a given run, the value will remain the same.
This restores the safer hashing behavior from before.
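As a rough illustration of the idea (function and variable names here are simplified placeholders, not vLLM's actual identifiers):

```python
from typing import Optional, Tuple

def hash_block_tokens(parent_hash: Optional[int],
                      token_ids: Tuple[int, ...]) -> int:
    # Seed the chain with the string "None" instead of the None object:
    # string hashes are randomized per process (PYTHONHASHSEED), so the
    # starting point differs on every vLLM run while staying stable within
    # a single run. hash((None, ...)) on Python 3.12+ would be fully
    # predictable.
    seed = parent_hash if parent_hash is not None else "None"
    return hash((seed, token_ids))
```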

Thank you very much to @kexinoh for reporting this concern privately so
that it could be evaluated for its severity prior to our decision to fix
this as a security enhancement.

The commit that changed this behavior for Python 3.12 is here:

- python/cpython@432117c
- python/cpython#99541

Signed-off-by: Russell Bryant <rbryant@redhat.com>


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which starts with a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

Collaborator

@comaniac comaniac left a comment


LGTM

@comaniac comaniac added the ready label Jan 31, 2025
@comaniac comaniac enabled auto-merge (squash) January 31, 2025 17:25
@mgoin
Member

mgoin commented Jan 31, 2025

Maybe this is still an issue with other hashers, but is there a reason why we don't use blake3 for hashing in the text case? It is currently what we use in the multimodal case for performance reasons AFAIK https://github.com/vllm-project/vllm/blob/e3f7ff65e7a6c08cd354f7f333bce543a4f0607e/vllm/multimodal/hasher.py

@comaniac
Collaborator

Maybe this is still an issue with other hashers, but is there a reason why we don't use blake3 for hashing in the text case? It is currently what we use in the multimodal case for performance reasons AFAIK https://github.com/vllm-project/vllm/blob/e3f7ff65e7a6c08cd354f7f333bce543a4f0607e/vllm/multimodal/hasher.py

AFAIK it's just because hash has decent performance for short texts, but yeah we could benchmark blake3 in this scenario and see if we should use it here too.
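If someone wants to try that comparison, a rough micro-benchmark might look like the sketch below (an assumed setup, not an existing vLLM benchmark; blake3 here is the third-party blake3 package):

```python
import pickle
import time

import blake3  # third-party package: pip install blake3

token_ids = tuple(range(256))  # roughly one block's worth of token ids

def with_builtin_hash():
    # Python's built-in hash of a tuple of ints.
    return hash(token_ids)

def with_blake3():
    # blake3 over a serialized form of the same token ids.
    return blake3.blake3(pickle.dumps(token_ids)).digest()

for fn in (with_builtin_hash, with_blake3):
    start = time.perf_counter()
    for _ in range(100_000):
        fn()
    print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s for 100k calls")
```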


mergify bot commented Feb 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @russellb.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

auto-merge was automatically disabled February 3, 2025 15:31

Head branch was pushed to by a user without write access

@mergify mergify bot removed the needs-rebase label Feb 3, 2025

mergify bot commented Feb 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @russellb.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@comaniac comaniac merged commit 73b35cc into vllm-project:main Feb 4, 2025
46 checks passed
@markmc
Collaborator

markmc commented Feb 4, 2025

Does this change what's noted in the design doc for v1 ?

Note 2: The above hash key structure is not 100% collision free. Theoretically it’s still possible for the different prefix tokens to have the same hash value, but this should be nearly impossible to happen. Of course, contributions are welcome if you have an awesome idea to eliminate collision entirely.

@comaniac
Collaborator

comaniac commented Feb 4, 2025

Does this change what's noted in the design doc for v1 ?

No, this PR mainly deals with the None case.

@russellb
Member Author

russellb commented Feb 4, 2025

Does this change what's noted in the design doc for v1 ?

Note 2: The above hash key structure is not 100% collision free. Theoretically it’s still possible for the different prefix tokens to have the same hash value, but this should be nearly impossible to happen. Of course, contributions are welcome if you have an awesome idea to eliminate collision entirely.

I think that's still accurate. Collisions are still possible, but what I was trying to avoid here is making it feasible to predict those collisions because of predictable hashing behavior.

@nFunctor
Contributor

nFunctor commented Feb 4, 2025

Thanks for this PR, @russellb. Perhaps if you have time for a semi-related question...

We observed a fairly weird effect during some intense generation, and perhaps it is related to this PR. The task involved generating summaries over a large batch; both prefix caching (the system prompts are fairly extensive) and n-gram speculative decoding were on. We ended up with "mixed summaries", e.g. a phrase like "London is the capital of" got replaced with "London is my favourite" (both phrases existed in the input batch, but for different batch indices).

I first thought that something went wrong with the speculative worker, but I now start to think that perhaps it could have been due to the prefix cache? Would you think the same? Apologies for the scarce details; unfortunately, I am not yet sure if I can reproduce the exact circumstances of that generation experiment.

@kexinoh

kexinoh commented Feb 4, 2025 via email

@russellb
Member Author

russellb commented Feb 5, 2025

Thanks for this PR, @russellb. Perhaps if you have time for a semi-related question...

We observed a fairly weird effect during some intense generation, and perhaps it is related to this PR. The task involved generating summaries over a large batch; both prefix caching (the system prompts are fairly extensive) and n-gram speculative decoding were on. We ended up with "mixed summaries", e.g. a phrase like "London is the capital of" got replaced with "London is my favourite" (both phrases existed in the input batch, but for different batch indices).

I first thought that something went wrong with the speculative worker, but I now start to think that perhaps it could have been due to the prefix cache? Would you think the same? Apologies for the scarce details; unfortunately, I am not yet sure if I can reproduce the exact circumstances of that generation experiment.

Can you clarify if this was observed prior to this PR going in, or after? I want to make sure I didn't cause a regression.

If it was before, it's possible that you experienced a hash collision. Prior to this PR, using Python 3.12, that collision would be easily reproducible. After this PR, it would not be. The conditions for a collision should be different every time vLLM is run.

That doesn't remove the possibility of collisions, though. That would take more work. It's very interesting to hear that you may have observed this without going after it intentionally!

@nFunctor
Contributor

nFunctor commented Feb 6, 2025

@russellb the issue happened with a Docker build of vLLM (0.6.5; I believe all recent images run on Python 3.12), so it was observed before this PR. I have not done the tests with the nightly build / Docker from source yet. I have never seen such things happen outside Docker in my Python 3.11 venv.

I am not sure I can go into significant detail about the setup where the bug was observed, but here are some elements:

  • AWQ checkpoint of Llama 3.1 70B instruct running lots of requests. The GPU KV cache is often near 100%. The server is queried by a collection of workers whose load can vary.
  • Prefix caching, chunked prefill and n-gram speculative decoding on.
  • The issue involved multiple entries in queue getting confused (content switch) at a common word combo ("London is").

I would add that, in my experience, AWQ/Marlin-powered models have a rare tendency to produce wrong answers consisting of repeating strings, up to max tokens (I hope to find time to document this issue at some point; it is not easy to reproduce). From what you say it should not be related at all, but I thought I'd mention it as a known issue with the setup; the latter is overall prone to some numerical instability even without hash collisions.

@ahanwate

ahanwate commented Feb 7, 2025

@russellb Do we need a CVE for it and have you requested one already?

@russellb
Member Author

russellb commented Feb 7, 2025

@russellb Do we need a CVE for it and have you requested one already?

A CVE was assigned and is reflected here: GHSA-rm76-4mrf-v9r8

Labels: ready, v1
7 participants