Support for Async Embeddings via michaelfeil/infinity #596
Comments
Hey @michaelfeil, thank you for sharing this, man. We would love to have this; are you interested in working on it?
@shahules786 Perhaps I'll just push the pure Python / async code into langchain directly, then it should be reusable, right?
Hey @michaelfeil, this would be awesome. Like you said, if you drop something into langchain, that will be the easiest for you in terms of time spent. What we would love to do is build an integration doc with infinity and showcase how fast it is and how it improves things for people using Ragas as well, hopefully driving some traffic your way. If you check this section, you'll see we embed a lot of chunks in sequence, and performance is limited by how the embedding model is served. Maybe we can do a comparison here? Would that be something you're interested in? (See ragas/src/ragas/testset/docstore.py, lines 229 to 250 at 27e1c24.)
We can do other comparisons too, but the LLM is the limiting factor for performance there, so there won't be much of a difference; the above use case would be solid for a comparison. Let me know if it's something that interests you :)
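A minimal sketch of the kind of comparison discussed above, assuming the langchain-community infinity client and an infinity server running on localhost (the model name, URL, and chunk data are illustrative, not from the ragas codebase):

```python
import time

from langchain_community.embeddings import InfinityEmbeddings

# Assumed deployment: an infinity server on localhost serving this model.
embeddings = InfinityEmbeddings(
    model="BAAI/bge-small-en-v1.5",
    infinity_api_url="http://localhost:7997",
)

chunks = [f"document chunk {i} ..." for i in range(256)]

# Sequential: one request per chunk, mirroring an embed-in-a-loop docstore.
start = time.perf_counter()
for chunk in chunks:
    embeddings.embed_query(chunk)
print(f"sequential: {time.perf_counter() - start:.2f}s")

# Batched: one request for all chunks; infinity batches them on the GPU.
start = time.perf_counter()
embeddings.embed_documents(chunks)
print(f"batched:    {time.perf_counter() - start:.2f}s")
```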
Would be interesting. FYI, I added the PR for langchain here; it took me some hours over the weekend, and I hope it gets merged soon: langchain-ai/langchain#17671. I would not recommend submitting the nodes (assuming each node has one sentence) one at a time with ThreadPoolExecutor. At a minimum, batch the requests; this will help whatever backend you use, even APIs.
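A hedged sketch of that batching advice: group nodes into batches before submitting, so each request carries many sentences instead of one thread per single-sentence node (batch size, worker count, and the server setup are assumptions for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

from langchain_community.embeddings import InfinityEmbeddings

# Assumed server URL and model; adjust to your deployment.
embeddings = InfinityEmbeddings(
    model="BAAI/bge-small-en-v1.5",
    infinity_api_url="http://localhost:7997",
)

def embed_all(nodes: list[str], batch_size: int = 32) -> list[list[float]]:
    # Group nodes into batches so each request carries many sentences,
    # instead of one ThreadPoolExecutor task per single-sentence node.
    batches = [nodes[i : i + batch_size] for i in range(0, len(nodes), batch_size)]
    vectors: list[list[float]] = []
    # A handful of concurrent *batch* requests is plenty; the server
    # does its own dynamic batching on top of this.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for batch_vectors in pool.map(embeddings.embed_documents, batches):
            vectors.extend(batch_vectors)
    return vectors
```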
FYI all, this is now finally in langchain (community; see the PR mentioned above). Also, you might be interested in https://github.com/michaelfeil/infinity/blob/1fe3a34e295c95fc4a75297de842ec55c6761457/docs/benchmarks/benchmarking.md for benchmarking.
@jjmachan It should now be in recent versions of langchain.
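For reference, a minimal async usage sketch of the langchain-community integration (class name and parameters as found in recent langchain-community releases; verify against the version you have installed):

```python
import asyncio

from langchain_community.embeddings import InfinityEmbeddings

async def main() -> None:
    # Assumed model and server URL; check the langchain-community docs
    # for the version you have installed.
    embeddings = InfinityEmbeddings(
        model="BAAI/bge-small-en-v1.5",
        infinity_api_url="http://localhost:7997",
    )
    vectors = await embeddings.aembed_documents(["hello", "world"])
    print(len(vectors), len(vectors[0]))

asyncio.run(main())
```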
Hey all, looking forward to contributing this.
Nah, not stale! |
I am still waiting for a freaking PR review |
Describe the Feature
I would like to integrate https://github.com/michaelfeil/infinity for embeddings inference. It automatically batches up concurrent requests, uses FlashAttention-2, and is compatible with CUDA, ROCm, Apple MPS, and CPU.
Depending on usage, you might expect a 2.5x-22x throughput improvement / speedup over the default Hugging Face embeddings code in langchain.
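A hedged sketch of what "automatically batch up concurrent requests" means in practice: many small requests fired concurrently arrive at the server close together and are merged into larger model batches (the client code, model name, and URL below are illustrative):

```python
import asyncio

from langchain_community.embeddings import InfinityEmbeddings

# Illustrative client firing many small concurrent requests; the infinity
# server merges requests that arrive close together into larger batches.
async def main() -> None:
    embeddings = InfinityEmbeddings(
        model="BAAI/bge-small-en-v1.5",
        infinity_api_url="http://localhost:7997",
    )
    texts = [f"sentence {i}" for i in range(128)]
    vectors = await asyncio.gather(*(embeddings.aembed_query(t) for t in texts))
    print(f"embedded {len(vectors)} texts concurrently")

asyncio.run(main())
```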