Support for Async Embeddings via michaelfeil/infinity #596
Comments
Hey @michaelfeil, thank you for sharing this, man. We would love to have this; are you interested in working on it?
@shahules786 Perhaps I'll just push the pure Python / async code into langchain directly, then it should be reusable, right?
Hey @michaelfeil, this would be awesome. Like you said, if you drop something into langchain, that will be the easiest for you in terms of time spent. What we would love to do is build an integration doc with infinity and showcase how fast it is and how it improves things for people using Ragas as well, hopefully driving some traffic your way. If you check this section, you'll see we embed a lot of chunks in sequence, and performance is limited by how the embedding model is served. Maybe we can do a comparison here? Would that be something you're interested in? (See ragas/src/ragas/testset/docstore.py, lines 229 to 250 at 27e1c24.)
We can do other comparisons too, but the LLM is the limiting factor for performance there, so there won't be much of a difference; the above use case would be solid for a comparison. Let me know if it's something that interests you :)
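A minimal sketch of the kind of comparison discussed above, assuming the langchain-community infinity client and an infinity server running on localhost (the model name, URL, and chunk data are illustrative, not from the ragas codebase):

```python
import time

from langchain_community.embeddings import InfinityEmbeddings

# Assumed deployment: an infinity server on localhost serving this model.
embeddings = InfinityEmbeddings(
    model="BAAI/bge-small-en-v1.5",
    infinity_api_url="http://localhost:7997",
)

chunks = [f"document chunk {i} ..." for i in range(256)]

# Sequential: one request per chunk, mirroring an embed-in-a-loop docstore.
start = time.perf_counter()
for chunk in chunks:
    embeddings.embed_query(chunk)
print(f"sequential: {time.perf_counter() - start:.2f}s")

# Batched: one request for all chunks; infinity batches them on the GPU.
start = time.perf_counter()
embeddings.embed_documents(chunks)
print(f"batched:    {time.perf_counter() - start:.2f}s")
```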
Would be interesting. FYI, I added the PR for langchain here; it took me some hours over the weekend, and I hope it gets merged soon: langchain-ai/langchain#17671. I would not recommend submitting the nodes (assuming each node has one sentence) one at a time with ThreadPoolExecutor. At a minimum, batch the requests; this will help whatever backend you use, even APIs.
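A hedged sketch of that batching advice: group nodes into batches before submitting, so each request carries many sentences instead of one thread per single-sentence node (batch size, worker count, and the server setup are assumptions for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

from langchain_community.embeddings import InfinityEmbeddings

# Assumed server URL and model; adjust to your deployment.
embeddings = InfinityEmbeddings(
    model="BAAI/bge-small-en-v1.5",
    infinity_api_url="http://localhost:7997",
)

def embed_all(nodes: list[str], batch_size: int = 32) -> list[list[float]]:
    # Group nodes into batches so each request carries many sentences,
    # instead of one ThreadPoolExecutor task per single-sentence node.
    batches = [nodes[i : i + batch_size] for i in range(0, len(nodes), batch_size)]
    vectors: list[list[float]] = []
    # A handful of concurrent *batch* requests is plenty; the server
    # does its own dynamic batching on top of this.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for batch_vectors in pool.map(embeddings.embed_documents, batches):
            vectors.extend(batch_vectors)
    return vectors
```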
FYI all, this is now finally in langchain (community; see the PR mentioned above). Also, you might be interested in https://github.com/michaelfeil/infinity/blob/1fe3a34e295c95fc4a75297de842ec55c6761457/docs/benchmarks/benchmarking.md for benchmarking.
@jjmachan It should now be in recent versions of langchain.
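For reference, a minimal async usage sketch of the langchain-community integration (class name and parameters as found in recent langchain-community releases; verify against the version you have installed):

```python
import asyncio

from langchain_community.embeddings import InfinityEmbeddings

async def main() -> None:
    # Assumed model and server URL; check the langchain-community docs
    # for the version you have installed.
    embeddings = InfinityEmbeddings(
        model="BAAI/bge-small-en-v1.5",
        infinity_api_url="http://localhost:7997",
    )
    vectors = await embeddings.aembed_documents(["hello", "world"])
    print(len(vectors), len(vectors[0]))

asyncio.run(main())
```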
Hey all, looking forward to contributing this.
Nah, not stale! |
I am still waiting for a freaking PR review |
Describe the Feature
I would like to integrate https://github.com/michaelfeil/infinity for embeddings inference. It automatically batches up concurrent requests, uses FlashAttention-2, and is compatible with CUDA, ROCm, Apple MPS, and CPU.
Depending on usage, you might expect a 2.5x-22x throughput improvement / speedup over the default Hugging Face embeddings code in langchain.
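A hedged sketch of what "automatically batch up concurrent requests" means in practice: many small requests fired concurrently arrive at the server close together and are merged into larger model batches (the client code, model name, and URL below are illustrative):

```python
import asyncio

from langchain_community.embeddings import InfinityEmbeddings

# Illustrative client firing many small concurrent requests; the infinity
# server merges requests that arrive close together into larger batches.
async def main() -> None:
    embeddings = InfinityEmbeddings(
        model="BAAI/bge-small-en-v1.5",
        infinity_api_url="http://localhost:7997",
    )
    texts = [f"sentence {i}" for i in range(128)]
    vectors = await asyncio.gather(*(embeddings.aembed_query(t) for t in texts))
    print(f"embedded {len(vectors)} texts concurrently")

asyncio.run(main())
```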