Support TensorRT-LLM? #12474

Open
npuichigo opened this issue Oct 28, 2023 · 10 comments
Labels
🤖:enhancement (A large net-new component, integration, or chain. Use sparingly. The largest features)
Ɑ: models (Related to LLMs or chat model modules)

Comments

@npuichigo

Feature request

Any plan to support NVIDIA's latest TensorRT-LLM, maybe via the triton-inference-server backend?

Motivation

New integration

Your contribution

Test

@dosubot dosubot bot added the Ɑ: models (Related to LLMs or chat model modules) and 🤖:enhancement (A large net-new component, integration, or chain) labels on Oct 28, 2023
@dosubot

dosubot bot commented Oct 28, 2023

🤖

Hello @npuichigo,

It's good to see you again and thank you for your interest in LangChain.

As of the current version (v0.0.325), LangChain does not have any existing support or integration with NVIDIA's TensorRT or Triton Inference Server. The framework also does not have any dependencies or integrations with any NVIDIA software or libraries. You can confirm this by checking the project's dependencies in the pyproject.toml file.

However, the modular architecture of LangChain allows for the integration of a wide variety of tools, including language models. This is achieved through the use of "Chains", which are reusable components that encode a sequence of calls to components like models, document retrievers, other Chains, etc. You can find more details about this in the LangChain's architecture documentation.

Given this architecture, it's technically feasible to integrate TensorRT-LLM into the LangChain framework. However, the specific technical requirements and dependencies for such an integration, as well as the specific features of TensorRT-LLM that could enhance the functionality of LangChain, would need to be thoroughly evaluated.
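For illustration, here is a minimal sketch of what such a custom integration could look like, written against the current LLM base class. Nothing here is an existing LangChain or NVIDIA API: the class, endpoint URL, and JSON payload are placeholders for whatever actually serves the TensorRT-LLM engine.

```python
from typing import Any, List, Optional

import requests
from langchain.llms.base import LLM


class TensorRTLLMStub(LLM):
    """Hypothetical wrapper that forwards prompts to a server hosting a TensorRT-LLM engine."""

    endpoint: str = "http://localhost:8000/generate"  # placeholder URL, not a real TensorRT-LLM API

    @property
    def _llm_type(self) -> str:
        return "tensorrt_llm_stub"

    def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any) -> str:
        # Delegate generation to the external server; the JSON schema here is illustrative only.
        resp = requests.post(self.endpoint, json={"prompt": prompt, "stop": stop}, timeout=60)
        resp.raise_for_status()
        return resp.json()["text"]
```

A class like this can then be used in chains the same way the built-in LLM wrappers are.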

Your suggestion to support TensorRT-LLM is valuable and I encourage you to provide more details about the potential benefits and use cases. This would help the LangChain community to better understand the need for this integration and to evaluate its feasibility and priority.

Looking forward to your insights.

Best,
Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@npuichigo
Author

I reinvented the wheel and built an OpenAI-compatible API for TensorRT-LLM. Feel free to give it a try: https://github.com/npuichigo/openai_trtllm
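Since openai_trtllm exposes an OpenAI-compatible API, the existing LangChain OpenAI wrapper can be pointed at it simply by overriding the base URL. A rough sketch; the port and model name are assumptions, so substitute whatever your openai_trtllm deployment exposes:

```python
from langchain.llms import OpenAI

llm = OpenAI(
    openai_api_base="http://localhost:3000/v1",  # openai_trtllm endpoint (port is an assumption)
    openai_api_key="not-needed",                 # typically ignored by such proxies
    model_name="ensemble",                       # TensorRT-LLM model name in Triton (assumption)
)
print(llm("Say hello from TensorRT-LLM."))
```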

@fciannella

We have an official NVIDIA connector now. Still WIP: https://pypi.org/project/langchain-nvidia-trt/

@mkhludnev
Contributor

> We have an official NVIDIA connector now. Still WIP: https://pypi.org/project/langchain-nvidia-trt/

Thanks, @fciannella. Is https://github.com/langchain-ai/langchain/tree/master/libs/partners/nvidia-trt/langchain_nvidia_trt the source code for this package?
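From a quick read of that source, usage looks roughly like the sketch below. The class and parameter names come from the linked code, but the package is still WIP so they may change, and the server URL and model name are only examples:

```python
from langchain_nvidia_trt.llms import TritonTensorRTLLM

llm = TritonTensorRTLLM(
    server_url="localhost:8001",  # Triton gRPC endpoint (example)
    model_name="ensemble",        # TensorRT-LLM model name deployed in Triton (example)
)
print(llm.invoke("What is TensorRT-LLM?"))
```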

@fciannella

fciannella commented Jan 25, 2024 via email

@mkhludnev
Contributor

mkhludnev commented Jan 26, 2024

> Feel free to provide any feedback!

OK, thanks. I evaluated it in my environment. Here we go:

@mkhludnev
Contributor

@fciannella btw #16749 is worth a release!

@mkhludnev
Contributor

mkhludnev commented Feb 4, 2024

UPD

Triton doesn't stop the request (triton-inference-server/server#4818); I'll come up with PR #17764.

I use this code with a vLLM model deployed with:

backend: "vllm"

# Disable batching in Triton; let vLLM handle batching on its own.
max_batch_size: 0

# We need a decoupled transaction policy to saturate the
# vLLM engine for maximum throughput.
model_transaction_policy {
  decoupled: True
}

It seems OK, but it doesn't support an explicit stop signal:

def _send_stop_signals(self, model_name: str, request_id: str) -> None:

I had to change to force_batch=True; there's no argument to set it properly.

Also, I had to pass enable_empty_final_response=True in

self.client.async_stream_infer(

without it, triton_final_response is always False and the stream never ends.
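For reference, here is a standalone sketch of the streaming call I mean. This is not the langchain-nvidia-trt code itself: the model and tensor names follow the vLLM backend examples, and enable_empty_final_response assumes a reasonably recent tritonclient release.

```python
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient


def _callback(results: "queue.Queue", result, error) -> None:
    # Triton invokes this callback for every streamed response (or error).
    results.put(error if error is not None else result)


def stream_completion(prompt: str, url: str = "localhost:8001", model: str = "vllm_model"):
    results: "queue.Queue" = queue.Queue()
    client = grpcclient.InferenceServerClient(url=url)
    client.start_stream(callback=partial(_callback, results))
    try:
        text_input = grpcclient.InferInput("text_input", [1], "BYTES")
        text_input.set_data_from_numpy(np.array([prompt.encode()], dtype=np.object_))
        stream_flag = grpcclient.InferInput("stream", [1], "BOOL")
        stream_flag.set_data_from_numpy(np.array([True]))

        client.async_stream_infer(
            model_name=model,
            inputs=[text_input, stream_flag],
            request_id="0",
            enable_empty_final_response=True,  # otherwise the terminating response never arrives
        )

        while True:
            item = results.get()
            if isinstance(item, Exception):
                raise item
            response = item.get_response()
            if response.parameters["triton_final_response"].bool_param:
                break  # empty final response: the stream is complete
            yield item.as_numpy("text_output")[0].decode()
    finally:
        client.stop_stream()
        client.close()
```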

@mkhludnev
Contributor

mkhludnev commented Feb 4, 2024

The following problem resides not in this code base, but in the integration technology.

TL;DR

The Triton REST API seems fine, but gRPC, which is used by this code, seems prone to trimming the LLM response.

UPD:

It turns out the vLLM backend accepts max_tokens in a specific form, as part of the sampling_parameters tensor;
see https://github.com/triton-inference-server/vllm_backend/pull/34/files. It's worth considering in #15045.
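Concretely, on the client side that looks roughly like the snippet below; the sampling_parameters tensor name and the JSON keys follow the vLLM backend docs, and the values are just examples:

```python
import json

import numpy as np
import tritonclient.grpc as grpcclient

# vLLM sampling options (including max_tokens) travel as one JSON string in the
# optional "sampling_parameters" input tensor, not as separate per-option tensors.
sampling = json.dumps({"max_tokens": 512, "temperature": 0.7})
sampling_input = grpcclient.InferInput("sampling_parameters", [1], "BYTES")
sampling_input.set_data_from_numpy(np.array([sampling.encode()], dtype=np.object_))
# Append sampling_input to the inputs passed to async_stream_infer().
```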

mkhludnev added a commit to mkhludnev/vllm_backend that referenced this issue Feb 5, 2024
I'd rather have created an issue for discussion, but this repo doesn't have issues enabled.
First of all, such prepending seems redundant, especially for long RAG prompts. It's also an actual problem, since I noticed that Triton gRPC crops the long response. Curiously, REST doesn't crop the payload, and the full concatenation of prompt and output arrives at the client.

I put more details of the issue here: langchain-ai/langchain#12474 (comment)

@jasonngap1

> We have an official NVIDIA connector now. Still WIP: https://pypi.org/project/langchain-nvidia-trt/

Hi, are there any updates on when this integration will be released please?
