Support TensorRT-LLM? #12474

Open
npuichigo opened this issue Oct 28, 2023 · 10 comments
Labels
🤖:enhancement (A large net-new component, integration, or chain. Use sparingly. The largest features)
Ɑ: models (Related to LLMs or chat model modules)

Comments

@npuichigo

Feature request

Any plan to support NVIDIA's latest TensorRT-LLM, maybe via the triton-inference-server backend?

Motivation

New integration

Your contribution

Test

@dosubot dosubot bot added the Ɑ: models (Related to LLMs or chat model modules) and 🤖:enhancement (A large net-new component, integration, or chain) labels on Oct 28, 2023
@dosubot

dosubot bot commented Oct 28, 2023

🤖

Hello @npuichigo,

It's good to see you again and thank you for your interest in LangChain.

As of the current version (v0.0.325), LangChain does not have any existing support or integration with NVIDIA's TensorRT or Triton Inference Server. The framework also does not have any dependencies or integrations with any NVIDIA software or libraries. You can confirm this by checking the project's dependencies in the pyproject.toml file.

However, the modular architecture of LangChain allows for the integration of a wide variety of tools, including language models. This is achieved through the use of "Chains", which are reusable components that encode a sequence of calls to components like models, document retrievers, other Chains, etc. You can find more details about this in the LangChain's architecture documentation.

Given this architecture, it's technically feasible to integrate TensorRT-LLM into the LangChain framework. However, the specific technical requirements and dependencies for such an integration, as well as the specific features of TensorRT-LLM that could enhance the functionality of LangChain, would need to be thoroughly evaluated.
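For illustration, here is a minimal sketch of what such a custom integration could look like, written against the current LLM base class. Nothing here is an existing LangChain or NVIDIA API: the class, endpoint URL, and JSON payload are placeholders for whatever actually serves the TensorRT-LLM engine.

```python
from typing import Any, List, Optional

import requests
from langchain.llms.base import LLM


class TensorRTLLMStub(LLM):
    """Hypothetical wrapper that forwards prompts to a server hosting a TensorRT-LLM engine."""

    endpoint: str = "http://localhost:8000/generate"  # placeholder URL, not a real TensorRT-LLM API

    @property
    def _llm_type(self) -> str:
        return "tensorrt_llm_stub"

    def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any) -> str:
        # Delegate generation to the external server; the JSON schema here is illustrative only.
        resp = requests.post(self.endpoint, json={"prompt": prompt, "stop": stop}, timeout=60)
        resp.raise_for_status()
        return resp.json()["text"]
```

A class like this can then be used in chains the same way the built-in LLM wrappers are.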

Your suggestion to support TensorRT-LLM is valuable and I encourage you to provide more details about the potential benefits and use cases. This would help the LangChain community to better understand the need for this integration and to evaluate its feasibility and priority.

Looking forward to your insights.

Best,
Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@npuichigo
Author

I reinvented the wheel and built an OpenAI-compatible API for TensorRT-LLM. Feel free to give it a try: https://github.com/npuichigo/openai_trtllm
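Since openai_trtllm exposes an OpenAI-compatible API, the existing LangChain OpenAI wrapper can be pointed at it simply by overriding the base URL. A rough sketch; the port and model name are assumptions, so substitute whatever your openai_trtllm deployment exposes:

```python
from langchain.llms import OpenAI

llm = OpenAI(
    openai_api_base="http://localhost:3000/v1",  # openai_trtllm endpoint (port is an assumption)
    openai_api_key="not-needed",                 # typically ignored by such proxies
    model_name="ensemble",                       # TensorRT-LLM model name in Triton (assumption)
)
print(llm("Say hello from TensorRT-LLM."))
```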

@fciannella

We have an official NVIDIA connector now. Still WIP: https://pypi.org/project/langchain-nvidia-trt/

@mkhludnev
Contributor

> We have an official NVIDIA connector now. Still WIP: https://pypi.org/project/langchain-nvidia-trt/

Thanks, @fciannella. Is https://github.com/langchain-ai/langchain/tree/master/libs/partners/nvidia-trt/langchain_nvidia_trt the source code for this package?
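From a quick read of that source, usage looks roughly like the sketch below. The class and parameter names come from the linked code, but the package is still WIP so they may change, and the server URL and model name are only examples:

```python
from langchain_nvidia_trt.llms import TritonTensorRTLLM

llm = TritonTensorRTLLM(
    server_url="localhost:8001",  # Triton gRPC endpoint (example)
    model_name="ensemble",        # TensorRT-LLM model name deployed in Triton (example)
)
print(llm.invoke("What is TensorRT-LLM?"))
```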

@fciannella

fciannella commented Jan 25, 2024 via email

@mkhludnev
Contributor

mkhludnev commented Jan 26, 2024

> Feel free to provide any feedback!

OK, thanks. I evaluated it in my environment. Here we go:

@mkhludnev
Contributor

@fciannella btw #16749 is worth a release!

@mkhludnev
Contributor

mkhludnev commented Feb 4, 2024

UPD

Triton doesn't stop the request (triton-inference-server/server#4818); I'll come up with PR #17764.

I use this code with a vLLM model deployed with:

backend: "vllm"

# Disable batching in Triton; let vLLM handle batching on its own.
max_batch_size: 0

# We need a decoupled transaction policy to saturate the
# vLLM engine for maximum throughput.
model_transaction_policy {
  decoupled: True
}

It seems OK, but it doesn't support an explicit stop signal:

def _send_stop_signals(self, model_name: str, request_id: str) -> None:

I had to change to force_batch=True; there's no argument to set it properly.

Also, I had to pass enable_empty_final_response=True in

self.client.async_stream_infer(

without it, triton_final_response is always False and the stream never ends.
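For reference, here is a standalone sketch of the streaming call I mean. This is not the langchain-nvidia-trt code itself: the model and tensor names follow the vLLM backend examples, and enable_empty_final_response assumes a reasonably recent tritonclient release.

```python
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient


def _callback(results: "queue.Queue", result, error) -> None:
    # Triton invokes this callback for every streamed response (or error).
    results.put(error if error is not None else result)


def stream_completion(prompt: str, url: str = "localhost:8001", model: str = "vllm_model"):
    results: "queue.Queue" = queue.Queue()
    client = grpcclient.InferenceServerClient(url=url)
    client.start_stream(callback=partial(_callback, results))
    try:
        text_input = grpcclient.InferInput("text_input", [1], "BYTES")
        text_input.set_data_from_numpy(np.array([prompt.encode()], dtype=np.object_))
        stream_flag = grpcclient.InferInput("stream", [1], "BOOL")
        stream_flag.set_data_from_numpy(np.array([True]))

        client.async_stream_infer(
            model_name=model,
            inputs=[text_input, stream_flag],
            request_id="0",
            enable_empty_final_response=True,  # otherwise the terminating response never arrives
        )

        while True:
            item = results.get()
            if isinstance(item, Exception):
                raise item
            response = item.get_response()
            if response.parameters["triton_final_response"].bool_param:
                break  # empty final response: the stream is complete
            yield item.as_numpy("text_output")[0].decode()
    finally:
        client.stop_stream()
        client.close()
```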

@mkhludnev
Contributor

mkhludnev commented Feb 4, 2024

The following problem resides not in this code base, but in the integration technology.

TL;DR

The Triton REST API seems fine, but gRPC, which is used by this code, seems prone to trimming the LLM response.

UPD:

It turns out the vLLM backend accepts max_tokens in a specific form, as part of the sampling_parameters tensor;
see https://github.com/triton-inference-server/vllm_backend/pull/34/files. It's worth considering in #15045.
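Concretely, on the client side that looks roughly like the snippet below; the sampling_parameters tensor name and the JSON keys follow the vLLM backend docs, and the values are just examples:

```python
import json

import numpy as np
import tritonclient.grpc as grpcclient

# vLLM sampling options (including max_tokens) travel as one JSON string in the
# optional "sampling_parameters" input tensor, not as separate per-option tensors.
sampling = json.dumps({"max_tokens": 512, "temperature": 0.7})
sampling_input = grpcclient.InferInput("sampling_parameters", [1], "BYTES")
sampling_input.set_data_from_numpy(np.array([sampling.encode()], dtype=np.object_))
# Append sampling_input to the inputs passed to async_stream_infer().
```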

mkhludnev added a commit to mkhludnev/vllm_backend that referenced this issue Feb 5, 2024
I'd rather have created an issue for discussion, but this repo doesn't have issues enabled.
First of all, such prepending seems redundant, especially for long RAG prompts. It's also an actual problem, since I noticed that Triton gRPC crops the long response. Curiously, REST doesn't crop the payload, and the full concatenation of prompt and output arrives at the client.

I put more details of the issue here: langchain-ai/langchain#12474 (comment)

@jasonngap1

> We have an official NVIDIA connector now. Still WIP: https://pypi.org/project/langchain-nvidia-trt/

Hi, are there any updates on when this integration will be released please?
