LLaVA model in transformers #25060
Comments
Hi @RajeshRadha, thank you for the feature request. As @ArthurZucker mentioned to me, the repo has reached 4K stars and 300 forks, so it seems this is quite popular. Will leave it to our core maintainers @amyeroberts and @sgugger to see if this qualifies the model to be in transformers.
Given the popularity and performance of the model, I think it'd be a good addition to transformers. @RajeshRadha, if you'd like to add the model, feel free to open a PR and tag @ArthurZucker and myself for review.
Any update on this model? #23849 is closed and inactive.
cc @rafaelpadilla and @amyeroberts if one of you has the bandwidth
I won't have time unfortunately before I'm off :( If @rafaelpadilla or anyone in the community would like to add this model, it would be a great addition!
PR will be merged in the coming week 😉
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
#27662 closes this
This is a great integration. As a further step, it would be great to have an API for multi-modal models. I think it's unlikely TGI (see here) or vLLM would integrate multi-modal support, as it's too different. There is a (closed) PR on the Llava project that allows for a simple single-call API; possibly building on that is a good way to go. A key feature I see as valuable is continuous batching, since this is what really allows devs to spin up a multi-modal endpoint for production. Questions
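For readers unfamiliar with the term: continuous batching means new requests join the decode batch as soon as any slot frees up, rather than waiting for the whole batch to finish. A minimal scheduling sketch (the request names, token counts, and `max_batch` value below are hypothetical, not from any serving framework):

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """Simulate continuous batching: each step, every active request emits
    one token, and freed slots are refilled from the waiting queue
    immediately. `requests` is a list of (name, tokens_to_generate)."""
    waiting = deque(requests)
    active = {}      # name -> tokens still to generate
    timeline = []    # which requests decoded together at each step
    while waiting or active:
        # admit waiting requests into any free batch slots
        while waiting and len(active) < max_batch:
            name, n_tokens = waiting.popleft()
            active[name] = n_tokens
        timeline.append(sorted(active))
        # one decode step for every active request
        for name in list(active):
            active[name] -= 1
            if active[name] == 0:
                del active[name]  # slot frees up for the next step
    return timeline

steps = continuous_batching([("a", 3), ("b", 1), ("c", 2)])
print(steps)  # [['a', 'b'], ['a', 'c'], ['a', 'c']]
```

Note how request "c" slots in as soon as the short request "b" finishes, instead of waiting for "a" to complete; with static batching the same workload would take more decode steps.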
Thanks @RonanKMcGovern for your feedback! I think TGI could support multi-modal models, as they did it in the past with idefics if I am not mistaken. cc @OlivierDehaene
Thanks @younesbelkada, that makes sense intuitively. IDEFICS (Flamingo-style models) has a single tokenizer, whether the input is image or text (if I'm not mistaken), so that makes it easier plug-and-play for TGI. I see that as a pretty significant advantage. Without a good inference endpoint, Llava just isn't as useful, because devs can't use it well in production. I need to read more on why Llava 1.6 is stronger than IDEFICS. I guess IDEFICS has the drawback that it had to be entirely trained from scratch. Makes me wonder whether it would have been better to take an IDEFICS approach in making Llava.
Feature request
Support for the Llava model in transformers? https://github.com/haotian-liu/LLaVA Similar to InstructBlip, with a connector module between the image embeddings and the LLM.
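To illustrate the connector idea: a vision encoder yields one feature vector per image patch, a learned projection maps each into the LLM's embedding space, and the projected image tokens are placed alongside the text embeddings. A toy sketch with made-up dimensions (real models use much larger sizes, e.g. 1024-dim vision features projected to a 4096-dim LLM):

```python
# Hypothetical toy dimensions, for illustration only.
d_vision, d_llm = 4, 6

def project(patch_feats, W):
    """Apply the d_vision x d_llm projection matrix W to each patch vector."""
    return [[sum(f[i] * W[i][j] for i in range(d_vision))
             for j in range(d_llm)] for f in patch_feats]

W = [[0.01] * d_llm for _ in range(d_vision)]        # connector weights
image_feats = [[1.0] * d_vision for _ in range(3)]   # 3 patch features
text_embeds = [[0.0] * d_llm for _ in range(2)]      # 2 text-token embeddings

image_embeds = project(image_feats, W)
sequence = image_embeds + text_embeds                # image tokens, then text
print(len(sequence), len(sequence[0]))               # 5 6
```

The point is that after the projection, image and text tokens live in the same embedding space, so the LLM processes them as one sequence; this is the sense in which Llava and InstructBlip share the same overall recipe.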
Motivation
Llava performs really well on MLLM-related tasks, and having it in Hugging Face would make it easier for folks to try out InstructBlip vs Llava models, since they mostly use the same image-encoder embeddings (from EVA, ViT, or CLIP) and foundation models (T5, Vicuna, or Llama-2). Code maintenance and ease of integration should be straightforward.
Your contribution
I can definitely help with a PR or tag along with folks in hugging face to make it happen