
LlaVa model in transformers #25060

Closed · RajeshRadha opened this issue Jul 24, 2023 · 12 comments

Comments

@RajeshRadha

Feature request

Support for the LLaVA model in transformers? https://github.com/haotian-liu/LLaVA. Similar to InstructBLIP, with a connector module between the image embeddings and the LLM.
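For context, a minimal sketch of the connector idea described above (module name and dimensions are hypothetical, not the upstream implementation; LLaVA v1 uses a single linear projection here, v1.5 a small MLP):

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Hypothetical LLaVA-style connector: projects vision-encoder patch
    features into the LLM's token-embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # LLaVA v1 uses a single linear layer (v1.5 replaces it with an MLP).
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim), e.g. from CLIP ViT-L
        return self.proj(image_features)  # -> (batch, num_patches, llm_dim)

connector = VisionLanguageConnector()
image_tokens = connector(torch.randn(1, 576, 1024))  # 576 = 24x24 CLIP patches
# image_tokens are then spliced into the text embedding sequence at the
# image placeholder position before the LLM forward pass.
```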

Motivation

LLaVA performs really well on MLLM-related tasks, and having it in Hugging Face would make it easier for folks to compare InstructBLIP and LLaVA, since both mostly use the same image encoder embeddings (EVA, ViT, or CLIP) and foundation models (T5, Vicuna, or Llama-2). Code maintenance and integration would be easy.

Your contribution

I can definitely help with a PR, or tag along with folks at Hugging Face to make it happen.

@RajeshRadha changed the title from "LlaVa model in trwansformers" to "LlaVa model in transformers" Jul 24, 2023
@ydshieh self-assigned this Jul 26, 2023
@ydshieh (Collaborator) commented Jul 26, 2023

Hi @RajeshRadha, thank you for the feature request.

As @ArthurZucker mentioned to me, the repo has reached 4K stars and 300 forks, so it seems this model is quite popular.

I'll leave it to our core maintainers @amyeroberts and @sgugger to decide whether this qualifies the model to be in transformers, or whether we'd still prefer to have it on the Hub first.

@amyeroberts (Collaborator)

Given the popularity and performance of the model, I think it'd be a good addition to transformers :)

@RajeshRadha if you'd like to add the model, feel free to open a PR and tag @ArthurZucker and myself for review.

@ArthurZucker (Collaborator)

Just for reference, before the model got so popular, #22848 and #23849 were opened!

@ZeguanXiao

Any update on this model? #23849 is closed and inactive.

@ArthurZucker (Collaborator)

cc @rafaelpadilla and @amyeroberts if one of you has the bandwidth

@amyeroberts (Collaborator)

Unfortunately, I won't have time before I'm off :( If @rafaelpadilla or anyone in the community would like to add this model, it would be a great addition!

@huggingface deleted a comment from github-actions bot Sep 19, 2023
@shauray8 mentioned this issue Sep 25, 2023
@huggingface deleted a comment from github-actions bot Oct 14, 2023
@huggingface deleted a comment from github-actions bot Nov 8, 2023
@huggingface deleted a comment from github-actions bot Dec 3, 2023
@ArthurZucker (Collaborator)

The PR will be merged in the coming week 😉

@ydshieh removed their assignment Dec 4, 2023
github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ArthurZucker (Collaborator)

#27662 closes this
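For anyone landing here later, a minimal usage sketch assuming the post-merge transformers API (the model id llava-hf/llava-1.5-7b-hf, LlavaForConditionalGeneration, and the prompt format are taken from the transformers docs following #27662; verify against your installed version):

```python
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Fetch a sample image and build a LLaVA-1.5-style prompt.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
```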

@RonanKMcGovern

This is a great integration. As a further step, it would be great to have an API for multi-modal models.

I think it's unlikely that TGI (see here) or vLLM would integrate multi-modal models, as they're too different.

There is a (closed) PR on the Llava project that allows for a simple single-call API. Possibly building on that is a good way to go.

A key feature I see as valuable is continuous batching; that's what really allows devs to spin up a multi-modal endpoint for production (see the toy sketch after the questions below).

Questions

  • Is it too much of a stretch to try and add continuous batching to transformers? I'm guessing yes, because for LLMs, that has been offloaded to TGI.
  • Are there other angles that should be considered generally for getting to a multi-modal API?
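To make the continuous-batching idea concrete, here is a toy, framework-agnostic sketch of the scheduling logic (not TGI or vLLM code; request lengths and tokens are made up). The point is that finished sequences leave the batch each step and queued requests join immediately, rather than waiting for the whole batch to drain:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    tokens_left: int                     # decode steps remaining (stand-in for EOS)
    output: list = field(default_factory=list)

def continuous_batching(requests, max_batch=4):
    queue = deque(requests)
    active = []
    step = 0
    while queue or active:
        # Admit new requests as soon as slots free up -- the key difference
        # from static batching, which waits for the entire batch to finish.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        step += 1
        for req in active:
            req.output.append(f"tok{step}")  # stand-in for one decode step
            req.tokens_left -= 1
        # Evict completed sequences so their slots are reusable next step.
        active = [r for r in active if r.tokens_left > 0]
    return step

reqs = [Request(rid=i, tokens_left=n) for i, n in enumerate([3, 8, 2, 5, 4])]
print("total decode steps:", continuous_batching(reqs))
```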

@younesbelkada (Contributor)

Thanks @RonanKMcGovern for your feedback! I think TGI could support multi-modal models, as they did it in the past with IDEFICS, if I am not mistaken. cc @OlivierDehaene

@RonanKMcGovern

Thanks @younesbelkada, that makes sense intuitively. IDEFICS (Flamingo-style models) has a single tokenizer, whether the input is image or text (if I'm not mistaken), which makes it easier to plug and play in TGI.

I see that as a pretty significant advantage. Without a good inference endpoint, LLaVA just isn't as useful, because devs can't use it well in production.

I need to read more on why LLaVA 1.6 is stronger than IDEFICS. I guess IDEFICS has the drawback that it had to be trained entirely from scratch.

Makes me wonder whether it would have been better to take an IDEFICS approach in making LLaVA.
