
[Feat] add I2VGenXL for image-to-video generation #6665

Merged: 184 commits merged into main from convert-i2vgen-xl on Jan 31, 2024.

Conversation

@sayakpaul (Member) commented Jan 22, 2024

What does this PR do?

Fixes: #6186.

Test code

import torch
from diffusers import I2VGenXLPipeline
from diffusers.utils import load_image, export_to_gif

# repo is currently private to the diffusers team
repo_id = "diffusers/i2vgen-xl" # TODO change checkpoint path after move
pipeline = I2VGenXLPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, variant="fp16").to("cuda")

image_url = "https://github.com/ali-vilab/i2vgen-xl/blob/main/data/test_images/img_0009.png?download=true"
image = load_image(image_url).convert("RGB")
prompt = "Papers were floating in the air on a table in the library"

generator = torch.manual_seed(8888)
frames = pipeline(
    prompt=prompt,
    image=image,
    generator=generator
).frames[0]

print(export_to_gif(frames))

For memory optimization, use enable_model_cpu_offload().
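
For reference, a minimal sketch of how that call slots into the test snippet above (same placeholder repo id; with model CPU offload enabled, skip the explicit .to("cuda")):

    import torch
    from diffusers import I2VGenXLPipeline

    # Sketch: load in fp16 and let diffusers shuttle submodules to the GPU on demand
    # instead of keeping the whole pipeline resident there.
    pipeline = I2VGenXLPipeline.from_pretrained(
        "diffusers/i2vgen-xl", torch_dtype=torch.float16, variant="fp16"
    )
    pipeline.enable_model_cpu_offload()  # replaces the .to("cuda") call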

TODO

  • Debug video quality
  • Docs
  • Tests
  • Eliminate einops dependency
  • General cleanup


@DN6 (Collaborator) commented Jan 31, 2024

@yiyixuxu Agreed that those embeddings can be differentiated better. Cool if I open up a follow-up PR to clean it up?

@sayakpaul (Member, Author)

> @yiyixuxu Agreed that those embeddings can be differentiated better. Cool if I open up a follow-up PR to clean it up?

Which comment is being referred to here?

@patrickvonplaten (Contributor)

Still a couple of failing tests here.

@sayakpaul (Member, Author)

@DN6 could you look into it once?

@yiyixuxu (Collaborator)

> @yiyixuxu Agreed that those embeddings can be differentiated better. Cool if I open up a follow-up PR to clean it up?

Sure!

@sayakpaul (Member, Author)

Could I please see the link to the comment that is being referred to here? I am really unable to make sense of what’s meant by embedding cleanup.

@DN6 (Collaborator) commented Jan 31, 2024

> Maybe we should rename the tensor to something like image_context_emb so that it does not have almost the same name as the module layer :) A little bit hard to read like this for me. Just a suggestion, not super important though! :)
>
> Same goes for the context_embeddings/context_embedding.

@sayakpaul This comment. But I addressed it, and the tests should also be fixed.
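
For illustration only, the readability issue being described is roughly the following; the class and shapes here are hypothetical, not the actual pipeline code:

    import torch
    from torch import nn

    class Example(nn.Module):
        def __init__(self):
            super().__init__()
            self.context_embedding = nn.Linear(16, 32)  # module layer

        def forward(self, image_embeds: torch.Tensor) -> torch.Tensor:
            # Hard to tell apart at a glance (tensor vs. layer with almost the same name):
            #   context_embeddings = self.context_embedding(image_embeds)
            # Renaming the tensor, e.g. to image_context_emb, keeps the two distinct:
            image_context_emb = self.context_embedding(image_embeds)
            return image_context_emb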

@sayakpaul (Member, Author)

Cool thanks!

I think we still need to update the pipeline IDs here as well as the model card. We can then merge!

    ff_output = gate_mlp.unsqueeze(1) * ff_output
-elif self.use_ada_layer_norm_single:
+elif self.norm_type == "ada_norm_single":

Contributor: Great job here! Much cleaner now :-)

layers_per_block: int = 2,
norm_num_groups: Optional[int] = 32,
cross_attention_dim: int = 1024,
num_attention_heads: Optional[Union[int, Tuple[int]]] = 64,
Contributor: nice clean up!

):
    super().__init__()

    self.sample_size = sample_size

Contributor: should not be needed ideally because one can access it with self.config.sample_size, but ok!


self.transformer_in = TransformerTemporalModel(
    num_attention_heads=8,
    attention_head_dim=num_attention_heads,

Contributor: attention_head_dim=num_attention_heads looks a bit weird - I think the class TransformerTemporalModel has bad naming here, no?


Collaborator: Ohhh, I see where this is coming from. I think we should only keep the attention_head_dim argument for I2VGenXLUNet, and then we can do:

self.transformer_in = TransformerTemporalModel(
            num_attention_heads=8,
            attention_head_dim=attention_head_dim,
            in_channels=block_out_channels[0],
            num_layers=1,
            norm_num_groups=norm_num_groups,
        )

Collaborator: cc @DN6 here

Collaborator: TransformerTemporalModel does not have a bad naming issue, and this is a new model, so we definitely do not want to introduce the bad naming here (forcing the user to pass what is actually attention_head_dim as num_attention_heads is very much a bad naming practice).

Member (Author): I think we should keep both, i.e., attention_head_dim and num_attention_heads, because if we only stick with attention_head_dim, we will end up with something like:

down_block = get_down_block(
    down_block_type,
    num_layers=layers_per_block,
    in_channels=input_channel,
    out_channels=output_channel,
    temb_channels=time_embed_dim,
    add_downsample=not is_final_block,
    resnet_eps=1e-05,
    resnet_act_fn="silu",
    resnet_groups=norm_num_groups,
    cross_attention_dim=cross_attention_dim,
    num_attention_heads=attention_head_dim[i],
    downsample_padding=1,
    dual_cross_attention=False,
)

See how num_attention_heads is being assigned here. This again looks pretty bad naming-wise.

Collaborator: See my comments here: https://github.com/huggingface/diffusers/pull/6665/files#r1477715567

I think we can still do:

down_block = get_down_block(
    down_block_type,
    num_layers=layers_per_block,
    in_channels=input_channel,
    out_channels=output_channel,
    temb_channels=time_embed_dim,
    add_downsample=not is_final_block,
    resnet_eps=1e-05,
    resnet_act_fn="silu",
    resnet_groups=norm_num_groups,
    cross_attention_dim=cross_attention_dim,
    num_attention_heads=num_attention_heads[i],
    downsample_padding=1,
    dual_cross_attention=False,
)

logger = logging.get_logger(__name__) # pylint: disable=invalid-name


def _to_tensor(inputs, device):
Contributor: Let's please not forget to clean this up after we merge (cc @sayakpaul @yiyixuxu). This function is only used once - no need to move it into a function.

@yiyixuxu yiyixuxu merged commit 04cd6ad into main Jan 31, 2024
@yiyixuxu yiyixuxu deleted the convert-i2vgen-xl branch January 31, 2024 20:39
layers_per_block: int = 2,
norm_num_groups: Optional[int] = 32,
cross_attention_dim: int = 1024,
num_attention_heads: Optional[Union[int, Tuple[int]]] = 64,
Collaborator: this should be attention_head_dim, no? @DN6 @sayakpaul

@sayakpaul (Member, Author) commented Feb 1, 2024

I don't think so. I am unable to find the comment from @patrickvonplaten, but look at how it's handled in UNet3D:

https://github.com/huggingface/diffusers/blob/c3369f56734beef9c4768d04cce490fdcc1c9162/src/diffusers/models/unets/unet_3d_condition.py#L130C1-L141C72

We can basically get rid of attention_head_dim and use num_attention_heads throughout to rectify the incorrect naming.

Edit: Found the comments:

dg845 pushed a commit to dg845/diffusers that referenced this pull request Feb 2, 2024
@vladmandic (Contributor)

This PR breaks a lot of existing pipelines, as it redefines the norm types away from what has been very commonly used (and was introduced a long time ago). For example, StableDiffusionReferencePipeline now fails with:

AttributeError: 'BasicTransformerBlock' object has no attribute 'use_ada_layer_norm'
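
For context, the refactor replaced the per-variant boolean flags on BasicTransformerBlock (such as use_ada_layer_norm) with a single norm_type string, as visible in the diff above. A hedged sketch of how downstream code could check both forms; the "ada_norm" value mirrors the "ada_norm_single" rename and is an assumption, not confirmed API:

    def uses_ada_layer_norm(block) -> bool:
        # Pre-refactor: the block exposed a boolean flag.
        flag = getattr(block, "use_ada_layer_norm", None)
        if flag is not None:
            return bool(flag)
        # Post-refactor: fall back to the norm_type string (assumed value).
        return getattr(block, "norm_type", None) == "ada_norm"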

@sayakpaul (Member, Author)

I welcome you to create an issue thread with a reproducible code snippet. We will fix it ASAP and, if needed, do a patch release :-)

@vladmandic (Contributor) commented Feb 3, 2024

I just did; I wanted to make a note here first (it took me a bit to trace the error down).

resnet_act_fn="silu",
resnet_groups=norm_num_groups,
cross_attention_dim=cross_attention_dim,
num_attention_heads=num_attention_heads[i],
@yiyixuxu (Collaborator) commented Feb 5, 2024

So I think there is a mistake here: get_down_block() expects num_attention_heads to actually be the number of attention heads.

Look at how num_attention_heads is handled inside UNet3DConditionModel:

First, it fixes the "bad" model config name as soon as it receives the argument, so from that point on the variable num_attention_heads is "corrected" for the rest of the code:

num_attention_heads = num_attention_heads or attention_head_dim

From there, we can just pass it around as it is, e.g. it is passed as num_attention_heads in get_down_block():

num_attention_heads=num_attention_heads[i],

In our case, we:

  1. also have the "bad" model config, i.e. for this model num_attention_heads is actually attention_head_dim;
  2. however, we only "correct" it for TransformerTemporalModel (shown below), so for the rest of the code num_attention_heads still means attention_head_dim - and that's incorrect, e.g. we passed "attention_head_dim" as num_attention_heads to get_down_block().
        self.transformer_in = TransformerTemporalModel(
            num_attention_heads=8,
            attention_head_dim=num_attention_heads,
            in_channels=block_out_channels[0],
            num_layers=1,
            norm_num_groups=norm_num_groups,
        )
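
To make the distinction concrete, here is the generic relationship between the two names that this discussion assumes (plain attention arithmetic, not the UNet code itself):

    # The hidden size of an attention layer is split across heads:
    num_attention_heads = 8
    attention_head_dim = 64
    inner_dim = num_attention_heads * attention_head_dim  # 512

    # The problem described above: this model's config stores the per-head dimension
    # under the name num_attention_heads, so forwarding it unchanged to
    # get_down_block(num_attention_heads=...) silently swaps the two quantities.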

Member (Author): I am quite confused now.

> First, it fixes the "bad" model config name as soon as it receives the argument, so from that point on the variable num_attention_heads is "corrected" for the rest of the code.

This is what we didn't want to do in this UNet, i.e., the comments in #6665 (comment) suggested that there shouldn't be any attention_head_dim. We want to get rid of those corrections made in UNet3DConditionModel for num_attention_heads.

Do you mean we initialize another new variable called attention_head_dim based on what's passed to num_attention_heads?
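
If it helps, the option being asked about would look roughly like this; a hypothetical sketch of the __init__ entry point, not the merged code:

    class SketchUNet:
        # Hypothetical: keep the (badly named) public argument for config compatibility,
        # but immediately introduce a correctly named internal variable so the rest of
        # __init__ reads unambiguously.
        def __init__(self, num_attention_heads=64):
            attention_head_dim = num_attention_heads  # the config value is really the per-head dim
            self.attention_head_dim = attention_head_dim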

AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024
Labels: video (video generation)
Linked issues that may be closed by merging this PR: i2vgen-xl implement
6 participants