[Feat] add I2VGenXL for image-to-video generation #6665
Conversation
@yiyixuxu Agreed that those embeddings can be differentiated better. Cool if I open up a follow-up PR to clean it up?
Which comment is being referred to here?
Still a couple of failing tests here.
@DN6 could you look into it once?
Sure!
Could I please see the link to the comment that is being referred to here? I am really unable to make sense of what's meant by embedding cleanup.
@sayakpaul This comment. But I addressed it and tests should also be fixed.
Cool, thanks! I think we still need to update the pipeline IDs here as well as the model card. We can then merge!
  ff_output = gate_mlp.unsqueeze(1) * ff_output
- elif self.use_ada_layer_norm_single:
+ elif self.norm_type == "ada_norm_single":
Great job here! Much cleaner now :-)
layers_per_block: int = 2,
norm_num_groups: Optional[int] = 32,
cross_attention_dim: int = 1024,
num_attention_heads: Optional[Union[int, Tuple[int]]] = 64,
nice clean up!
):
    super().__init__()

    self.sample_size = sample_size
This should not be needed, ideally, because one can access it with self.config.sample_size, but ok!
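For context, a minimal sketch of what that comment refers to, assuming diffusers' ConfigMixin/register_to_config machinery (the class and argument below are made up for illustration, not the actual I2VGenXLUNet code):

from diffusers.configuration_utils import ConfigMixin, register_to_config
from diffusers.models.modeling_utils import ModelMixin


class TinyUNet(ModelMixin, ConfigMixin):
    config_name = "config.json"

    @register_to_config
    def __init__(self, sample_size: int = 32):
        super().__init__()
        # No `self.sample_size = sample_size` needed: @register_to_config already
        # records the argument, so it is readable as `self.config.sample_size`.


unet = TinyUNet(sample_size=64)
print(unet.config.sample_size)  # 64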
self.transformer_in = TransformerTemporalModel(
    num_attention_heads=8,
    attention_head_dim=num_attention_heads,
attention_head_dim=num_attention_heads,
attention_head_dim=num_attention_heads looks a bit weird - I think the class TransformerTemporalModel has bad naming here, no?
This came as a consequence of:
- [Feat] add I2VGenXL for image-to-video generation #6665 (comment)
- [Feat] add I2VGenXL for image-to-video generation #6665 (comment)
- [Feat] add I2VGenXL for image-to-video generation #6665 (comment)
See #6665 (comment) for more details.
Ohhh, I see where this is coming from. I think we should only keep the attention_head_dim argument for I2VGenXLUNet, and then we can do:
self.transformer_in = TransformerTemporalModel(
    num_attention_heads=8,
    attention_head_dim=attention_head_dim,
    in_channels=block_out_channels[0],
    num_layers=1,
    norm_num_groups=norm_num_groups,
)
cc @DN6 here
TransformerTemporalModel does not have a bad naming issue, and this is a new model, so we definitely do not want to introduce bad naming here (forcing the user to pass what is actually attention_head_dim as num_attention_heads is very much a bad naming practice).
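To make the distinction concrete, here is a small arithmetic sketch (the numbers are only illustrative):

hidden_dim = 1024
num_attention_heads = 16                                # how many attention heads
attention_head_dim = hidden_dim // num_attention_heads  # channels per head -> 64

# Passing the per-head dimension (64) where a head count is expected would silently
# build 64 heads of the wrong size, which is why the two names must not be conflated.
assert num_attention_heads * attention_head_dim == hidden_dim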
I think we should keep both, i.e., attention_head_dim and num_attention_heads, because if we only stick with attention_head_dim, we will end up with something like:
down_block = get_down_block(
    down_block_type,
    num_layers=layers_per_block,
    in_channels=input_channel,
    out_channels=output_channel,
    temb_channels=time_embed_dim,
    add_downsample=not is_final_block,
    resnet_eps=1e-05,
    resnet_act_fn="silu",
    resnet_groups=norm_num_groups,
    cross_attention_dim=cross_attention_dim,
    num_attention_heads=attention_head_dim[i],
    downsample_padding=1,
    dual_cross_attention=False,
)
See how num_attention_heads is being assigned here. This again looks pretty bad naming-wise.
See my comments here: https://github.com/huggingface/diffusers/pull/6665/files#r1477715567
I think we can still do:
down_block = get_down_block(
    down_block_type,
    num_layers=layers_per_block,
    in_channels=input_channel,
    out_channels=output_channel,
    temb_channels=time_embed_dim,
    add_downsample=not is_final_block,
    resnet_eps=1e-05,
    resnet_act_fn="silu",
    resnet_groups=norm_num_groups,
    cross_attention_dim=cross_attention_dim,
    num_attention_heads=num_attention_heads[i],
    downsample_padding=1,
    dual_cross_attention=False,
)
logger = logging.get_logger(__name__)  # pylint: disable=invalid-name


def _to_tensor(inputs, device):
Let's please not forget to clean this up after we merge (cc @sayakpaul @yiyixuxu).
This function is only used once - no need to keep it as a separate function.
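For illustration only, a hypothetical shape such a single-use helper might take (the actual body in this PR may differ); the point above is that logic like this can simply live at its one call site:

import torch


def _to_tensor(inputs, device):
    # Hypothetical body: convert array-like inputs to a tensor on the requested device.
    if not torch.is_tensor(inputs):
        inputs = torch.tensor(inputs)
    return inputs.to(device)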
layers_per_block: int = 2,
norm_num_groups: Optional[int] = 32,
cross_attention_dim: int = 1024,
num_attention_heads: Optional[Union[int, Tuple[int]]] = 64,
This should be attention_head_dim, no? @DN6 @sayakpaul
I don't think so. I am unable to find the comment from @patrickvonplaten, but look at how it's handled in UNet3D: we can basically get rid of attention_head_dim and use num_attention_heads throughout to rectify the incorrect naming.
Edit: found the comments.
This PR breaks a lot of existing pipelines, as it redefines norm_types from what is very commonly used (and was introduced a long time ago).
I welcome you to create an issue thread with a reproducible code snippet. We will fix it ASAP and, if needed, do a patch release :-)
I just did; I wanted to make a note here first (it took me a bit to trace the error down).
resnet_act_fn="silu",
resnet_groups=norm_num_groups,
cross_attention_dim=cross_attention_dim,
num_attention_heads=num_attention_heads[i],
So I think there is a mistake here: get_down_block() expects num_attention_heads to actually be the number of attention heads. Look at how num_attention_heads is handled inside UNet3DConditionModel:
First, it fixes the "bad" model config name as soon as it receives the argument, by doing num_attention_heads = attention_head_dim here, so at this point the variable num_attention_heads is "corrected" for the rest of the code:
num_attention_heads = num_attention_heads or attention_head_dim
From there, we can just pass it around as-is, e.g. it is passed as num_attention_heads in get_down_block():
num_attention_heads=num_attention_heads[i],
In our case, we:
- also have the "bad" model config, i.e. for this model num_attention_heads is actually attention_head_dim
- however, we only "correct" it for TransformerTemporalModel here, so for the rest of the code num_attention_heads still means attention_head_dim
- and that's incorrect: e.g. we pass "attention_head_dim" as num_attention_heads to get_down_block()
self.transformer_in = TransformerTemporalModel(
    num_attention_heads=8,
    attention_head_dim=num_attention_heads,
    in_channels=block_out_channels[0],
    num_layers=1,
    norm_num_groups=norm_num_groups,
)
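For reference, a minimal sketch of the "correct it at the boundary" pattern described above (the class and defaults below are assumptions for illustration, not the merged code):

class SketchUNet:
    def __init__(self, attention_head_dim=64, num_attention_heads=None):
        # Normalize the historical name once: if callers only pass attention_head_dim
        # (which in these checkpoints actually holds the head count), reuse it as
        # num_attention_heads. Everything below this line sees only the corrected name.
        num_attention_heads = num_attention_heads or attention_head_dim
        self.num_attention_heads = num_attention_heads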
I am quite confused now.
First, it fixes the "bad" model config name as soon as it receives the argument, by doing num_attention_heads = attention_head_dim, so at this point the variable num_attention_heads is "corrected" for the rest of the code.
This is what we didn't want to do in this UNet, i.e., the comments in #6665 (comment) suggested that there shouldn't be any attention_head_dim. We want to get rid of those corrections, made in UNet3DConditionModel, for num_attention_heads.
Do you mean we initialize another new variable called attention_head_dim based on what's passed to num_attention_heads?
What does this PR do?
Fixes: #6186.
Test code
For memory optimization, use enable_model_cpu_offload().
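A rough usage sketch (not the exact test snippet from this PR; the checkpoint id, image path, prompt, and call arguments are assumptions):

import torch
from diffusers import I2VGenXLPipeline
from diffusers.utils import export_to_gif, load_image

# Assumed checkpoint id; replace with the final pipeline ID once the model card is up.
pipeline = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()  # the memory optimization mentioned above

image = load_image("path/or/url/to/your/image.png")
frames = pipeline(
    prompt="a red panda walking through a bamboo forest",
    image=image,
    num_inference_steps=50,
    guidance_scale=9.0,
    generator=torch.manual_seed(0),
).frames[0]
export_to_gif(frames, "i2vgen_xl_sample.gif")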
TODO
- einops dependency