
[Feat] add I2VGenXL for image-to-video generation #6665

Merged: 184 commits merged into main from convert-i2vgen-xl on Jan 31, 2024.

Conversation

@sayakpaul (Member) commented Jan 22, 2024

What does this PR do?

Fixes: #6186.

Test code

import torch
from diffusers import I2VGenXLPipeline
from diffusers.utils import load_image, export_to_gif

# repo is currently private to the diffusers team
repo_id = "diffusers/i2vgen-xl" # TODO change checkpoint path after move
pipeline = I2VGenXLPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, variant="fp16").to("cuda")

image_url = "https://github.com/ali-vilab/i2vgen-xl/blob/main/data/test_images/img_0009.png?download=true"
image = load_image(image_url).convert("RGB")
prompt = "Papers were floating in the air on a table in the library"

generator = torch.manual_seed(8888)
frames = pipeline(
    prompt=prompt,
    image=image,
    generator=generator
).frames[0]

print(export_to_gif(frames))

For memory optimization, use enable_model_cpu_offload().
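
For reference, a minimal sketch of how that call slots into the test snippet above (same placeholder repo id; with model CPU offload enabled, skip the explicit .to("cuda")):

    import torch
    from diffusers import I2VGenXLPipeline

    # Sketch: load in fp16 and let diffusers shuttle submodules to the GPU on demand
    # instead of keeping the whole pipeline resident there.
    pipeline = I2VGenXLPipeline.from_pretrained(
        "diffusers/i2vgen-xl", torch_dtype=torch.float16, variant="fp16"
    )
    pipeline.enable_model_cpu_offload()  # replaces the .to("cuda") call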

TODO

  • Debug video quality
  • Docs
  • Tests
  • Eliminate einops dependency
  • General cleanup


@DN6 (Collaborator) commented Jan 31, 2024

@yiyixuxu Agreed that those embeddings can be differentiated better. Cool if I open up a follow-up PR to clean it up?

@sayakpaul (Member, Author)

> @yiyixuxu Agreed that those embeddings can be differentiated better. Cool if I open up a follow-up PR to clean it up?

Which comment is being referred to here?

@patrickvonplaten (Contributor)

Still a couple of failing tests here.

@sayakpaul (Member, Author)

@DN6 could you look into it once?

@yiyixuxu (Collaborator)

> @yiyixuxu Agreed that those embeddings can be differentiated better. Cool if I open up a follow-up PR to clean it up?

Sure!

@sayakpaul (Member, Author)

Could I please see the link to the comment that is being referred to here? I am really unable to make sense of what’s meant by embedding cleanup.

@DN6 (Collaborator) commented Jan 31, 2024

> Maybe we should rename the tensor to something like image_context_emb so that it does not have almost the same name as the module layer :) A little bit hard to read like this for me. Just a suggestion, not super important though! :)
>
> Same goes for the context_embeddings/context_embedding.

@sayakpaul This comment. But I addressed it, and the tests should also be fixed.
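
For illustration only, the readability issue being described is roughly the following; the class and shapes here are hypothetical, not the actual pipeline code:

    import torch
    from torch import nn

    class Example(nn.Module):
        def __init__(self):
            super().__init__()
            self.context_embedding = nn.Linear(16, 32)  # module layer

        def forward(self, image_embeds: torch.Tensor) -> torch.Tensor:
            # Hard to tell apart at a glance (tensor vs. layer with almost the same name):
            #   context_embeddings = self.context_embedding(image_embeds)
            # Renaming the tensor, e.g. to image_context_emb, keeps the two distinct:
            image_context_emb = self.context_embedding(image_embeds)
            return image_context_emb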

@sayakpaul (Member, Author)

Cool thanks!

I think we still need to update the pipeline IDs here as well as the model card. We can then merge!

    ff_output = gate_mlp.unsqueeze(1) * ff_output
-elif self.use_ada_layer_norm_single:
+elif self.norm_type == "ada_norm_single":

Contributor: Great job here! Much cleaner now :-)

layers_per_block: int = 2,
norm_num_groups: Optional[int] = 32,
cross_attention_dim: int = 1024,
num_attention_heads: Optional[Union[int, Tuple[int]]] = 64,
Contributor: nice clean up!

):
    super().__init__()

    self.sample_size = sample_size

Contributor: should not be needed ideally because one can access it with self.config.sample_size, but ok!


self.transformer_in = TransformerTemporalModel(
    num_attention_heads=8,
    attention_head_dim=num_attention_heads,

Contributor: attention_head_dim=num_attention_heads looks a bit weird - I think the class TransformerTemporalModel has bad naming here, no?


Collaborator: Ohhh, I see where this is coming from. I think we should only keep the attention_head_dim argument for I2VGenXLUNet, and then we can do:

self.transformer_in = TransformerTemporalModel(
            num_attention_heads=8,
            attention_head_dim=attention_head_dim,
            in_channels=block_out_channels[0],
            num_layers=1,
            norm_num_groups=norm_num_groups,
        )

Collaborator: cc @DN6 here

Collaborator: TransformerTemporalModel does not have a bad naming issue, and this is a new model, so we definitely do not want to introduce the bad naming here (forcing the user to pass what is actually attention_head_dim as num_attention_heads is very much a bad naming practice).

Member (Author): I think we should keep both, i.e., attention_head_dim and num_attention_heads, because if we only stick with attention_head_dim, we will end up with something like:

down_block = get_down_block(
    down_block_type,
    num_layers=layers_per_block,
    in_channels=input_channel,
    out_channels=output_channel,
    temb_channels=time_embed_dim,
    add_downsample=not is_final_block,
    resnet_eps=1e-05,
    resnet_act_fn="silu",
    resnet_groups=norm_num_groups,
    cross_attention_dim=cross_attention_dim,
    num_attention_heads=attention_head_dim[i],
    downsample_padding=1,
    dual_cross_attention=False,
)

See how num_attention_heads is being assigned here. This again looks pretty bad naming-wise.

Collaborator: See my comments here: https://github.com/huggingface/diffusers/pull/6665/files#r1477715567

I think we can still do:

down_block = get_down_block(
    down_block_type,
    num_layers=layers_per_block,
    in_channels=input_channel,
    out_channels=output_channel,
    temb_channels=time_embed_dim,
    add_downsample=not is_final_block,
    resnet_eps=1e-05,
    resnet_act_fn="silu",
    resnet_groups=norm_num_groups,
    cross_attention_dim=cross_attention_dim,
    num_attention_heads=num_attention_heads[i],
    downsample_padding=1,
    dual_cross_attention=False,
)

logger = logging.get_logger(__name__) # pylint: disable=invalid-name


def _to_tensor(inputs, device):
Contributor: Let's please not forget to clean this up after we merge (cc @sayakpaul @yiyixuxu). This function is only used once - no need to move it into a function.

@yiyixuxu yiyixuxu merged commit 04cd6ad into main Jan 31, 2024
@yiyixuxu yiyixuxu deleted the convert-i2vgen-xl branch January 31, 2024 20:39
layers_per_block: int = 2,
norm_num_groups: Optional[int] = 32,
cross_attention_dim: int = 1024,
num_attention_heads: Optional[Union[int, Tuple[int]]] = 64,
Collaborator: this should be attention_head_dim, no? @DN6 @sayakpaul

@sayakpaul (Member, Author) commented Feb 1, 2024

I don't think so. I am unable to find the comment from @patrickvonplaten, but look at how it's handled in UNet3D:

https://github.com/huggingface/diffusers/blob/c3369f56734beef9c4768d04cce490fdcc1c9162/src/diffusers/models/unets/unet_3d_condition.py#L130C1-L141C72

We can basically get rid of attention_head_dim and use num_attention_heads throughout to rectify the incorrect naming.

Edit: Found the comments:

dg845 pushed a commit to dg845/diffusers that referenced this pull request Feb 2, 2024
@vladmandic (Contributor)

This PR breaks a lot of existing pipelines, as it redefines the norm types away from what has been very commonly used (and was introduced a long time ago). For example, StableDiffusionReferencePipeline now fails with:

AttributeError: 'BasicTransformerBlock' object has no attribute 'use_ada_layer_norm'
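
For context, the refactor replaced the per-variant boolean flags on BasicTransformerBlock (such as use_ada_layer_norm) with a single norm_type string, as visible in the diff above. A hedged sketch of how downstream code could check both forms; the "ada_norm" value mirrors the "ada_norm_single" rename and is an assumption, not confirmed API:

    def uses_ada_layer_norm(block) -> bool:
        # Pre-refactor: the block exposed a boolean flag.
        flag = getattr(block, "use_ada_layer_norm", None)
        if flag is not None:
            return bool(flag)
        # Post-refactor: fall back to the norm_type string (assumed value).
        return getattr(block, "norm_type", None) == "ada_norm"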

@sayakpaul (Member, Author)

I welcome you to create an issue thread with a reproducible code snippet. We will fix it ASAP and, if needed, do a patch release :-)

@vladmandic (Contributor) commented Feb 3, 2024

I just did; I wanted to make a note here first (it took me a bit to trace the error down).

resnet_act_fn="silu",
resnet_groups=norm_num_groups,
cross_attention_dim=cross_attention_dim,
num_attention_heads=num_attention_heads[i],
@yiyixuxu (Collaborator) commented Feb 5, 2024

So I think there is a mistake here: get_down_block() expects num_attention_heads to actually be the number of attention heads.

Look at how num_attention_heads is handled inside UNet3DConditionModel:

First, it fixes the "bad" model config name as soon as it receives the argument, so from that point on the variable num_attention_heads is "corrected" for the rest of the code:

num_attention_heads = num_attention_heads or attention_head_dim

From there, we can just pass it around as it is, e.g. it is passed as num_attention_heads in get_down_block():

num_attention_heads=num_attention_heads[i],

In our case, we:

  1. also have the "bad" model config, i.e. for this model num_attention_heads is actually attention_head_dim;
  2. however, we only "correct" it for TransformerTemporalModel (shown below), so for the rest of the code num_attention_heads still means attention_head_dim - and that's incorrect, e.g. we passed "attention_head_dim" as num_attention_heads to get_down_block().
        self.transformer_in = TransformerTemporalModel(
            num_attention_heads=8,
            attention_head_dim=num_attention_heads,
            in_channels=block_out_channels[0],
            num_layers=1,
            norm_num_groups=norm_num_groups,
        )
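
To make the distinction concrete, here is the generic relationship between the two names that this discussion assumes (plain attention arithmetic, not the UNet code itself):

    # The hidden size of an attention layer is split across heads:
    num_attention_heads = 8
    attention_head_dim = 64
    inner_dim = num_attention_heads * attention_head_dim  # 512

    # The problem described above: this model's config stores the per-head dimension
    # under the name num_attention_heads, so forwarding it unchanged to
    # get_down_block(num_attention_heads=...) silently swaps the two quantities.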

Member (Author): I am quite confused now.

> First, it fixes the "bad" model config name as soon as it receives the argument, so from that point on the variable num_attention_heads is "corrected" for the rest of the code.

This is what we didn't want to do in this UNet, i.e., the comments in #6665 (comment) suggested that there shouldn't be any attention_head_dim. We want to get rid of those corrections made in UNet3DConditionModel for num_attention_heads.

Do you mean we initialize another new variable called attention_head_dim based on what's passed to num_attention_heads?
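
If it helps, the option being asked about would look roughly like this; a hypothetical sketch of the __init__ entry point, not the merged code:

    class SketchUNet:
        # Hypothetical: keep the (badly named) public argument for config compatibility,
        # but immediately introduce a correctly named internal variable so the rest of
        # __init__ reads unambiguously.
        def __init__(self, num_attention_heads=64):
            attention_head_dim = num_attention_heads  # the config value is really the per-head dim
            self.attention_head_dim = attention_head_dim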

AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024
Labels: video (video generation)
Linked issues that may be closed by merging this PR: i2vgen-xl implement
6 participants