
Add SynthID (watermarking by Google DeepMind) #34350

Merged
merged 52 commits into from
Oct 23, 2024
Conversation

gante
Member

@gante gante commented Oct 23, 2024

What does this PR do?

Adds SynthID, a watermarking technique by Google DeepMind.

https://deepmind.google/technologies/synthid/

Applying the watermark and running a detector are added to transformers. Training a detector is added as a research project.
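As a rough intuition (a toy sketch, not the actual SynthID algorithm): this style of watermarking biases sampling toward tokens whose keyed pseudo-random "g-values" are high, and a detector scores text by how far its mean g-value sits above chance. The `g_values` lookup below is a hypothetical stand-in:

```python
import torch

def g_values(token_ids: torch.Tensor, key: int = 42) -> torch.Tensor:
    # Hypothetical keyed lookup standing in for a g-value function:
    # deterministic pseudo-random bits per token, seeded by a secret key.
    gen = torch.Generator().manual_seed(key)
    table = torch.randint(0, 2, (1000,), generator=gen)
    return table[token_ids % 1000].float()

def detect_score(token_ids: torch.Tensor) -> float:
    # Watermarked text is biased toward g-value 1, so its mean g-value
    # should sit above the 0.5 chance level of unwatermarked text.
    return g_values(token_ids).mean().item()
```

The Bayesian detector in this PR learns calibrated scores rather than taking a raw mean, but the signal it reads is this same kind of keyed statistic.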

sumedhghaisas2 and others added 30 commits August 19, 2024 12:44

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
# Add PT version of the bayesian detector

Rebase

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

gante added 4 commits October 23, 2024 15:58
BC
@gante gante marked this pull request as ready for review October 23, 2024 16:18
@gante gante requested a review from ArthurZucker October 23, 2024 16:18
gante added 4 commits October 23, 2024 16:23
Collaborator

@ArthurZucker ArthurZucker left a comment


First pass! Looks really good!

return all_masks, all_g_values


def tpr_at_fpr(detector, detector_inputs, w_true, minibatch_size, target_fpr=0.01) -> torch.Tensor:
Collaborator


I have no idea what tpr and fpr mean; let's either be explicit or add a small docstring

Member Author


expanded docstring 👍
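For readers with the same question: TPR is the true positive rate (fraction of watermarked text the detector flags) and FPR the false positive rate (fraction of unwatermarked text wrongly flagged). A toy sketch of the idea behind such a helper (the PR's version runs the detector in minibatches; here scores are assumed precomputed, and the names are illustrative):

```python
import torch

def tpr_at_fpr(scores: torch.Tensor, is_watermarked: torch.Tensor, target_fpr: float = 0.01) -> torch.Tensor:
    """True positive rate at a fixed false positive rate.

    Picks the score threshold such that at most `target_fpr` of the
    unwatermarked examples score above it, then returns the fraction of
    watermarked examples above that threshold.
    """
    negatives = scores[~is_watermarked]
    threshold = torch.quantile(negatives, 1.0 - target_fpr)
    return (scores[is_watermarked] > threshold).float().mean()
```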

Comment on lines +300 to +301
self.beta = torch.nn.Parameter(-2.5 + 0.001 * torch.randn(1, 1, watermarking_depth))
self.delta = torch.nn.Parameter(0.001 * torch.randn(1, 1, self.watermarking_depth, watermarking_depth))
Collaborator


these are usually initialized in the _init_weights function rather than here!

Collaborator


Here we init with zeros or empty

Member Author


It's not common, but we do have this pattern in other places (e.g.)

(I also have no idea how to set this specific initialization in _init_weights 😅 )


# [batch_size, seq_len, watermarking_depth]
# Long tensor doesn't work with einsum, so we need to switch to the same dtype as self.delta (FP32)
logits = torch.einsum("ijkl,ijkl->ijk", self.delta, x.type(self.delta.dtype)) + self.beta
Collaborator


would be a lot better if we can avoid einsums! 🤗

(i, j, k, l) x (i, j, k, l) -> (i, j, k)

would be:

(i, j, k, 1, l) x (i, j, k, l, 1) -> (i, j, k, 1)

so:

self.delta[..., None, :] @ x.transpose(-2, -1)[..., None]

Member Author


Good idea!

(the correct form is then (self.delta[..., None, :] @ x[..., None]).squeeze())
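The equivalence is easy to sanity-check numerically (shapes assumed from the snippet above; the (1, 1, d, d) parameter is expanded so both forms see full shapes):

```python
import torch

batch, seq_len, depth = 2, 5, 3
# Expand the (1, 1, depth, depth) parameter to full shape, as broadcasting would.
delta = torch.randn(1, 1, depth, depth).expand(batch, seq_len, depth, depth)
x = torch.randn(batch, seq_len, depth, depth)

# Einsum form: elementwise product over the last axis, summed.
via_einsum = torch.einsum("ijkl,ijkl->ijk", delta, x)

# Matmul form: (..., 1, l) @ (..., l, 1) -> (..., 1, 1), squeezed back.
via_matmul = (delta[..., None, :] @ x[..., None]).squeeze(-1).squeeze(-1)

assert torch.allclose(via_einsum, via_matmul, atol=1e-6)
```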

gante added 12 commits October 23, 2024 17:11
Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks for updating! 🤗

@gante gante merged commit b0f0c61 into main Oct 23, 2024
26 checks passed
@gante gante deleted the synthid branch October 23, 2024 20:18
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* Add SynthIDTextWatermarkLogitsProcessor

* Resolving comments.

* Resolving comments.

* Resolving commits.

* Improving SynthIDWatermark tests.

* switch to PT version

* detector as pretrained model + style

* update training + style

* rebase

* Update logits_process.py

* Improving SynthIDWatermark tests.

* Shift detector training to wikitext negatives and stabilize with lower learning rate.

* Clean up.

* in for 7B

* cleanup

* Support python 3.8.

* README and final cleanup.

* HF Hub upload and initialize.

* Update requirements for synthid_text.

* Adding SynthIDTextWatermarkDetector.

* Detector testing.

* Documentation changes.

* Copyrights fix.

* Fix detector api.

* ironing out errors

* ironing out errors

* training checks

* make fixup and make fix-copies

* docstrings and add to docs

* copyright

* BC

* test docstrings

* move import

* protect type hints

* top level imports

* watermarking example

* direct imports

* tpr fpr meaning

* process_kwargs

* SynthIDTextWatermarkingConfig docstring

* assert -> exception

* example updates

* no immutable dict (cant be serialized)

* pack fn

* einsum equivalent

* import order

* fix test on gpu

* add detector example

---------

Co-authored-by: Sumedh Ghaisas <sumedhg@google.com>
Co-authored-by: Marc Sun <marc@huggingface.co>
Co-authored-by: sumedhghaisas2 <138781311+sumedhghaisas2@users.noreply.github.com>
Co-authored-by: raushan <raushan@huggingface.co>