Tweak sequence tokenization #10904

crusaderky · 2024-02-06T19:12:08Z

Allow calling normalize_token() outside of tokenize(). This is useful for debugging purposes.
Additionally, polish some tests in test_tokenize.

github-actions · 2024-02-06T20:14:03Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0 | 15 suites ±0 | 3h 13m 3s ⏱️ -6m 13s
12 988 tests +1 | 12 059 ✅ +1 | 929 💤 ±0 | 0 ❌ ±0
160 507 runs +15 | 144 002 ✅ +16 | 16 505 💤 -1 | 0 ❌ ±0

Results for commit 097a48a. ± Comparison against base commit f51fa77.

This pull request removes 1 and adds 2 tests. Note that renamed tests count towards both.

Removed:
dask.tests.test_tokenize ‑ test_normalize_function

Added:
dask.tests.test_tokenize ‑ test_tokenize_composite_functions
dask.tests.test_tokenize ‑ test_tokenize_numpy_array

phofl · 2024-02-08T12:42:27Z

dask/base.py

+        seen = _seen.get()
+        tok = None
+    except LookupError:
+        # This is for debug only, for when normalize_token is called outside of


This seems odd, do we really want to keep this?

crusaderky · 2024-02-08

It is extremely useful for figuring out why tokens have changed when something goes wrong. I'm already using it in dask/distributed#8185.

phofl · 2024-02-08T13:49:27Z

thx

Tweak sequence tokenization

e4afc28

crusaderky self-assigned this Feb 6, 2024

fix test

097a48a

crusaderky marked this pull request as ready for review February 6, 2024 19:42

crusaderky mentioned this pull request Feb 6, 2024

Tokenization meta-issue #10905

Closed

phofl reviewed Feb 8, 2024


phofl approved these changes Feb 8, 2024


phofl merged commit 1cee596 into dask:main Feb 8, 2024
27 of 28 checks passed

crusaderky deleted the tokenize_more branch February 8, 2024 14:26

rjzamora mentioned this pull request Feb 28, 2024

Introduce basic "cudf" backend for Dask Expressions rapidsai/cudf#14805

Merged

2 tasks
