nltk.tokenize.casual.TweetTokenizer: Add support to tokenize emoji flag sequences #3034

Merged
3 commits merged into nltk:develop on Sep 1, 2022

Conversation

essandess
Contributor

nltk.tokenize.casual.TweetTokenizer: Add support to tokenize emoji flag sequences

  • TweetTokenizer currently splits emoji flags into individual tokens of enclosed letters, e.g. 🇨🇦 -> '🇨 ', '🇦 '
  • This patch keeps emoji flag sequences intact
  • Without this PR:

```python
nltk.tokenize.casual.TweetTokenizer().tokenize(text='Hi 🇨🇦, 😍!!')
['Hi', '🇨', '🇦', ',', '😍', '!', '!']
```

  • With this PR:

```python
nltk.tokenize.casual.TweetTokenizer().tokenize(text='Hi 🇨🇦, 😍!!')
['Hi', '🇨🇦', ',', '😍', '!', '!']
```
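For readers who just want the gist of the change, here is a minimal sketch of the idea. This is not the actual NLTK patch (the pattern name and the deliberately simplified word regex are assumptions); it only illustrates how matching two adjacent regional indicator symbols (U+1F1E6–U+1F1FF) ahead of the single-character fallback keeps a flag in one piece:

```python
import re

# Hypothetical sketch, not the actual TweetTokenizer regex:
# two adjacent regional indicator symbols form one token.
FLAG_PAIR = r"[\U0001F1E6-\U0001F1FF]{2}"

# Simplified tokenizer: flag pairs first, then words, then any other
# single non-space character (punctuation, lone emoji, ...).
WORD_RE = re.compile(rf"{FLAG_PAIR}|\w+|[^\w\s]")

print(WORD_RE.findall("Hi 🇨🇦, 😍!!"))
# ['Hi', '🇨🇦', ',', '😍', '!', '!']
```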

@tomaarsen
Member

See #2578 (comment) for some related discussion. In short, I am in favour of adding this behaviour, but I would like to start a discussion on whether it would be wise to include an option to turn it off in TweetTokenizer.__init__.

After all, as you are aware, flags are made up of regional indicator symbols. These symbols are also frequently used outside of the context of flags, like to spell out words: 🇦 🇵 🇵 🇱 🇪. However, this is commonly done with spaces between the tokens, as otherwise that spelling of "APPLE" would have had a Polish flag in the middle. In short, it may not be an issue.
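To make that concrete, here is a quick illustration (again a hypothetical pair-matching pattern, not the NLTK source): spaced regional indicators stay single, while adjacent ones are paired greedily, which is where the unintended Polish flag would show up:

```python
import re

# Hypothetical pattern: prefer a pair of regional indicators, else a single one.
RI = re.compile(r"[\U0001F1E6-\U0001F1FF]{2}|[\U0001F1E6-\U0001F1FF]")

print(RI.findall("🇦 🇵 🇵 🇱 🇪"))  # spaced "APPLE": five single symbols
# ['🇦', '🇵', '🇵', '🇱', '🇪']
print(RI.findall("🇦🇵🇵🇱🇪"))      # adjacent: pairs, including 🇵🇱 (Poland)
# ['🇦🇵', '🇵🇱', '🇪']
```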

Beyond that discussion, would you be able to write a few tests to back up that your implementation works as intended? For example in here: https://github.com/nltk/nltk/blob/develop/nltk/test/unit/test_tokenize.py

Also, I'm uncertain why the CI is failing. It seems like a third party download of a file used in one of the tests has failed:

```
curl: (28) Failed to connect to www.cs.unm.edu port 443: Connection timed out
```

It's unrelated to your PR, though, so no stress.

@essandess
Contributor Author

> would you be able to write a few tests to back up that your implementation works as intended? For example in here: https://github.com/nltk/nltk/blob/develop/nltk/test/unit/test_tokenize.py

Yes, I'll modify that file as part of this PR.

Given your comments, what do you think of the decision to use a regex that encodes any pair of adjacent enclosed letters as a single token? Under this behavior, your example is tokenized as follows. Is this "correct"? Or should this repo keep a record of all ISO country-code letter pairs and treat those separately in a much more complicated regex? My preference is to avoid that.

```python
nltk.tokenize.casual.TweetTokenizer().tokenize('🇦🇵🇵🇱🇪')
['🇦🇵', '🇵🇱', '🇪']
```

@tomaarsen
Member

I also prefer your current implementation, i.e. any two directly adjacent regional indicator symbols. This is also more future-proof. The behaviour from your little snippet seems like the expected output as well. 👍

@essandess
Contributor Author

> I also prefer your current implementation, i.e. any two directly adjacent regional indicator symbols. This is also more future-proof. The behaviour from your little snippet seems like the expected output as well. 👍

Great, implemented with a test in test_tokenize.py.
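For reference, a minimal sketch of what such a test could look like; the test name and assertion style here are assumptions, and the merged version lives in nltk/test/unit/test_tokenize.py:

```python
from nltk.tokenize.casual import TweetTokenizer

def test_tweet_tokenizer_keeps_flag_sequences():
    tokenizer = TweetTokenizer()
    # Flag emoji (two adjacent regional indicator symbols) stay as one token.
    assert tokenizer.tokenize("Hi 🇨🇦, 😍!!") == ["Hi", "🇨🇦", ",", "😍", "!", "!"]
    # Adjacent regional indicators are paired greedily, left to right.
    assert tokenizer.tokenize("🇦🇵🇵🇱🇪") == ["🇦🇵", "🇵🇱", "🇪"]
```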

essandess and others added 2 commits September 1, 2022 12:18
nltk.tokenize.casual.TweetTokenizer: Add support to tokenize emoji flag sequences

* TweetTokenizer currently splits emoji flags into individual tokens of enclosed letters, e.g. 🇨🇦 -> '🇨 ', '🇦 '
* This patch keeps emoji flag sequences intact
* Without this PR:
> ```python
> nltk.tokenize.casual.TweetTokenizer().tokenize(text='Hi 🇨🇦, 😍!!')
> ['Hi', '🇨', '🇦', ',', '😍', '!', '!']
> ```
* With this PR:
> ```python
> nltk.tokenize.casual.TweetTokenizer().tokenize(text='Hi 🇨🇦, 😍!!')
> ['Hi', '🇨🇦', ',', '😍', '!', '!']
> ```
@tomaarsen
Member

I've expanded on your test case a little bit, and fixed a merge conflict on the AUTHORS.md file. Should be all set now!
Thank you for this work!

@tomaarsen tomaarsen merged commit c580f12 into nltk:develop Sep 1, 2022
@essandess essandess deleted the tokenize_flags branch September 1, 2022 22:46