nltk.tokenize.casual.TweetTokenizer: Add support to tokenize emoji flag sequences #3034

Merged
3 commits merged into nltk:develop on Sep 1, 2022

Conversation

essandess
Contributor

nltk.tokenize.casual.TweetTokenizer: Add support to tokenize emoji flag sequences

  • TweetTokenizer currently splits emoji flags into individual tokens of enclosed letters, e.g. 🇨🇦 -> '🇨 ', '🇦 '
  • This patch keeps emoji flag sequences intact
  • Without this PR:

```python
nltk.tokenize.casual.TweetTokenizer().tokenize(text='Hi 🇨🇦, 😍!!')
['Hi', '🇨', '🇦', ',', '😍', '!', '!']
```

  • With this PR:

```python
nltk.tokenize.casual.TweetTokenizer().tokenize(text='Hi 🇨🇦, 😍!!')
['Hi', '🇨🇦', ',', '😍', '!', '!']
```
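For readers who just want the gist of the change, here is a minimal sketch of the idea. This is not the actual NLTK patch (the pattern name and the deliberately simplified word regex are assumptions); it only illustrates how matching two adjacent regional indicator symbols (U+1F1E6–U+1F1FF) ahead of the single-character fallback keeps a flag in one piece:

```python
import re

# Hypothetical sketch, not the actual TweetTokenizer regex:
# two adjacent regional indicator symbols form one token.
FLAG_PAIR = r"[\U0001F1E6-\U0001F1FF]{2}"

# Simplified tokenizer: flag pairs first, then words, then any other
# single non-space character (punctuation, lone emoji, ...).
WORD_RE = re.compile(rf"{FLAG_PAIR}|\w+|[^\w\s]")

print(WORD_RE.findall("Hi 🇨🇦, 😍!!"))
# ['Hi', '🇨🇦', ',', '😍', '!', '!']
```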

@tomaarsen
Member

See #2578 (comment) for some related discussion. In short, I am in favour of adding this behaviour, but I would like to start a discussion on whether it would be wise to include an option to turn it off in TweetTokenizer.__init__.

After all, as you are aware, flags are made up of regional indicator symbols. These symbols are also frequently used outside of the context of flags, like to spell out words: 🇦 🇵 🇵 🇱 🇪. However, this is commonly done with spaces between the tokens, as otherwise that spelling of "APPLE" would have had a Polish flag in the middle. In short, it may not be an issue.
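To make that concrete, here is a quick illustration (again a hypothetical pair-matching pattern, not the NLTK source): spaced regional indicators stay single, while adjacent ones are paired greedily, which is where the unintended Polish flag would show up:

```python
import re

# Hypothetical pattern: prefer a pair of regional indicators, else a single one.
RI = re.compile(r"[\U0001F1E6-\U0001F1FF]{2}|[\U0001F1E6-\U0001F1FF]")

print(RI.findall("🇦 🇵 🇵 🇱 🇪"))  # spaced "APPLE": five single symbols
# ['🇦', '🇵', '🇵', '🇱', '🇪']
print(RI.findall("🇦🇵🇵🇱🇪"))      # adjacent: pairs, including 🇵🇱 (Poland)
# ['🇦🇵', '🇵🇱', '🇪']
```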

Beyond that discussion, would you be able to write a few tests to back up that your implementation works as intended? For example in here: https://github.com/nltk/nltk/blob/develop/nltk/test/unit/test_tokenize.py

Also, I'm uncertain why the CI is failing. It seems like a third party download of a file used in one of the tests has failed:

```
curl: (28) Failed to connect to www.cs.unm.edu port 443: Connection timed out
```

It's unrelated to your PR, though, so no stress.

@essandess
Contributor Author

> would you be able to write a few tests to back up that your implementation works as intended? For example in here: https://github.com/nltk/nltk/blob/develop/nltk/test/unit/test_tokenize.py

Yes, I'll modify that file as part of this PR.

Given your comments, what do you think of the decision to use a regex that encodes any pair of adjacent enclosed letters as a single token? Under this behavior, your example is tokenized as follows. Is this "correct"? Or should this repo keep a record of all ISO country-code letter pairs and treat those separately in a much more complicated regex? My preference is to avoid that.

```python
nltk.tokenize.casual.TweetTokenizer().tokenize('🇦🇵🇵🇱🇪')
['🇦🇵', '🇵🇱', '🇪']
```

@tomaarsen
Member

I also prefer your current implementation, i.e. any two directly adjacent regional indicator symbols. This is also more future-proof. The behaviour from your little snippet seems like the expected output as well. 👍

@essandess
Contributor Author

> I also prefer your current implementation, i.e. any two directly adjacent regional indicator symbols. This is also more future-proof. The behaviour from your little snippet seems like the expected output as well. 👍

Great, implemented with a test in test_tokenize.py.
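For reference, a minimal sketch of what such a test could look like; the test name and assertion style here are assumptions, and the merged version lives in nltk/test/unit/test_tokenize.py:

```python
from nltk.tokenize.casual import TweetTokenizer

def test_tweet_tokenizer_keeps_flag_sequences():
    tokenizer = TweetTokenizer()
    # Flag emoji (two adjacent regional indicator symbols) stay as one token.
    assert tokenizer.tokenize("Hi 🇨🇦, 😍!!") == ["Hi", "🇨🇦", ",", "😍", "!", "!"]
    # Adjacent regional indicators are paired greedily, left to right.
    assert tokenizer.tokenize("🇦🇵🇵🇱🇪") == ["🇦🇵", "🇵🇱", "🇪"]
```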

essandess and others added 2 commits September 1, 2022 12:18
nltk.tokenize.casual.TweetTokenizer: Add support to tokenize emoji flag sequences

* TweetTokenizer currently splits emoji flags into individual tokens of enclosed letters, e.g. 🇨🇦 -> '🇨 ', '🇦 '
* This patch keeps emoji flag sequences intact
* Without this PR:
> ```python
> nltk.tokenize.casual.TweetTokenizer().tokenize(text='Hi 🇨🇦, 😍!!')
> ['Hi', '🇨', '🇦', ',', '😍', '!', '!']
> ```
* With this PR:
> ```python
> nltk.tokenize.casual.TweetTokenizer().tokenize(text='Hi 🇨🇦, 😍!!')
> ['Hi', '🇨🇦', ',', '😍', '!', '!']
> ```
@tomaarsen
Member

I've expanded on your test case a little bit, and fixed a merge conflict on the AUTHORS.md file. Should be all set now!
Thank you for this work!

@tomaarsen tomaarsen merged commit c580f12 into nltk:develop Sep 1, 2022
@essandess essandess deleted the tokenize_flags branch September 1, 2022 22:46