Resolve the last failing doctests, add doctests to CI #3054

tomaarsen · 2022-09-27T08:14:02Z

Resolves #2989

Hello!

Pull request overview

Fix all tests from this list:

====================================================================================== short test summary info ====================================================================================== 
FAILED nltk/text.py::nltk.text.Text.collocation_list
FAILED nltk/draw/table.py::nltk.draw.table.MultiListbox.__init__
FAILED nltk/draw/table.py::nltk.draw.table.MultiListbox.configure
FAILED nltk/draw/table.py::nltk.draw.table.Table
FAILED nltk/parse/corenlp.py::nltk.parse.corenlp.CoreNLPDependencyParser
FAILED nltk/parse/corenlp.py::nltk.parse.corenlp.CoreNLPParser
FAILED nltk/parse/corenlp.py::nltk.parse.corenlp.GenericCoreNLPParser.tag
FAILED nltk/parse/corenlp.py::nltk.parse.corenlp.GenericCoreNLPParser.tokenize
FAILED nltk/parse/transitionparser.py::nltk.parse.transitionparser.demo
FAILED nltk/tokenize/punkt.py::nltk.tokenize.punkt.PunktSentenceTokenizer._match_potential_end_contexts
========================================================== 10 failed, 689 passed, 29 skipped, 9 xfailed, 22 warnings in 143.19s (0:02:23) ===========================================================

Add automatic doctests to the CI, making us go from 489 to 686 automatically passing tests 🎉
Update requirements-ci.txt with click, python-crfsuite and tqdm as they are required for doctests.
Skip additional doctests for tkinter, as tkinter.Tk() requires a display to exist.
Replace OSError with LookupError when SENNA can't be found.
Resolve breaking issue in MultiListBox that prevents it from being instantiated.
Update the Stanford CoreNLP version used in the CI from full 2018-10-05 to the most recent 4.5.1.
Void the third party CI cache every time tools/github_actions/third-party.sh is updated.
Add a check_jar function which calls pytest.skip if a jar is unavailable, and otherwise does nothing.

Details

The changes are primarily very small and few and far between, so I'm going to add comments to the changed code rather than describing the changes in this PR text itself. That way, it's easier to see which comment refers to what piece of code. The only exception is to CoreNLP, which is a somewhat big change:

CoreNLP

I've reworked the doctests in that file to include server = CoreNLPServer(...), server.start(), server.stop() or with CoreNLPServer() as server:. That way, the tests actually do pass now if the CoreNLP jars are on CLASSPATH.

Thanks to @ekaf and @Madnex for working on the other doctests & compiling the broken ones.

Tom Aarsen

The space was always just filtered away by '_tokenize_words', so it does not affect the tokenizer. However, it should improve the output of debug_decisions, as there's now no longer the useless space in the 'text' field. Beyond that, it better matches the original vision of the method.

I'm not a big fan of writing to files without the 'with open(...) as f' context manager notation, but this will do just fine.

This works when doing pytest --doctest-modules ./nltk/, but not when calling it for exclusively ./nltk/text.py

Also fix a critical bug in MultiListBox that prevents it from being instantiated. I guess nobody has used this at all?

…PATH If not, they are skipped. Sadly this does make the docstrings a bit more confusing

The moment we've been working towards is here!

These can throw LookupError's if the SENNA env variable is defined, like in the CI. Long story short, it's easiest to skip these, and we won't miss out on test coverage, as Senna should already be tested in test/unit/test_senna.py by the Linux workers

tomaarsen · 2022-09-27T08:15:32Z

.github/workflows/ci.yaml

@@ -125,4 +125,4 @@ jobs:
      - name: Run pytest
        shell: bash
        run: |
-          pytest --numprocesses auto -rsx --doctest-modules nltk/test
+          pytest --numprocesses auto -rsx --doctest-modules nltk


--doctest-modules was already always on, but it didn't do much, as it only looked in nltk/test. Looking in nltk directly was broken until #3033, but it works now!

Thanks @tomaarsen, this should make CI much more robust. I did not expect that it would be ready so soon!

tomaarsen · 2022-09-27T08:17:11Z

nltk/classify/senna.py

@@ -30,9 +30,9 @@
 Note: Unit tests for this module can be found in test/unit/test_senna.py

 >>> from nltk.classify import Senna
->>> pipeline = Senna('/usr/share/senna-v3.0', ['pos', 'chk', 'ner'])
+>>> pipeline = Senna('/usr/share/senna-v3.0', ['pos', 'chk', 'ner'])  # doctest: +SKIP


All Senna(...) sections are skipped, as they break on the Windows and Mac CI's, because the SENNA environment variable does exist, but points to a path that doesn't.

tomaarsen · 2022-09-27T08:18:18Z

nltk/classify/senna.py

@@ -62,7 +62,7 @@ def __init__(self, senna_path, operations, encoding="utf-8"):
                self._path = path.normpath(environ["SENNA"]) + sep
                exe_file_2 = self.executable(self._path)
                if not path.isfile(exe_file_2):
-                    raise OSError(
+                    raise LookupError(


I believe that LookupError is more logical here. We also use LookupError in other cases where looking up a certain resource fails.

tomaarsen · 2022-09-27T08:18:44Z

nltk/corpus/reader/markdown.py

@@ -206,6 +204,8 @@ def concatenated_view(self, reader, fileids, categories):
        )

    def metadata_reader(self, stream):
+        from yaml import safe_load
+


This way we don't need to add pyyaml to the requirements unnecessarily.

tomaarsen · 2022-09-27T08:19:54Z

nltk/draw/table.py

-                >>> master = SpaceWidget(c, 0, 30)
-                >>> mlb = MultiListbox(master, 5, label_foreground='red')
+                >>> root = Tk()  # doctest: +SKIP
+                >>> MultiListbox(root, ["Subject", "Sender", "Date"], label_foreground='red').pack()  # doctest: +SKIP


The old code didn't really make sense: I thought master or root was always a tkinter.Tk() instance. Beyond that, it didn't compile I believe.

I also have to skip these, as Tk() fails on a device without a display.

tomaarsen · 2022-09-27T08:21:50Z

nltk/parse/transitionparser.py

@@ -752,6 +752,7 @@ def demo():
    >>> parser_std.train([gold_sent],'temp.arcstd.model', verbose=False)
     Number of training examples : 1
     Number of valid (projective) examples : 1
+    >>> input_file.close()


Temporary files were being removed before they were closed, causing exceptions (and leftover files on my device...)

tomaarsen · 2022-09-27T08:22:15Z

nltk/text.py

@@ -270,8 +270,7 @@ def findall(self, regexp):
        a single token must be surrounded by angle brackets.  E.g.

        >>> from nltk.text import TokenSearcher
-        >>> print('hack'); from nltk.book import text1, text5, text9
-        hack
+        >>> from nltk.book import text1, text5, text9


The hack is unnecessary here. The tests pass like this, too.

tomaarsen · 2022-09-27T08:24:45Z

nltk/text.py

-            text7: Wall Street Journal
-            text8: Personals Corpus
-            text9: The Man Who Was Thursday by G . K . Chesterton 1908
+            >>> from nltk.book import text4


This one is a bit tricky. nltk.book prints some stuff when imported, which is what we were testing for. However, it only prints it the first time it is imported.
if you do pytest --doctest-modules nltk/text.py, then it imports nltk.book for the first time here, and then it expects some printed output, but if you do pytest --doctest-modules ./nltk, then it imports it elsewhere differently (probably in some pytest code that traverses everything under ./nltk), and then we expect this import to print nothing.

Long story short: the new situation passes on pytest --doctest-modules nltk (which is the important case), but fails on pytest --doctest-modules nltk/text.py (which is the unimportant case).

tomaarsen · 2022-09-27T08:26:38Z

nltk/tokenize/punkt.py

-                last_space_index = previous_slice.start
-            prev_word_slice = slice(last_space_index, match.start())
+                index_after_last_space = previous_slice.start
+            prev_word_slice = slice(index_after_last_space, match.start())


This method, which is supposed to return a "context" word that comes before a specific token of interest, used to also return the whitespace that preceded that word. I've fixed this now, so it doesn't return the whitespace anymore. It should not affect the tokenization, as that already strip()-ed away the extra whitespace anyways.

tomaarsen · 2022-09-27T08:26:55Z

requirements-ci.txt

 regex
 scikit-learn
+tqdm
 twython


These are (now) required to run the CI due to the doctests.

tomaarsen · 2022-09-27T08:31:38Z

I'm going to make some small changes to how the CoreNLP tests are skipped. Currently, they are all skipped in the CI, which I also want to change.

That way, if that file changes, it should cause it to refresh the cache

ekaf · 2022-09-28T12:10:05Z

This looks good!

pytest -vv --doctest-modules --doctest-continue-on-failure ./nltk/

============================= test session starts ==============================
platform linux -- Python 3.9.2, pytest-7.1.2, pluggy-1.0.0 -- /usr/bin/python3
[...]
===== 687 passed, 41 skipped, 9 xfailed, 22 warnings in 417.37s (0:06:57) ======

stevenbird · 2022-10-04T07:04:40Z

Nice work @tomaarsen! Thanks @ekaf.

tomaarsen added 14 commits September 21, 2022 15:57

Close tempfile before attempting to remove it

be3503a

I'm not a big fan of writing to files without the 'with open(...) as f' context manager notation, but this will do just fine.

Fix text.py doctests by expecting nothing

55b9edf

This works when doing pytest --doctest-modules ./nltk/, but not when calling it for exclusively ./nltk/text.py

Fix table.py doctests through a skip and root = Tk()

1a1dc22

Also fix a critical bug in MultiListBox that prevents it from being instantiated. I guess nobody has used this at all?

Rework CoreNLP tests for 4.5.1, make them work if CoreNLP is on CLASS…

1f4a121

…PATH If not, they are skipped. Sadly this does make the docstrings a bit more confusing

Add doctests to the test suite

b53559a

The moment we've been working towards is here!

Disable Tk() calls as they do not work on CI

e7bc18a

Add new doctest requirements to requirements-ci.txt

41c7091

pycrfsuite -> python-crfsuite in requirements-ci.txt

ed956b6

Move yaml dependency away into the function that uses it

f025c71

Skip additional tkinter tests as they can't run on the CI

a15fd7c

Add tqdm as a requirement for the ci

c86b3ba

Replace OSError with LookupError when SENNA executable is not found

b3448ae

tomaarsen added tests CI labels Sep 27, 2022

tomaarsen commented Sep 27, 2022

View reviewed changes

tomaarsen marked this pull request as draft September 27, 2022 08:31

tomaarsen added 3 commits September 27, 2022 13:18

Add check_jar function, and use it in the CoreNLP tests

2009579

Update Stanford CoreNLP version for CI tests

d7a9d75

Include a hash of third-party.sh in the third party tools cache

1bfd34e

That way, if that file changes, it should cause it to refresh the cache

tomaarsen marked this pull request as ready for review September 27, 2022 14:00

ekaf mentioned this pull request Sep 28, 2022

Failing tests in DocStrings #2989

Closed

stevenbird merged commit a2af434 into nltk:develop Oct 4, 2022

tomaarsen deleted the tests/doctests branch October 4, 2022 07:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Resolve the last failing doctests, add doctests to CI #3054

Resolve the last failing doctests, add doctests to CI #3054

tomaarsen commented Sep 27, 2022 •

edited

tomaarsen Sep 27, 2022

ekaf Sep 28, 2022

stevenbird Oct 4, 2022

tomaarsen Sep 27, 2022

tomaarsen Sep 27, 2022

tomaarsen Sep 27, 2022

tomaarsen Sep 27, 2022

tomaarsen Sep 27, 2022

tomaarsen Sep 27, 2022

tomaarsen Sep 27, 2022

stevenbird Oct 4, 2022

tomaarsen Sep 27, 2022

tomaarsen Sep 27, 2022

tomaarsen commented Sep 27, 2022

ekaf commented Sep 28, 2022

stevenbird commented Oct 4, 2022

Resolve the last failing doctests, add doctests to CI #3054

Resolve the last failing doctests, add doctests to CI #3054

Conversation

tomaarsen commented Sep 27, 2022 • edited

Pull request overview

Details

CoreNLP

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tomaarsen commented Sep 27, 2022

ekaf commented Sep 28, 2022

stevenbird commented Oct 4, 2022

tomaarsen commented Sep 27, 2022 •

edited