EOFError during parallel read but not sequential #11449

Open
tupui opened this issue Jun 1, 2023 · 37 comments

Comments

@tupui

tupui commented Jun 1, 2023

Describe the bug

When building SciPy, I am now getting an EOFError on macOS (could reproduce on M1 and M2) with -j>1.

I am not sure what happened. I tried a few Python versions and a few versions of the doc dependencies, and I always get the parallelism error. I am now wondering whether an update of macOS itself could have broken something.

The issue is not present on Ubuntu.

Traceback
❯ python dev.py doc -j 4
💻  ninja -C /Users/tupui/Documents/Code/scipy/build -j8
ninja: Entering directory `/Users/tupui/Documents/Code/scipy/build'
[4/4] Generating scipy/generate-version with a custom command
Build OK
💻  meson install -C build --only-changed
Installing, see meson-install.log...
Installation OK
# for testing
# @echo installed scipy 2869717 matches git version 2869717; exit 1
mkdir -p build/html build/doctrees
LANG=C /opt/homebrew/Caskroom/mambaforge/base/envs/scipy-dev-311/bin/python -msphinx -WT --keep-going  -b html -d build/doctrees -j4  source build/html 
Running Sphinx v7.0.1
SciPy (VERSION 1.12.0.dev)
[autosummary] generating autosummary for: building/blas_lapack.rst, building/compilers_and_options.rst, building/cross_compilation.rst, building/distutils_equivalents.rst, building/index.rst, building/introspecting_a_build.rst, building/redistributable_binaries.rst, building/understanding_meson.rst, dev/api-dev/api-dev-toc.rst, dev/api-dev/nan_policy.rst, ..., tutorial/stats/discrete_zipf.rst, tutorial/stats/discrete_zipfian.rst, tutorial/stats/resampling.rst, tutorial/stats/sampling.rst, tutorial/stats/sampling_dau.rst, tutorial/stats/sampling_dgt.rst, tutorial/stats/sampling_hinv.rst, tutorial/stats/sampling_pinv.rst, tutorial/stats/sampling_srou.rst, tutorial/stats/sampling_tdr.rst
[autosummary] generating autosummary for: /Users/tupui/Documents/Code/scipy/doc/source/reference/generated/scipy.optimize.OptimizeResult.rst, /Users/tupui/Documents/Code/scipy/doc/source/reference/generated/scipy.optimize.RootResults.rst, /Users/tupui/Documents/Code/scipy/doc/source/reference/generated/scipy.sparse.bsr_array.rst, /Users/tupui/Documents/Code/scipy/doc/source/reference/generated/scipy.sparse.bsr_matrix.rst, /Users/tupui/Documents/Code/scipy/doc/source/reference/generated/scipy.sparse.coo_array.rst, /Users/tupui/Documents/Code/scipy/doc/source/reference/generated/scipy.sparse.coo_matrix.rst, /Users/tupui/Documents/Code/scipy/doc/source/reference/generated/scipy.sparse.csc_array.rst, /Users/tupui/Documents/Code/scipy/doc/source/reference/generated/scipy.sparse.csc_matrix.rst, /Users/tupui/Documents/Code/scipy/doc/source/reference/generated/scipy.sparse.csr_array.rst, /Users/tupui/Documents/Code/scipy/doc/source/reference/generated/scipy.sparse.csr_matrix.rst, /Users/tupui/Documents/Code/scipy/doc/source/reference/generated/scipy.sparse.dia_array.rst, /Users/tupui/Documents/Code/scipy/doc/source/reference/generated/scipy.sparse.dia_matrix.rst, /Users/tupui/Documents/Code/scipy/doc/source/reference/generated/scipy.sparse.dok_array.rst, /Users/tupui/Documents/Code/scipy/doc/source/reference/generated/scipy.sparse.dok_matrix.rst, /Users/tupui/Documents/Code/scipy/doc/source/reference/generated/scipy.sparse.lil_array.rst, /Users/tupui/Documents/Code/scipy/doc/source/reference/generated/scipy.sparse.lil_matrix.rst, /Users/tupui/Documents/Code/scipy/doc/source/reference/generated/scipy.stats._result_classes.PearsonRResult.rst, /Users/tupui/Documents/Code/scipy/doc/source/reference/generated/scipy.stats._result_classes.TtestResult.rst
loading intersphinx inventory from https://docs.python.org/3/objects.inv...
loading intersphinx inventory from https://numpy.org/devdocs/objects.inv...
loading intersphinx inventory from https://numpy.org/neps/objects.inv...
loading intersphinx inventory from https://matplotlib.org/stable/objects.inv...
loading intersphinx inventory from https://asv.readthedocs.io/en/stable/objects.inv...
loading intersphinx inventory from https://www.statsmodels.org/stable/objects.inv...
myst v0.18.1: MdParserConfig(commonmark_only=False, gfm_only=False, enable_extensions=[], disable_syntax=[], all_links_external=False, url_schemes=('http', 'https', 'mailto', 'ftp'), ref_domains=None, highlight_code_blocks=True, number_code_blocks=[], title_to_header=False, heading_anchors=None, heading_slug_func=None, footnote_transition=True, words_per_minute=200, sub_delimiters=('{', '}'), linkify_fuzzy_links=True, dmath_allow_labels=True, dmath_allow_space=True, dmath_allow_digits=True, dmath_double_inline=False, update_mathjax=True, mathjax_classes='tex2jax_process|mathjax_process|math|output_area')
myst-nb v0.17.2: NbParserConfig(custom_formats={}, metadata_key='mystnb', cell_metadata_key='mystnb', kernel_rgx_aliases={}, execution_mode='auto', execution_cache_path='', execution_excludepatterns=(), execution_timeout=30, execution_in_temp=False, execution_allow_errors=False, execution_raise_on_error=False, execution_show_tb=False, merge_streams=False, render_plugin='default', remove_code_source=False, remove_code_outputs=False, code_prompt_show='Show code cell {type}', code_prompt_hide='Hide code cell {type}', number_source_lines=False, output_stderr='show', render_text_lexer='myst-ansi', render_error_lexer='ipythontb', render_image_options={}, render_figure_options={}, render_markdown_format='commonmark', output_folder='build', append_css=True, metadata_to_fm=False)
Using jupyter-cache at: /Users/tupui/Documents/Code/scipy/doc/build/.jupyter_cache
building [mo]: targets for 0 po files that are out of date
writing output... 
building [html]: targets for 4430 source files that are out of date
updating environment: [new config] 4430 added, 0 changed, 0 removed
0.00s - Debugger warning: It seems that frozen modules are being used, which may
0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off
0.00s - to python to disable frozen modules.
0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.
0.00s - Debugger warning: It seems that frozen modules are being used, which may
0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off
0.00s - to python to disable frozen modules.
0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.
<string>:46: DeprecationWarning: invalid escape sequence '\,'
<string>:46: DeprecationWarning: invalid escape sequence '\,'
<string>:50: DeprecationWarning: invalid escape sequence '\d'
<string>:46: DeprecationWarning: invalid escape sequence '\l'

Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/scipy-dev-311/lib/python3.11/site-packages/sphinx/cmd/build.py", line 285, in build_main
    app.build(args.force_all, args.filenames)
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/scipy-dev-311/lib/python3.11/site-packages/sphinx/application.py", line 351, in build
    self.builder.build_update()
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/scipy-dev-311/lib/python3.11/site-packages/sphinx/builders/__init__.py", line 294, in build_update
    self.build(to_build,
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/scipy-dev-311/lib/python3.11/site-packages/sphinx/builders/__init__.py", line 311, in build
    updated_docnames = set(self.read())
                           ^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/scipy-dev-311/lib/python3.11/site-packages/sphinx/builders/__init__.py", line 416, in read
    self._read_parallel(docnames, nproc=self.app.parallel)
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/scipy-dev-311/lib/python3.11/site-packages/sphinx/builders/__init__.py", line 472, in _read_parallel
    tasks.join()
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/scipy-dev-311/lib/python3.11/site-packages/sphinx/util/parallel.py", line 99, in join
    if not self._join_one():
           ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/scipy-dev-311/lib/python3.11/site-packages/sphinx/util/parallel.py", line 117, in _join_one
    exc, logs, result = pipe.recv()
                        ^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/scipy-dev-311/lib/python3.11/multiprocessing/connection.py", line 249, in recv
    buf = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/scipy-dev-311/lib/python3.11/multiprocessing/connection.py", line 413, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/scipy-dev-311/lib/python3.11/multiprocessing/connection.py", line 382, in _recv
    raise EOFError
EOFError

Exception occurred:
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/scipy-dev-311/lib/python3.11/multiprocessing/connection.py", line 382, in _recv
    raise EOFError
EOFError
The full traceback has been saved in /var/folders/5x/5blbwvrs6l5_wct5ncsp5f_m0000gn/T/sphinx-err-cfecful5.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
make: *** [html-build] Error 2

How to Reproduce

This can be hard to do... We have a few guides to build SciPy and the doc itself https://scipy.github.io/devdocs/building/index.html#building-from-source

If you do that, use conda/mamba as it's almost straightforward. And don't even try on Windows as it can be very hard to do. It "should" just be:

brew install gfortran openblas pkg-config
git clone https://github.com/scipy/scipy.git
cd scipy
git submodule update --init
mamba env create -f environment.yml
mamba activate scipy-dev
python dev.py build

Once you have a build, then to build the doc: python dev.py doc -j x

Environment Information

Platform:              darwin; (macOS-13.3.1-arm64-arm-64bit)
Python version:        3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:26:40) [Clang 14.0.6 ])
Python implementation: CPython
Sphinx version:        7.0.1
Docutils version:      0.19
Jinja2 version:        3.1.2
Pygments version:      2.14.0

Sphinx extensions

extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.autosummary',
    'sphinx.ext.coverage',
    'sphinx.ext.mathjax',
    'sphinx.ext.intersphinx',
    'numpydoc',
    'sphinx_design',
    'scipyoptdoc',
    'doi_role',
    'matplotlib.sphinxext.plot_directive',
    'myst_nb',
]

Additional context

No response

@picnixz
Member

picnixz commented Jun 1, 2023

Can you check with another version of Sphinx, like 6.2.1? Or can you remove all non-Sphinx extensions (everything after intersphinx)? Also, can you run the build with -vv so that we know "where" it failed (attach the full traceback)?

@tupui
Author

tupui commented Jun 1, 2023

@picnixz thank you for the reply.

I have the same behaviour with other versions of Sphinx. Here is the log with 6.2.1

log.txt

But good news, I tried to disable the extensions and enable them one after the other and the culprit seems to be matplotlib.sphinxext.plot_directive.

So unless there was a change of behaviour in Sphinx, I suppose I need to raise this on Matplotlib's side. Feel free to close 😃 Thanks again for the help.

@ksunden could you have a look?? 😄

@ksunden
Contributor

ksunden commented Jun 1, 2023

I don't have a mac, but I've brought it up in mpl's channels

@tupui
Author

tupui commented Jun 1, 2023

Thanks 🙏

@QuLogic
Contributor

QuLogic commented Jun 1, 2023

Do you have a crash dump handler on macs? I suspect that this doesn't directly have anything to do with Sphinx or Matplotlib's plot directive, but rather that some plotted example is crashing via an incompatible openblas or similar library, and Sphinx's multiprocessing sees that as an EOF.
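The mechanism described here can be reproduced without Sphinx. Below is a minimal sketch, assuming a POSIX system where the fork start method is available. The names healthy_worker, crashing_worker and run_and_receive are hypothetical, and os._exit only stands in for a native crash inside a library:

```python
import multiprocessing
import os


def healthy_worker(pipe):
    # A well-behaved worker sends its result back before exiting.
    pipe.send("ok")
    pipe.close()


def crashing_worker(pipe):
    # Stand-in for a native crash: the process dies at once, so no
    # except/finally block runs and nothing is ever sent on the pipe.
    os._exit(1)


def run_and_receive(worker):
    """Run ``worker`` in a forked child and return (outcome, exit code)."""
    ctx = multiprocessing.get_context("fork")  # assumption: POSIX only
    recv_end, send_end = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=worker, args=(send_end,))
    proc.start()
    send_end.close()  # the parent keeps only the read end
    try:
        outcome = recv_end.recv()
    except EOFError:
        # Every write end is closed and no message ever arrived.
        outcome = "EOFError"
    proc.join()
    return outcome, proc.exitcode
```

Calling run_and_receive(crashing_worker) ends in the EOFError branch: the parent only learns that the pipe was closed, not why the child died.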

@picnixz
Member

picnixz commented Jun 2, 2023

@tupui Thanks for the log but the log appears to be even less instructive xD I thought there would be more information about the traceback itself but it appears not (I don't know whether the traceback (usually stored in a temporary file) was what you actually gave me or whether it was the console output).

Concerning @QuLogic's comment, I also think this may be the case. I had a quick look at the Sphinx directive but it doesn't appear to have been updated recently (the last commit dates back 5 months). So my take is that an external library is making it crash. Now, it appears that the env-merge-info event fails when hitting something between reference/generated/scipy.linalg.blas.csscal and reference/generated/scipy.linalg.blas.zhpr2 (I don't know which file in particular), so it might be worth looking at those files and trying to build them separately (or only building scipy.linalg.blas.*).

@tupui
Author

tupui commented Jun 2, 2023

Do you have a crash dump handler on Macs?

mmm not sure what that means 😅

Thank you both for the insights. Now that I am getting convinced there is an issue with a plot, I will try git bisect to see which plot might be causing it; hopefully it calls some BLAS function. Since there appear to be no issues with a serial build, I suppose it has to do with BLAS losing its mind under multiprocessing. I will do this now and keep you updated.

@tupui
Author

tupui commented Jun 2, 2023

(Happy to move this issue elsewhere if wanted, let me know.)

I did not manage to get a working build, so I cannot do a proper bisect. Strange; I have no clue what is happening. Maybe it is really linked to a system dependency or a version of macOS (I also tried a previous version of OpenBLAS and Python 3.10, but got the same issue).

What worked was changing the parallel_read_safe flag to False in the extension. Then the read part is indeed sequential (funnily enough, the warning about the read being sequential is printed at the end of the read phase, not before).

@QuLogic do you have hints on how I can catch which example is failing? So far I did not manage to put a print/something at the right spot.

@tacaswell
Contributor

I assume Elliott was referring to https://docs.python.org/3/library/faulthandler.html

Are we sure that the backend is being forced to agg by sphinx? GUI stuff in sub-processes can also lead to hard process death....

Where did you get all of your compiled things from (I assume you compiled SciPy yourself and got everything else from CF)?

matplotlib/matplotlib#15410 may also be relevant?

@tupui
Author

tupui commented Jun 2, 2023

@tacaswell thank you for having a look.

Yes I am compiling SciPy (I am a maintainer) and the rest is from Conda-forge.

The linked issue could be related. I tried mp.set_start_method('forkserver') (in conf.py and also directly at the beginning of Sphinx's parallel functions), but it made no difference.

I also think I am using agg, as we have this in our conf.py, and I also tried adding it to plot_pre_code.
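One way to pin the backend for the whole build, including forked workers, is Matplotlib's MPLBACKEND environment variable. A minimal conf.py fragment, assuming it runs before Matplotlib is first imported:

```python
# conf.py fragment: must run before matplotlib is first imported.
import os

# Child processes inherit the environment, so workers see the same setting.
os.environ["MPLBACKEND"] = "agg"
```

Printing matplotlib.get_backend() from within a plot would be one way to confirm which backend the workers actually end up with.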


Otherwise, here is my journey 😅

I could pin point some code that would raise the error 100%:

    >>> import numpy as np
    >>> import matplotlib.pyplot as plt
    >>> from scipy.interpolate import RBFInterpolator
    >>> from scipy.stats.qmc import Halton

    >>> rng = np.random.default_rng()
    >>> xobs = 2*Halton(2, seed=rng).random(100) - 1
    >>> yobs = np.sum(xobs, axis=1)*np.exp(-6*np.sum(xobs**2, axis=1))

    >>> xgrid = np.mgrid[-1:1:50j, -1:1:50j]
    >>> xflat = xgrid.reshape(2, -1).T
    >>> yflat = RBFInterpolator(xobs, yobs)(xflat)
    >>> ygrid = yflat.reshape(50, 50)

    >>> fig, ax = plt.subplots()
    >>> ax.pcolormesh(*xgrid, ygrid, vmin=-0.25, vmax=0.25, shading='gouraud')
    >>> p = ax.scatter(*xobs.T, c=yobs, s=50, ec='k', vmin=-0.25, vmax=0.25)
    >>> fig.colorbar(p)
    >>> plt.show()

And the culprit here is the line with RBFInterpolator. If I remove it, the build goes a bit further and then moves on to the next failure.

The "funny" thing is that using -j 2 was not making this code fail. And then I found another code that would fail which has no SciPy at all (this is slightly modified and also fails):

    >>> import numpy as np
    >>> import matplotlib.pyplot as plt
    >>> n_grid = 100
    >>> u = np.linspace(0, np.pi, n_grid)
    >>> v = np.linspace(0, 2 * np.pi, n_grid)
    >>> u_grid, v_grid = np.meshgrid(u, v)
    >>> vertices = np.stack([np.cos(v_grid) * np.sin(u_grid),
    ...                      np.sin(v_grid) * np.sin(u_grid),
    ...                      np.cos(u_grid)],
    ...                     axis=2)
    >>> x = np.outer(np.cos(v), np.sin(u))
    >>> y = np.outer(np.sin(v), np.sin(u))
    >>> z = np.outer(np.ones_like(u), np.cos(u))
    >>> fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(9, 4),
    ...                          subplot_kw={"projection": "3d"})
    >>> axes[0].plot_surface(x, y, z, rstride=1, cstride=1,
    ...                     linewidth=0)

This is failing with -j 8 but not with -j 2. And the line causing the issue is the plot_surface.

So I thought about using the example from Matplotlib's doc around plot_surface... and it works just fine. But, back to the code above, I tried to change n_grid = 10 and 🎉 no errors.

Hence, my conclusion for now is that it could be a memory limitation. Which could also explain the failure with RBFInterpolator as this is also memory hungry. I could confirm this by lowering the number of points used by the function and it was not failing anymore.

Do you have any advice, then, on how I can let it use the memory it needs?

@tacaswell
Contributor

The other thing we have noted is that the cross-referencing parts of Sphinx are really memory hungry (from memory, something like 1 GB(!) per process).

Have you watched memory usage while it is running?

@tupui
Author

tupui commented Jun 2, 2023

I can try a real memory profiler (would be a good excuse to try memray I suppose 😅), but according to top all processes are around 200 MB. If I use -j 2 I can see that the memory increases progressively to 300 MB and then it crashes.

Maybe there is a sudden peak that I am not seeing, and it crashes right away.
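A short-lived peak is easy to miss in top. As a rough cross-check, the operating system records the peak resident set size of child processes that have already been waited for. A minimal sketch, assuming a Unix system (the resource module is not available on Windows); peak_child_rss_bytes and run_and_measure are hypothetical helper names:

```python
import resource
import subprocess
import sys


def peak_child_rss_bytes():
    """Peak resident set size among waited-for children, in bytes."""
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    return peak if sys.platform == "darwin" else peak * 1024


def run_and_measure(code):
    """Run ``code`` in a fresh interpreter; return (exit code, peak RSS)."""
    proc = subprocess.run([sys.executable, "-c", code])
    return proc.returncode, peak_child_rss_bytes()
```

For example, run_and_measure("x = bytearray(50 * 1024 * 1024)") reports the peak after the child has exited. The value is a maximum over all children reaped so far, so it only bounds the memory of any single worker.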

@picnixz
Member

picnixz commented Jun 3, 2023

EDIT: I am not entirely sure what I suggested may help you actually.

If you can modify the sources of the Sphinx code you are using, you can technically print the exact exception as follows:

# In sphinx/util/parallel.py (ParallelTasks class)

def _process(self, pipe: Any, func: Callable, arg: Any) -> None:
    try:
        collector = logging.LogCollector()
        with collector.collect():
            if arg is None:
                ret = func()
            else:
                ret = func(arg)
        failed = False
    except BaseException as err:
        print(traceback.format_exc())  # add this line
        failed = True
        errmsg = traceback.format_exception_only(err.__class__, err)[0].strip()
        ret = (errmsg, traceback.format_exc())
    logging.convert_serializable(collector.logs)
    pipe.send((failed, collector.logs, ret))

This is the location where any exception[1] should technically be caught. Also, the traceback being shown with -T is not the traceback that was stored (and actually, this specific traceback is not formatted).

Footnotes

  1. Warnings treated as errors are not caught here because they are part of the collector.logs list. Their corresponding exception will be raised later upon calling the logging filter on the log records.

@tupui
Author

tupui commented Jun 5, 2023

It seems like no exception is even being raised. Adding that does not do anything.

@picnixz
Member

picnixz commented Jun 5, 2023

I see, then it is really the pipe.recv() call that's raising the exception. Perhaps you can add print(failed, collector.logs, ret) just after logging.convert_serializable to inspect the content of each call? Also, if you want, instead of pipe.send((failed, collector.logs, ret)), you could send the process and thread IDs along as well, with pipe.send((os.getpid(), threading.get_ident(), failed, collector.logs, ret)), so that on pipe.recv() you can see what was actually received (in case (failed, collector.logs, ret) happens to be the same for several tasks). You'll need to change exc, logs, result = pipe.recv() (same file) to pid, tid, exc, logs, result = pipe.recv(), and you can print the pid/tid afterwards.

@tupui
Author

tupui commented Jun 5, 2023

print(failed, collector.logs, ret) is not printing anything. The process seems to be halting before that.

I also cannot inspect after pid, tid, exc, logs, result = pipe.recv() since it's breaking in the call to recv.

@picnixz
Member

picnixz commented Jun 5, 2023

is not printing anything. The process seems to be halting before that.

Wait, how can the process halt before this? According to your original message, the exception occurred during pipe.recv(), so doesn't that mean that nothing was received at all? Try to print before logging.convert_serializable(collector.logs) as well. If nothing is printed, then I really don't see how that is possible (I mean, you caught nothing in the except block).

@tupui
Author

tupui commented Jun 5, 2023

Yeah I am not sure either how this all works. Putting a print at the top of this function before the try block prints something and if I put another print after the block, nothing is printed.

@picnixz
Member

picnixz commented Jun 5, 2023

If you can, try to step through each line with a debugger. Otherwise, let's do it the old way: print 1, 2, 3, etc. after each line and check which line fails.
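One caveat with print debugging in a worker that dies abruptly: when stdout is not a terminal, Python buffers it, and anything still in the buffer is lost with the process. The sketch below illustrates this with os._exit standing in for the crash; run_child is a hypothetical helper name:

```python
import os
import subprocess
import sys

CHILD_CODE = """
import os
print("step 1", flush=True)  # flushed at once, survives the crash
print("step 2")              # still sitting in the buffer ...
os._exit(1)                  # ... when the process dies abruptly
"""


def run_child():
    # Drop PYTHONUNBUFFERED so the child buffers stdout as it normally
    # would when writing to a pipe.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONUNBUFFERED"}
    proc = subprocess.run([sys.executable, "-c", CHILD_CODE],
                          capture_output=True, text=True, env=env)
    return proc.stdout
```

Here only "step 1" reaches the parent, so passing flush=True (or printing to sys.stderr) makes numbered checkpoints more trustworthy.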

@tupui
Author

tupui commented Jun 5, 2023

It's calling ret = func(arg) then crashes. I will try to hook up a dbg right before.

@picnixz
Member

picnixz commented Jun 5, 2023

Then it's good to see that it's crashing there at least! But how come the except block is not triggered in that case? Check that traceback.format_exc() is actually not the empty string.

@tupui
Author

tupui commented Jun 5, 2023

It's not even doing a simple print after the block. No exception is being raised; it seems like the process is just hard killed.
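When a child is killed by a signal, multiprocessing exposes that as a negative exitcode on the Process object, which can help confirm a hard kill. A minimal sketch, assuming a POSIX system with the fork start method; describe_exit, kill_self and run_in_child are hypothetical names, and SIGKILL only stands in for the real crash:

```python
import multiprocessing
import os
import signal


def describe_exit(exitcode):
    """Turn a multiprocessing exit code into a readable description."""
    if exitcode is None:
        return "still running"
    if exitcode < 0:
        # A negative value -N means the child was terminated by signal N.
        return f"killed by {signal.Signals(-exitcode).name}"
    return f"exited with status {exitcode}"


def kill_self():
    # Stand-in for a hard crash; a real segfault would report SIGSEGV.
    os.kill(os.getpid(), signal.SIGKILL)


def run_in_child(func):
    ctx = multiprocessing.get_context("fork")  # assumption: POSIX only
    proc = ctx.Process(target=func)
    proc.start()
    proc.join()
    return describe_exit(proc.exitcode)
```

Running a suspect snippet through run_in_child outside of Sphinx would show whether it dies from a signal rather than from a Python exception.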

@picnixz
Member

picnixz commented Jun 5, 2023

  • When it crashes, does it print anything?
  • Try to run Python with -X faulthandler.
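For context, -X faulthandler makes the interpreter dump the Python traceback when it receives a fatal signal such as SIGSEGV. A minimal sketch of the effect, assuming a POSIX system; the child sends itself SIGSEGV as a stand-in for a real crash, and crash_with_faulthandler is a hypothetical helper name:

```python
import signal
import subprocess
import sys

# The child interpreter sends SIGSEGV to itself.
CRASH_CODE = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"


def crash_with_faulthandler():
    """Crash a child interpreter started with -X faulthandler."""
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", CRASH_CODE],
        capture_output=True, text=True,
    )
    # stderr holds the "Fatal Python error" report written by faulthandler.
    return proc.returncode, proc.stderr
```

Setting the PYTHONFAULTHANDLER=1 environment variable, or calling faulthandler.enable() early in conf.py, should have the same effect when the build is launched through a Makefile.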

@tupui
Author

tupui commented Jun 5, 2023

When it crashes it does not print more than the EOFError thing.

So far I have been building with the Makefile. I will try calling Python directly, and then I can also use a proper debugger.

@tupui
Author

tupui commented Jun 5, 2023

Running with -X faulthandler leads to the following segfault.

Fatal Python error: Segmentation fault

Current thread 0x00000001f708de00 (most recent call first):
  File "<__array_function__ internals>", line 200 in dot
  File "/Users/tupui/Documents/Code/scipy/build-install/lib/python3.11/site-packages/scipy/interpolate/_rbfinterp.py", line 460 in _chunk_evaluator
  File "/Users/tupui/Documents/Code/scipy/build-install/lib/python3.11/site-packages/scipy/interpolate/_rbfinterp.py", line 494 in __call__
  File "<string>", line 14 in <module>
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/scipy-dev-311/lib/python3.11/site-packages/matplotlib/sphinxext/plot_directive.py", line 483 in _run_code
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/scipy-dev-311/lib/python3.11/site-packages/matplotlib/sphinxext/plot_directive.py", line 600 in render_figures
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/scipy-dev-311/lib/python3.11/site-packages/matplotlib/sphinxext/plot_directive.py", line 761 in run
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/scipy-dev-311/lib/python3.11/site-packages/matplotlib/sphinxext/plot_directive.py", line 253 in run
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/scipy-dev-311/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 2154 in run_directive

Seems like plot_directive is using exec under the hood and if there is an error here it's not able to exit gracefully.

@tupui
Author

tupui commented Jul 4, 2023

@picnixz what should we do here?

@picnixz
Member

picnixz commented Jul 4, 2023

Well, I thought that you had actually opened an issue on Matplotlib's repo, since it's their directive that appears to be causing the issue (here we are only responsible for Sphinx core development and not extension issues, so I assumed that my role was done).

If the issue is also on our side, then we would work on it but for now I would first ask MPL developers.

@tupui
Author

tupui commented Jul 4, 2023

Fair enough 😅 although one could argue that Sphinx itself should be resilient to "misbehaving" extensions and not crash without at least an error saying where it's coming from. (I suppose it could be dealt with in #11451 too.)

@ksunden @tacaswell are you already looking at it from your perspective or shall I open an issue?

@QuLogic
Contributor

QuLogic commented Jul 4, 2023

The problem is in SciPy (or its dependencies); it is crashing Python in an unrecoverable manner. There is nothing the plot directive extension could do to catch this, just as there is nothing regular Sphinx could do if just importing a module crashed the interpreter in a serial build.

@tupui
Author

tupui commented Jul 4, 2023

@QuLogic I slightly disagree here. The extension works in a serial build but not in a parallel one on macOS, and on Ubuntu everything works. The way the extension executes the examples makes it very hard to know which example failed. I don't know what sort of error in SciPy there might be, as there just seems to be a memory limitation in the way examples are run. Also, it's not related to SciPy: as described above, I could reproduce it with a very simple example with just NumPy.

@ngoldbaum
Contributor

I just started hitting this with the numpy docs on a PR that made some docs changes to numpy.

@ngoldbaum
Contributor

Here's the traceback I'm getting from faulthandler, which terminates in IPython (!!)

Fatal Python error: Segmentation fault
nerated/numpy.lib.npyio.NpzFile .. reference/generated/numpy.linalg.vector_norm

Current thread 0x00000001dd5dd000 (most recent call first):
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/IPython/core/history.py", line 247 in init_db
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/IPython/core/history.py", line 77 in catch_corrupt_db
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/decorator.py", line 232 in fun
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/IPython/core/history.py", line 222 in __init__
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/IPython/core/history.py", line 542 in __init__
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 1884 in init_history
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 619 in __init__
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/traitlets/config/configurable.py", line 583 in instance
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/IPython/sphinxext/ipython_directive.py", line 364 in __init__
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/IPython/sphinxext/ipython_directive.py", line 964 in setup
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/IPython/sphinxext/ipython_directive.py", line 1006 in run
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 2154 in run_directive
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 2104 in directive
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 2367 in explicit_construct
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 2355 in explicit_markup
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/statemachine.py", line 445 in check_line
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/statemachine.py", line 233 in run
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 195 in run
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 279 in nested_parse
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 391 in new_subsection
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 325 in section
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 2785 in underline
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/statemachine.py", line 445 in check_line
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/statemachine.py", line 233 in run
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 195 in run
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 279 in nested_parse
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 391 in new_subsection
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 325 in section
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 2785 in underline
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/statemachine.py", line 445 in check_line
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/statemachine.py", line 233 in run
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 195 in run
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 279 in nested_parse
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 391 in new_subsection
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 325 in section
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 2785 in underline
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/statemachine.py", line 445 in check_line
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/statemachine.py", line 233 in run
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 169 in run
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/sphinx/parsers.py", line 81 in parse
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/readers/__init__.py", line 76 in parse
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/sphinx/io.py", line 105 in read
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/docutils/core.py", line 234 in publish
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/sphinx/builders/__init__.py", line 498 in read_doc
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/sphinx/builders/__init__.py", line 459 in read_process
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/sphinx/util/parallel.py", line 76 in _process
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/multiprocessing/process.py", line 108 in run
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/multiprocessing/process.py", line 314 in _bootstrap
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/multiprocessing/popen_fork.py", line 71 in _launch
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/multiprocessing/popen_fork.py", line 19 in __init__
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/multiprocessing/context.py", line 281 in _Popen
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/multiprocessing/process.py", line 121 in start
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/sphinx/util/parallel.py", line 135 in _join_one
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/sphinx/util/parallel.py", line 102 in join
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/sphinx/builders/__init__.py", line 474 in _read_parallel
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/sphinx/builders/__init__.py", line 418 in read
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/sphinx/builders/__init__.py", line 313 in build
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/sphinx/builders/__init__.py", line 293 in build_update
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/sphinx/application.py", line 355 in build
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/sphinx/cmd/build.py", line 298 in build_main
  File "/Users/goldbaum/.pyenv/versions/3.11.6/lib/python3.11/site-packages/sphinx/cmd/build.py", line 341 in main
  File "/Users/goldbaum/.pyenv/versions/3.11.6/bin/sphinx-build", line 8 in <module>

@ngoldbaum
Contributor

Apparently my IPython SQLite history DB was corrupted? Anyway, running ipython history clear at the command line fixed it. So I guess this isn't the exact same issue, but it has the same symptom: when a Sphinx parallel doc builder segfaults, the error you get is extremely confusing, although given the above discussion I'm not sure what Sphinx can do.

Maybe make it a little easier to run sphinx under faulthandler? I had to hack my makefile to get that to work.
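To illustrate what faulthandler adds when a worker dies, here is a minimal sketch that is not taken from Sphinx. It assumes a POSIX system, and the names `CRASH_SNIPPET` and `run_crashing_child` are hypothetical. A child interpreter sends itself SIGSEGV to stand in for a crashing worker; with `-X faulthandler`, the child writes a "Fatal Python error" report with a stack dump to stderr before it dies.

```python
import subprocess
import sys

# Hypothetical stand-in for a crashing worker: the child sends SIGSEGV to itself.
CRASH_SNIPPET = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"


def run_crashing_child(with_faulthandler):
    """Run the crashing snippet in a child interpreter and collect its stderr."""
    cmd = [sys.executable]
    if with_faulthandler:
        # Same switch as in the Makefile hack: enable faulthandler at startup.
        cmd += ["-X", "faulthandler"]
    cmd += ["-c", CRASH_SNIPPET]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    returncode, stderr = run_crashing_child(with_faulthandler=True)
    print("child exit status:", returncode)
    print(stderr)
```

Without the flag, the parent typically sees only an abnormal exit status, which is roughly the situation a parallel build is in when a forked reader process segfaults.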

@ngoldbaum
Contributor

Ah, the issue is that it's full-stop not legal to use a sqlite3 connection across forks, and multiprocessing uses fork here (note the popen_fork frames in the traceback above).
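The usual way to stay within that rule is to close any connection before forking and let each child open its own. The sketch below is an illustration only, not what IPython or Sphinx do: the table name `history`, the temporary database file, and the helper names are all hypothetical, and it assumes a platform where the `fork` start method is available.

```python
import multiprocessing
import os
import sqlite3
import tempfile


def count_rows(db_path, queue):
    # Open a fresh connection inside the child; never reuse the parent's handle.
    conn = sqlite3.connect(db_path)
    try:
        (n,) = conn.execute("SELECT COUNT(*) FROM history").fetchone()
        queue.put(n)
    finally:
        conn.close()


def run_demo():
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "history.sqlite")

        # Parent: create and fill the database, then close the connection
        # so that no open SQLite handle is inherited across fork().
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE history (line TEXT)")
        conn.executemany("INSERT INTO history VALUES (?)", [("a",), ("b",), ("c",)])
        conn.commit()
        conn.close()

        ctx = multiprocessing.get_context("fork")
        queue = ctx.Queue()
        proc = ctx.Process(target=count_rows, args=(db_path, queue))
        proc.start()
        count = queue.get(timeout=30)
        proc.join()
        return count, proc.exitcode


if __name__ == "__main__":
    print(run_demo())
```

Only the database path crosses the process boundary here. The connection object itself is created and closed inside a single process.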

@picnixz
Member

picnixz commented Feb 14, 2024

Honestly, it's extremely difficult for us to know exactly what to do. Our best shot is a troubleshooting page (or maybe a pinned issue) that summarizes what people tried and how they fixed their bugs.

> Maybe make it a little easier to run sphinx under faulthandler? I had to hack my makefile to get that to work.

That is something we can do, however. Can you share your Makefile, please?

@ngoldbaum
Contributor

ngoldbaum commented Feb 14, 2024

Here's the diff for the numpy makefile:

diff --git a/doc/Makefile b/doc/Makefile
index 2f04c7084c..639312c095 100644
--- a/doc/Makefile
+++ b/doc/Makefile
@@ -8,11 +8,11 @@
 # evaluate it now to allow easier debugging when printing the variable

 PYVER:=$(shell python3 -c 'from sys import version_info as v; print("{0}.{1}".format(v[0], v[1]))')
-PYTHON = python$(PYVER)
+PYTHON = python$(PYVER) -X faulthandler

 # You can set these variables from the command line.
 SPHINXOPTS    ?=
-SPHINXBUILD   ?= LANG=C sphinx-build
+SPHINXBUILD   ?= LANG=C $(PYTHON) /Users/goldbaum/.pyenv/versions/3.11.6/bin/sphinx-build
 PAPER         ?=
 DOXYGEN       ?= doxygen
 # For merging a documentation archive into a git checkout of numpy/doc

I needed to make it so I could call the sphinx-build script with python directly, instead of indirectly through the shebang.

One thing that would help is if it were possible to run sphinx-build like e.g. python -m sphinx-build. Maybe it is already and I just didn't figure out the incantation?

@QuLogic
Contributor

QuLogic commented Feb 15, 2024

As noted above the ?= assignments, you can set those on the command line already, so something like make SPHINXBUILD="LANG=C python -X faulthandler -m sphinx" should work. Note that this doesn't cover the other uses of PYTHON, but from what I can tell those are just preprocessing steps that probably aren't failing.
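For builds driven from Python rather than make, the same invocation can be assembled as an argument list. This is a small sketch: the function name `sphinx_faulthandler_cmd` and its defaults are hypothetical, while `-X faulthandler`, `-m sphinx`, `-b`, and `-j` are the options already used in this thread.

```python
import sys


def sphinx_faulthandler_cmd(source_dir, out_dir, jobs=4, builder="html"):
    """Build an argv list that runs Sphinx as a module with faulthandler enabled."""
    return [
        sys.executable,
        "-X", "faulthandler",  # dump Python tracebacks on fatal signals
        "-m", "sphinx",        # run Sphinx as a module, no sphinx-build script needed
        "-b", builder,
        "-j", str(jobs),       # parallel read/write, as in the failing builds
        source_dir,
        out_dir,
    ]


if __name__ == "__main__":
    print(" ".join(sphinx_faulthandler_cmd("source", "build/html")))
```

The list can be passed to `subprocess.run(cmd, check=True)`. Setting the `PYTHONFAULTHANDLER` environment variable to a non-empty value is another way to turn faulthandler on without changing the command at all.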
