EOFError during parallel read but not sequential
#11449
Comments
Can you check using another version of Sphinx, like 6.2.1? Or can you remove all non-Sphinx extensions (everything after intersphinx)? Also, can you run the build with |
@picnixz thank you for the reply. I have the same behaviour with other versions of Sphinx. Here is the log with 6.2.1. But good news: I tried disabling the extensions and enabling them one after the other, and the culprit seems to be So unless there is a change of behaviour in Sphinx, I suppose this is more on Matplotlib's side and I need to raise it there. Feel free to close 😃 Thanks again for the help. @ksunden could you have a look?? 😄 |
I don't have a mac, but I've brought it up in mpl's channels |
Thanks 🙏 |
Do you have a crash dump handler on macs? I suspect that this doesn't directly have anything to do with Sphinx or Matplotlib's plot directive, but rather that some plotted example is crashing via an incompatible openblas or similar library, and Sphinx's multiprocessing sees that as an EOF. |
@tupui Thanks for the log but the log appears to be even less instructive xD I thought there would be more information about the traceback itself but it appears not (I don't know whether the traceback (usually stored in a temporary file) was what you actually gave me or whether it was the console output). Concerning @QuLogic, I also think this may be the case. I had a quick look at the Sphinx directive but it doesn't appear to have been updated recently (last commit dates back to 5 months ago). So my take is that an external library is making it crash. Now, it appears that the |
Mmm, not sure what that means 😅 Thank you both for the insights. Now that I am getting convinced the issue is with a plot, I will try to git bisect to see which plot might be causing it; hopefully it calls some BLAS function. Since there appear to be no issues with the serial build, I suppose it has to do with BLAS losing its mind with multiprocessing. I will do this now and keep you updated. |
(Happy to move this issue elsewhere if wanted, let me know.) I did not manage to get a working build, so I cannot do a proper bisect... Strange, I have no clue what is happening; maybe it is really linked to a system dependency or a version of macOS (I also tried a previous version of openblas and Python 3.10, but same issue). What worked is in the extension to change the flag @QuLogic do you have hints on how I can catch which example is failing? So far I did not manage to put a print/something at the right spot. |
I assume Elliott was referring to https://docs.python.org/3/library/faulthandler.html Are we sure that the backend is being forced to Where did you get all of your compiled things from (I assume you compiled scipy yourself and everything else from CF)? matplotlib/matplotlib#15410 may also be relevant? |
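For readers unfamiliar with `faulthandler`, here is a minimal sketch of what it buys you when a process dies from a fatal signal rather than a Python exception. It is not taken from the thread: `CRASHING_SNIPPET` and `run_with_faulthandler` are hypothetical names, the crash is simulated by raising `SIGSEGV` in a child interpreter, and POSIX signal behaviour is assumed.

```python
import subprocess
import sys

# Hypothetical stand-in for "some plotted example crashes the interpreter":
# the child raises SIGSEGV on itself instead of raising a Python exception.
CRASHING_SNIPPET = "import signal; signal.raise_signal(signal.SIGSEGV)"


def run_with_faulthandler(code):
    """Run `code` in a child interpreter started with -X faulthandler."""
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    result = run_with_faulthandler(CRASHING_SNIPPET)
    # The child is killed by the signal, but faulthandler first writes
    # a "Fatal Python error" header and a Python-level traceback to stderr.
    print("return code:", result.returncode)
    print(result.stderr)
```

Without `-X faulthandler` the child dies silently, which is consistent with the bare `EOFError` the parent reports in this issue.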
@tacaswell thank you for having a look. Yes, I am compiling SciPy (I am a maintainer) and the rest is from conda-forge. The linked issue could be related, I tried I also think I am using Otherwise, here is my journey 😅 I could pinpoint some code that raises the error 100% of the time:
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from scipy.interpolate import RBFInterpolator
>>> from scipy.stats.qmc import Halton
>>> rng = np.random.default_rng()
>>> xobs = 2*Halton(2, seed=rng).random(100) - 1
>>> yobs = np.sum(xobs, axis=1)*np.exp(-6*np.sum(xobs**2, axis=1))
>>> xgrid = np.mgrid[-1:1:50j, -1:1:50j]
>>> xflat = xgrid.reshape(2, -1).T
>>> yflat = RBFInterpolator(xobs, yobs)(xflat)
>>> ygrid = yflat.reshape(50, 50)
>>> fig, ax = plt.subplots()
>>> ax.pcolormesh(*xgrid, ygrid, vmin=-0.25, vmax=0.25, shading='gouraud')
>>> p = ax.scatter(*xobs.T, c=yobs, s=50, ec='k', vmin=-0.25, vmax=0.25)
>>> fig.colorbar(p)
>>> plt.show()
And the culprit here is the line with The "funny" thing is that using
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> n_grid = 100
>>> u = np.linspace(0, np.pi, n_grid)
>>> v = np.linspace(0, 2 * np.pi, n_grid)
>>> u_grid, v_grid = np.meshgrid(u, v)
>>> vertices = np.stack([np.cos(v_grid) * np.sin(u_grid),
... np.sin(v_grid) * np.sin(u_grid),
... np.cos(u_grid)],
... axis=2)
>>> x = np.outer(np.cos(v), np.sin(u))
>>> y = np.outer(np.sin(v), np.sin(u))
>>> z = np.outer(np.ones_like(u), np.cos(u))
>>> fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(9, 4),
... subplot_kw={"projection": "3d"})
>>> axes[0].plot_surface(x, y, z, rstride=1, cstride=1,
... linewidth=0)
This is failing with So I thought about using the example from Matplotlib's doc around Hence, my conclusion for now is that it could be a memory limitation, which could also explain the failure with Do you have any advice then on how I can let it eat up the memory it needs?? |
The other thing we have noted is that the cross-referencing parts of Sphinx are really memory hungry (from memory, something like 1G(!) per process). Have you watched memory usage while it is running? |
I can try a real memory profiler (would be a good excuse to try memray I suppose 😅), but from top all processes are around 200 MB. If I use Maybe there is a sudden peak that I am not seeing and it crashes suddenly. |
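Since a sampling tool like top can miss a short spike, one option is to read the high-water mark the kernel keeps for each process. The sketch below is an assumption-laden illustration, not something used in the thread: `peak_rss_mib` is a hypothetical helper, it relies on the Unix-only `resource` module, and it would have to be called from inside each worker (for example just before it returns) to be useful.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    # Allocate something noticeable, then report the high-water mark.
    payload = bytearray(50 * 1024 * 1024)
    print(f"peak RSS: {peak_rss_mib():.1f} MiB")
```

If a worker is killed outright it never reaches the reporting call, so this only helps to rule memory in or out for the workers that survive.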
EDIT: I am not entirely sure what I suggested may help you actually. If you can modify the sources of the Sphinx code you are using, you can technically print the exact exception as follows:
# In sphinx/util/parallel.py (ParallelTasks class)
def _process(self, pipe: Any, func: Callable, arg: Any) -> None:
    try:
        collector = logging.LogCollector()
        with collector.collect():
            if arg is None:
                ret = func()
            else:
                ret = func(arg)
        failed = False
    except BaseException as err:
        print(traceback.format_exc())  # add this line
        failed = True
        errmsg = traceback.format_exception_only(err.__class__, err)[0].strip()
        ret = (errmsg, traceback.format_exc())
    logging.convert_serializable(collector.logs)
    pipe.send((failed, collector.logs, ret))
This is the location where any exception should technically be caught. Also, the traceback being shown with
|
It seems like no exception is even being raised. Adding that does not do anything. |
I see, then is really the |
I also cannot inspect after |
Wait, how can the process halt before this? According to your original message, the exception occurred when doing |
Yeah I am not sure either how this all works. Putting a print at the top of this function before the try block prints something and if I put another print after the block, nothing is printed. |
If you can do it, try to debug each line step by step using a debugger. Otherwise, let's do the old way: print 1, 2, 3 etc after each line and check which line fails. |
It's calling |
Then it's good to see that it's crashing there at least! But how come the except-block is not triggered in that case? Check that |
It's not even doing a simple print after the block. It's not an exception being raised, it seems like the process is just hard killed |
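To illustrate why a hard-killed worker shows up in the parent as a bare `EOFError` and never reaches the `except` block, here is a standalone sketch. It is not Sphinx's code: `_worker` and `observe_dead_worker` are hypothetical names, the crash is simulated with `os._exit`, and the `fork` start method (POSIX only) is assumed.

```python
import multiprocessing
import os


def _worker(conn):
    # Simulate a hard crash: leave immediately, without sending a result
    # and without running any except/finally blocks.
    os._exit(1)


def observe_dead_worker():
    """Return what the parent sees when the worker dies before sending."""
    ctx = multiprocessing.get_context("fork")
    recv_end, send_end = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_worker, args=(send_end,))
    proc.start()
    # The parent must drop its own copy of the write end; otherwise the
    # pipe never reports end-of-file and recv() would block forever.
    send_end.close()
    try:
        recv_end.recv()
        outcome = "received"
    except EOFError:
        outcome = "EOFError"
    proc.join()
    return outcome, proc.exitcode


if __name__ == "__main__":
    print(observe_dead_worker())
```

The parent only learns that the pipe closed; the reason is carried by the exit code (negative for a signal), which is why a crash handler in the worker is needed to find the actual cause.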
|
When it crashes it does not print more than the EOFError thing. So far I was building with the makefile. I will see to directly call Python myself and then I can also use a proper debugger. |
Running with
Seems like |
@picnixz what should we do here? |
Well.. I thought that you actually opened an issue on matplotlib's repo since its their directive that appears to be causing the issue (here we are only responsible for Sphinx core development and not extensions issues, so I assumed that my role was done). If the issue is also on our side, then we would work on it but for now I would first ask MPL developers. |
Fair enough 😅 although one could argue that Sphinx itself should be resilient to "mis-behaving" extensions and not crash without at least an error saying where it's coming from. (I suppose it could be dealt with in #11451 too.) @ksunden @tacaswell are you already looking at it from your perspective or shall I open an issue? |
The problem is in SciPy (or its dependencies); it is crashing Python in an unrecoverable manner. There is nothing the plot directive extension could do to catch this, just as there is nothing regular Sphinx could do if just importing a module crashed the interpreter in a serial build. |
@QuLogic I slightly disagree here. The extension is working in serial manner but not in parallel on macOS and on Ubuntu all is working. The way the extension is executing the example makes it very hard to know which example failed. I don't know what sort of error in SciPy there might be as there just seems to be a memory limitation with the way examples are run. Also it's not related to SciPy, if you read above I could reproduce with a very simple example with just NumPy. |
I just started hitting this with the numpy docs on a PR that made some docs changes to numpy. |
Here's the traceback I'm getting from faulthandler, which terminates in IPython (!!)
|
Apparently my IPython SQLite history DB was corrupted? Anyway, doing Maybe make it a little easier to run sphinx under faulthandler? I had to hack my makefile to get that to work. |
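One way to avoid editing a Makefile is the `PYTHONFAULTHANDLER` environment variable, which Python reads at startup and which is inherited by child processes. The sketch below only demonstrates that mechanism; `faulthandler_enabled_in_child` is a hypothetical helper, and whether a given docs build passes the environment through unchanged is an assumption to verify.

```python
import os
import subprocess
import sys


def faulthandler_enabled_in_child(value="1"):
    """Start a child interpreter with PYTHONFAULTHANDLER set and ask it
    whether faulthandler is active."""
    env = dict(os.environ)
    env["PYTHONFAULTHANDLER"] = value
    result = subprocess.run(
        [sys.executable, "-c",
         "import faulthandler; print(faulthandler.is_enabled())"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"


if __name__ == "__main__":
    print(faulthandler_enabled_in_child())
```

Exporting the variable in the shell before invoking the build should have the same effect for every Python process the build starts, including forked workers.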
Ah the issue is that it's full-stop not legal to use sqlite3 across forks and multiprocessing uses fork. |
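A common way to stay clear of this restriction is to open the connection lazily in whichever process needs it, instead of inheriting one opened before the fork. The following is a generic sketch of that pattern, not IPython's or Sphinx's code; `get_connection` and the `_connections` cache are hypothetical names.

```python
import os
import sqlite3

# Cache keyed by (pid, path): a forked child has a different pid, so it
# never reuses a connection object that was opened by its parent.
_connections = {}


def get_connection(path):
    """Return a SQLite connection that was opened by the current process."""
    key = (os.getpid(), path)
    conn = _connections.get(key)
    if conn is None:
        conn = sqlite3.connect(path)
        _connections[key] = conn
    return conn


if __name__ == "__main__":
    conn = get_connection(":memory:")
    print(conn.execute("SELECT 1").fetchone())
```

The parent's entry still exists in the child's copy of the dictionary after a fork, but it is never looked up there because the pid in the key differs.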
Honestly, it's extremely difficult for us to exactly know what to do. Best shot is to have a troubleshooting page where we summarize what people tried to do and how they fixed the bugs (or a pinned issue maybe).
That is something we can do however. Can you share your Makefile please? |
Here's the diff for the numpy makefile:
I needed to make it so I could call the One thing that would help is if it were possible to run |
As noted above the |
Describe the bug
When building SciPy, I am now getting an EOFError on macOS (could reproduce on M1 and M2) with -j >1.
I am not sure what happened. I tried a few Python versions and a few versions of the doc dependencies and I always get the parallelism error. I am now thinking that maybe an update of macOS itself could have broken something.
The issue is not present on Ubuntu.
Traceback
How to Reproduce
This can be hard to do... We have a few guides to build SciPy and the doc itself https://scipy.github.io/devdocs/building/index.html#building-from-source
If you do that, use conda/mamba as it's almost straightforward. And don't even try on Windows as it can be very hard to do. It "should" just be:
Once you have a build, then to build the doc:
python dev.py doc -j x
Environment Information
Sphinx extensions
Additional context
No response