
WIP: Grudge array context #28

Draft
nchristensen wants to merge 313 commits into main

Conversation


@nchristensen commented Nov 10, 2020

@nchristensen marked this pull request as draft November 10, 2020 11:49
@nchristensen (Author)

Somehow these three tests cause the test workers to crash in CI. Any ideas what is happening here, @inducer? The tests appear to run fine on Koelsch.

```
=================================== FAILURES ===================================
_____________________________ test/test_grudge.py ______________________________
[gw3] linux -- Python 3.8.6 /home/runner/work/grudge/grudge/.miniforge3/envs/testing/bin/python3
worker 'gw3' crashed while running "test/test_grudge.py::test_mass_surface_area[<array context factory for <pyopencl.Device 'pthread-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz' on 'Portable Computing Language'>-box3d]"
_____________________________ test/test_grudge.py ______________________________
[gw2] linux -- Python 3.8.6 /home/runner/work/grudge/grudge/.miniforge3/envs/testing/bin/python3
worker 'gw2' crashed while running "test/test_grudge.py::test_tri_diff_mat[<array context factory for <pyopencl.Device 'pthread-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz' on 'Portable Computing Language'>-3]"
_____________________________ test/test_grudge.py ______________________________
[gw1] linux -- Python 3.8.6 /home/runner/work/grudge/grudge/.miniforge3/envs/testing/bin/python3
worker 'gw1' crashed while running "test/test_grudge.py::test_convergence_maxwell[<array context factory for <pyopencl.Device 'pthread-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz' on 'Portable Computing Language'>-4]"
```
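
For reference, here is a minimal sketch (not part of this PR) of re-running one of the crashing tests in a single process, with the xdist workers disabled, so a bus error or abort surfaces directly instead of being reported as a crashed worker. It assumes pytest-xdist is installed under its usual plugin name `xdist`; the test id is taken from the log above.

```python
# Minimal sketch: run one of the crashing tests in-process, with xdist
# disabled, so the failure is not hidden inside a crashed worker.
import pytest

pytest.main([
    "-x",                                      # stop at the first failure
    "-p", "no:xdist",                          # disable the xdist plugin
    "test/test_grudge.py::test_tri_diff_mat",  # test id from the CI log above
])
```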

@nchristensen (Author)

It may be a memory issue. If I set `ulimit -S -v 1000000`, it fails in a similar manner on Koelsch.
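
As a sketch, the same soft virtual-memory cap can also be applied from inside Python via the standard `resource` module (`ulimit -v` takes KiB, `RLIMIT_AS` takes bytes), in case that is easier to script than a shell wrapper:

```python
# Sketch: emulate `ulimit -S -v 1000000` (soft limit, in KiB) in-process.
import resource

soft_kib = 1_000_000                     # same value as the ulimit call above
_, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (soft_kib * 1024, hard))  # in bytes
```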

@nchristensen (Author)

> It may be a memory issue. If I set `ulimit -S -v 1000000`, it fails in a similar manner on Koelsch.

As it turns out, all of the tests fail with this ulimit, so this isn't a good way to replicate the problem. Is there a way to run interactively in the CI environment, @inducer?

@nchristensen (Author) commented Jan 15, 2021

I may have just stumbled on the problem. Let me check more.

@nchristensen (Author) commented Jan 18, 2021

It appears to be a problem with PoCL in particular. The old AMD CPU backend works fine, but PoCL gives a bus error and abandons the test.

Edit: The Intel OpenCL implementation fails on the same tests but does not terminate the test run after the first failure.
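
For comparing backends, here is a quick sketch of listing which OpenCL implementations pyopencl can see on a given machine (PoCL, the Intel runtime, the old AMD CPU runtime, ...); the tests themselves still pick their device through the array context factories:

```python
# Sketch: enumerate the installed OpenCL platforms and their devices.
import pyopencl as cl

for platform in cl.get_platforms():
    for dev in platform.get_devices():
        print(f"{platform.name!r}: {dev.name!r}")
```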

@nchristensen (Author) commented Jan 18, 2021

> It appears to be a problem with PoCL in particular. The old AMD CPU backend works fine, but PoCL gives a bus error and abandons the test.
>
> Edit: The Intel OpenCL implementation fails on the same tests but does not terminate the test run after the first failure.

The PoCL and Intel failures seemed to be due to over-allocation of local memory; apparently local and private memory are the same in these backends.
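
As a hedged illustration of that, the per-device local-memory limit (the quantity the over-allocated temporaries would have exceeded) can be queried directly with pyopencl; on CPU implementations the reported `local_mem_type` also hints at whether "local" memory is a separate scratchpad at all:

```python
# Sketch: inspect local-memory size and type per device. On CPU backends the
# reported local memory is typically not a separate fast scratchpad, which is
# consistent with local and private memory sharing the same storage there.
import pyopencl as cl

for platform in cl.get_platforms():
    for dev in platform.get_devices():
        mem_type = cl.device_local_mem_type.to_string(dev.local_mem_type)
        print(f"{dev.name}: local_mem_size={dev.local_mem_size} B, "
              f"local_mem_type={mem_type}")
```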

Unfortunately tests still break in CI but for different reasons than before.

@nchristensen (Author)

We may want to consider making a GPU available for CI.

Base automatically changed from master to main March 8, 2021 05:07