Feature request: N to M restart capability #131

Open · MTCam opened this issue Jun 26, 2021 · 2 comments

@MTCam
Contributor

MTCam commented Jun 26, 2021

The MIRGE-Com simulation application would like an N-to-M restart capability (i.e. where simulations originally run on N processors may be restarted on M processors) so that the simulations can adapt to changing resource availability.

This is an important capability for running production-scale problems on lab-based machines, where resource availability is highly variable and often includes dedicated pushes during which we have temporary access to large portions of the machine.

The capability need not be fully integrated with the simulation code; it is OK if we need to run an "adapter" code in-between runs to serialize and then re-partition N-to-M.

@inducer
Owner

inducer commented Jun 26, 2021

I can see the need, and I think it makes sense to have this.

If we allow ourselves access to the global mesh object at repartitioning time, then I think this could be quite straightforward. This of course means that the global, pre-partitioning mesh needs to be saved (pickled, probably) by someone, perhaps rank zero of the original computation. If it is available, we simply compute a new partition, using the same, existing partitioning functionality that did the original partitioning. This will give us the new connectivity. Then all that's needed is to shuffle the per-element DOF data around, and we're done.
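A minimal sketch of this first path, assuming rank zero pickled the global mesh and that an element-to-element adjacency list can be derived from it (the `repartition` helper and the use of pymetis here are illustrative, not an existing meshmode API):

```python
# Sketch only: assumes rank 0 of the original run pickled the global,
# pre-partitioning mesh, and that `connectivity` (an element-to-element
# adjacency list) can be derived from it. pymetis stands in for whatever
# partitioner performed the original N-way split.
import pickle

import numpy as np
import pymetis


def repartition(global_mesh_path, connectivity, m):
    """Recompute an M-way partition of the saved global mesh."""
    with open(global_mesh_path, "rb") as f:
        global_mesh = pickle.load(f)

    # The same kind of partitioning call as the original run, now with M parts.
    _, part_per_element = pymetis.part_graph(m, adjacency=connectivity)
    return global_mesh, np.array(part_per_element)
```

Given `part_per_element`, the new local meshes and connectivity would come from the existing partitioning machinery; only the DOF shuffle is new.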

If we choose to not save the original global mesh, then the cell connectivity needs to be re-divined from the distributed data structure, then we need to do a distributed re-partitioning, then re-compute the local connectivity data, then shuffle the DOF data, and done. Each of these steps would be new code.
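For comparison, the first of those new steps might look roughly like this, assuming each rank can recover its elements' global numbers and their neighbors' global numbers from the distributed mesh (both input arrays are hypothetical):

```python
# Rough sketch of re-divining global connectivity without a saved global
# mesh. `global_elem_ids[i]` is the global number of local element i;
# `neighbor_global_ids[i]` lists its neighbors' global numbers. Both are
# assumed to be recoverable from the distributed data structure.
from mpi4py import MPI


def gather_global_adjacency(comm, global_elem_ids, neighbor_global_ids):
    local_part = {int(e): list(nbrs)
                  for e, nbrs in zip(global_elem_ids, neighbor_global_ids)}

    # Reassemble the global element adjacency on rank 0.
    gathered = comm.gather(local_part, root=0)
    if comm.rank == 0:
        adjacency = {}
        for part in gathered:
            adjacency.update(part)
        return adjacency
    return None
```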

Both ways are feasible, but they clearly differ in how much work is involved. If we can get away with the first option for now, I think this could be available reasonably soon.

cc @majosm as the person who might work on implementing this :)
cc @anderson2981 as the likely main customer, to help guide the discussion to something that's practically useful.

@MTCam
Contributor Author

MTCam commented Jun 27, 2021

> If we allow ourselves access to the global mesh object at repartitioning time, then I think this could be quite straightforward. This of course means that the global, pre-partitioning mesh needs to be saved (pickled, probably) by someone, perhaps rank zero of the original computation. If it is available, we simply compute a new partition, using the same, existing partitioning functionality that did the original partitioning. This will give us the new connectivity.

This is the preferred approach for meshes that do not change in time.

If we ever have moving parts (e.g., a burning surface that regresses, structures that deform, etc.), then we'll need to respect the time-advanced configurations of those meshes. It could be that in this paradigm we could represent the mesh changes in time with nodal fields that can be transferred along with the rest of the solution, which leads (moving meshes or not) to the crux of this feature request ...

> Then all that's needed is to shuffle the per-element DOF data around, and we're done.

This is the crux of N-to-M restart every time I've worked on it. The mitigating factor here is that for fixed meshes we have a direct, 1-to-1 element-to-element mapping. This means that knowledge of that element-to-element map alone is enough to work out which data should go where.
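To make that concrete, here is a toy illustration (not project code): for a fixed mesh, the old and new part-per-element arrays alone determine every element's move.

```python
# For a fixed mesh, the N-rank and M-rank part-per-element arrays fully
# determine which rank must send each element's DOF data where.
import numpy as np


def element_moves(old_part, new_part):
    """Yield (global_element, old_rank, new_rank) for elements that move."""
    for e, (src, dst) in enumerate(zip(old_part, new_part)):
        if src != dst:
            yield e, int(src), int(dst)


# Toy usage: six elements going from three ranks to two.
old = np.array([0, 0, 1, 1, 2, 2])
new = np.array([0, 0, 0, 1, 1, 1])
print(list(element_moves(old, new)))
```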

I think that if we keep track of the decomp data through a decomp map file written at partitioning time, then we can both write a relatively simple utility to serialize the simulation restart output into a single, serial file, and roll the integrated, inline N-to-M restart capability that you described.
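One possible shape for such a decomp map file, assuming it is just a single global array mapping element number to owning rank (the npz layout and filename are invented for illustration):

```python
# Hypothetical decomp map written at partitioning time: one array mapping
# global element number -> owning rank. Layout and filename are
# illustrative, not an existing convention.
import numpy as np


def write_decomp_map(part_per_element, nranks, path="decomp_map.npz"):
    np.savez(path, part_per_element=np.asarray(part_per_element),
             nranks=nranks)


def read_decomp_map(path="decomp_map.npz"):
    data = np.load(path)
    return data["part_per_element"], int(data["nranks"])
```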

With such decomp knowledge, we can easily serialize the simulation output in an intermediate step and then partition the solution data along with the mesh at startup time, as part of our nominal startup procedure. Alternatively, to skip the intermediate processing and roll an integrated solution, one could use the decomp information for both the N and M decompositions to resolve the element-to-element mapping required to transfer the data inline, following essentially the first approach you outlined.
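A sketch of that inline variant under the same assumptions: with both decomp maps in hand, each rank ships its elements' data directly to the new owners over MPI, skipping the serial intermediate (`local_fields` and its dict layout are illustrative):

```python
# Sketch of the inline transfer: `local_fields` maps global element id ->
# per-element DOF data on this rank; `new_map[e]` is the new owning rank
# of global element e. All names are assumptions for illustration.
from mpi4py import MPI


def shuffle_dof_data(comm, new_map, local_fields):
    # Bucket this rank's elements by their new owning rank.
    sendbuf = [{} for _ in range(comm.size)]
    for gel, data in local_fields.items():
        sendbuf[new_map[gel]][gel] = data

    # All-to-all exchange; each rank receives exactly its new elements.
    received = comm.alltoall(sendbuf)
    new_local = {}
    for chunk in received:
        new_local.update(chunk)
    return new_local
```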

The intermediate processing approach (something like):

1. (serial mesh): Run on N (write decomp map, restart files)
2. (decomp mesh, soln) -> (serial mesh, soln)  # offline utility
3. (serial mesh, soln): Run on M (write decomp map, restart files)

from my perspective seems like a reasonable approach, with relatively few changes required to MIRGE-Com to support the main capability (i.e., taking a decomposed grid and solution to a serial grid and solution). Then MIRGE-Com only needs to be taught how to send the corresponding solution parts along with the mesh parts at startup/partition time.
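The offline utility in step 2 might then reduce to something like the following, assuming per-rank pickle restart files and a decomp map as sketched above (filenames, the field name, and the data layout are all assumptions):

```python
# Sketch of the offline adapter: gather per-rank restart data back into a
# single array in global element order. Assumes each rank stores its
# elements in ascending global-element order.
import pickle

import numpy as np


def serialize_solution(nranks, decomp_map, field="cv"):
    nel_global = len(decomp_map)
    serial = None
    for rank in range(nranks):
        with open(f"restart_rank{rank:04d}.pkl", "rb") as f:
            rank_data = pickle.load(f)
        owned = np.flatnonzero(decomp_map == rank)  # this rank's global ids
        local = rank_data[field]  # shape: (nel_local, ndofs_per_element)
        if serial is None:
            serial = np.empty((nel_global,) + local.shape[1:], local.dtype)
        serial[owned] = local
    return serial
```

Restarting on M ranks is then the nominal startup path: partition the serial mesh M ways and scatter the serialized solution according to the new decomp map.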
