[torch.compile]: Enhanced Error Reporting and Performance Canary Mode #126644
Comments
Performance canary mode is a good idea; I often want information comparing a baseline against the optimized run.
What about the first point instead?
We want to serialize MetaTensorDesc from fakeification; the logical place for this is in structured_trace. Also a good idea, and not too difficult. Reporting the failed function should work already: we have user stacks, so we can just report it.
But triage often still requires a minimal repro, and that is a lot of work, especially for intermediate/leaf functions. An additional point: a compile-deactivation decorator would let us disable compilation of the failing function while the ticket we open is being resolved.
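The compile-deactivation decorator suggested above could be sketched roughly as follows. This is a hypothetical, framework-agnostic helper (the name `compile_with_fallback` is not an existing PyTorch API); with PyTorch, `compile_fn` would typically be `torch.compile`:

```python
import functools
import warnings


def compile_with_fallback(compile_fn):
    """Hypothetical decorator: run the compiled version of a function,
    and permanently fall back to eager execution if compilation fails."""
    def decorator(fn):
        state = {"compiled": None, "disabled": False}

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if state["disabled"]:
                return fn(*args, **kwargs)
            try:
                if state["compiled"] is None:
                    state["compiled"] = compile_fn(fn)
                return state["compiled"](*args, **kwargs)
            except Exception as exc:
                # Warn once, then stay on the eager path for all later calls.
                warnings.warn(
                    f"compilation of {fn.__name__} failed ({exc}); "
                    "falling back to eager execution"
                )
                state["disabled"] = True
                return fn(*args, **kwargs)
        return wrapper
    return decorator
```

The key design choice is that the fallback is sticky: after one failure, subsequent calls skip compilation entirely, so users are not penalized on every invocation while the issue is open.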
Yeah, agreed. There is definitely stuff here we can do better holistically.
This adds dumps of MetaTensorDesc and MetaStorageDesc to structured logs when they are triggered from Dynamo. The logs look like this:

```
V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:195] {"describe_storage": {"id": 0, "describer_id": 0, "size": 32}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:220] {"describe_tensor": {"id": 0, "ndim": 1, "dtype": "torch.float32", "device": "device(type='cpu')", "size": [8], "is_leaf": true, "stride": [1], "storage": 0, "view_func": "<built-in method _view_func_unsafe of Tensor object at 0x7f882959e840>", "describer_id": 0}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
V0522 08:13:25.268000 140224882566144 torch/_subclasses/meta_utils.py:1594] {"describe_source": {"describer_id": 0, "id": 0, "source": "L['x']"}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
```

The `describer_id` is used to disambiguate ids. We expect it to be unique per frame id, but if there is a bug it may not be. Note that you will get redundant dumps when evaluation restarts.

tlparse can use this to visualize the input tensors to a model; you could also use it to generate example inputs to run graphs on.

Some care is taken to avoid dumping the tensor metadata multiple times, which would otherwise happen because AOTAutograd refakeifies everything after Dynamo to deal with metadata mutation.

Partially fixes pytorch#126644

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: pytorch#126879
Approved by: https://github.com/jamesjwu
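A downstream tool like tlparse consumes these records by extracting the JSON payload from each line. A minimal sketch of such a parser, using only the field names visible in the sample output above (the regex for the log-line prefix is an assumption based on that sample, not part of any PyTorch API):

```python
import json
import re

# Matches lines like:
#   V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:220] {...}
# and captures the trailing JSON payload.
LOG_LINE_RE = re.compile(r"^\S+ \S+ \d+ \S+\] (?P<payload>\{.*\})$")


def parse_structured_line(line):
    """Extract the JSON payload from one structured-trace log line."""
    m = LOG_LINE_RE.match(line.strip())
    return json.loads(m.group("payload")) if m else None


def collect_tensor_descs(lines):
    """Gather describe_tensor records, keyed by (describer_id, id) so that
    ids from different describers do not collide."""
    tensors = {}
    for line in lines:
        rec = parse_structured_line(line)
        if rec and "describe_tensor" in rec:
            t = rec["describe_tensor"]
            tensors[(t["describer_id"], t["id"])] = t
    return tensors
```

From the collected `size`, `stride`, and `dtype` fields, one could then synthesize random example inputs to run a captured graph on, as the comment suggests.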
🚀 The feature, motivation and pitch
Background
Handling PyTorch compile issues and reproducing them on minimal, isolated code is currently quite labor-intensive. This challenge impacts both the users who report issues and the developers who triage them.
The complexity increases significantly when compiling full models or chains of high-level functions. Often, a single error is hidden within a chain of errors, complicating error reporting and resolution.
Proposal
Enhanced Error Isolation and Reporting:
Implement a mechanism to isolate exactly the function where compilation failed, so users can report the specific function causing the issue without additional effort.
Automatically record fake inputs to facilitate error reproduction, without requiring users to reproduce their full dataset setup. This ensures that developers and triagers can recreate the issue reliably with minimal setup.
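Recording fake inputs amounts to capturing only metadata (shape, dtype, device) at failure time, so a triager can rebuild random tensors of the same shape without the user's data. A minimal sketch of such a capture step (the helper name `describe_inputs` is hypothetical; it duck-types on `.shape`/`.dtype` so it works on any tensor-like object):

```python
import json


def describe_inputs(args):
    """Hypothetical helper: capture only the metadata of tensor-like inputs
    (shape, dtype, device), so a repro script can synthesize random tensors
    of the same shape instead of needing the user's dataset."""
    descs = []
    for a in args:
        if hasattr(a, "shape") and hasattr(a, "dtype"):
            descs.append({
                "kind": "tensor",
                "shape": list(a.shape),
                "dtype": str(a.dtype),
                "device": str(getattr(a, "device", "cpu")),
            })
        else:
            # Non-tensor arguments (ints, strings, ...) are recorded as-is.
            descs.append({"kind": "scalar", "value": repr(a)})
    return json.dumps(descs)
```

The JSON output could be attached to an issue directly; it contains no data values, only the metadata needed to rebuild inputs.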
Performance Canary Mode:
Introduce a mode where running an uncompiled model stores baseline performance data (e.g., memory usage, speed) on disk.
When running the compiled model, automatically compare current performance against the stored baseline. If there are regressions in memory usage or speed, users should be warned.
In case of performance regressions, provide an easy and straightforward way for users to report these issues.
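The canary workflow above (record a baseline, compare later runs against it, warn on regression) could be sketched as follows. This is a hedged, framework-agnostic sketch; the name `canary_run`, the JSON baseline file, and the tolerance factor are all illustrative assumptions, not an existing PyTorch API:

```python
import json
import time
import warnings
from pathlib import Path


def canary_run(fn, *args, baseline_path="canary_baseline.json",
               record_baseline=False, tolerance=1.10):
    """Hypothetical canary sketch: time one call of `fn`.

    With record_baseline=True (e.g. running the uncompiled model), the
    timing is stored on disk. Otherwise (e.g. running the compiled model),
    the timing is compared against the stored baseline, and a warning is
    emitted if the run is more than `tolerance`x slower.
    """
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start

    path = Path(baseline_path)
    if record_baseline:
        path.write_text(json.dumps({"seconds": elapsed}))
    elif path.exists():
        baseline = json.loads(path.read_text())["seconds"]
        if elapsed > baseline * tolerance:
            warnings.warn(
                f"performance regression: {elapsed:.4f}s vs "
                f"baseline {baseline:.4f}s"
            )
    return result
```

A production version would also record peak memory and use multiple warmed-up iterations rather than a single wall-clock timing, but the store/compare/warn structure would be the same.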
Benefits
/cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang
Alternatives
No response
Additional context
No response