
[torch.compile]: Enhanced Error Reporting and Performance Canary Mode #126644

Open
bhack opened this issue May 19, 2024 · 6 comments
Labels
feature (A request for a proper, new feature.) · oncall: pt2 · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@bhack
Contributor

bhack commented May 19, 2024

🚀 The feature, motivation and pitch

Background

Handling PyTorch compile issues and ensuring reproducibility on minimal isolated code is currently quite labor-intensive. This challenge impacts both:

  • Users and developers trying to isolate and reproduce errors.
  • Triagers or compiler team members working with third-party compiled code, especially for public OSS models.

The complexity increases significantly when compiling full models or chains of high-level functions. Often a single root-cause error is buried within a chain of errors, complicating error reporting and resolution.

Proposal

  1. Enhanced Error Isolation and Reporting:

    • Isolate Failed Function:
      Implement a mechanism to pinpoint exactly which function compilation failed in, so users can report the offending function without additional effort.
    • Record Fake Inputs:
      Automatically record fake inputs so errors can be reproduced without users having to recreate their full dataset setup. Developers and triagers can then replay the issue reliably with minimal setup.
  2. Performance Canary Mode (a rough sketch follows this list):

    • Store Baseline Info:
      Introduce a mode where running an uncompiled model stores baseline performance data (e.g., memory usage, speed) on disk.
    • Automatic Regression Detection:
      When running the compiled model, automatically compare current performance against the stored baseline and warn the user about regressions in memory usage or speed.
    • Simplified Reporting:
      When a performance regression is detected, provide a straightforward way for users to report it.
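
Neither mode exists in PyTorch today. As a minimal user-land sketch of the canary idea only, assuming wall-clock latency and peak CUDA memory as the two tracked metrics (the file path, function names, and tolerance below are all hypothetical):

```python
import json
import time

import torch

BASELINE_PATH = "perf_baseline.json"  # hypothetical on-disk location

def _measure(fn, *args, warmup=3, iters=10):
    # Time a callable and record peak CUDA memory where available.
    for _ in range(warmup):
        fn(*args)
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if use_cuda:
        torch.cuda.synchronize()
    latency = (time.perf_counter() - start) / iters
    peak_mem = torch.cuda.max_memory_allocated() if use_cuda else 0
    return {"latency_s": latency, "peak_mem_bytes": peak_mem}

def store_baseline(eager_fn, *args):
    # Run the uncompiled model and persist baseline numbers to disk.
    with open(BASELINE_PATH, "w") as f:
        json.dump(_measure(eager_fn, *args), f)

def check_against_baseline(compiled_fn, *args, tolerance=1.10):
    # Compare the compiled run to the stored baseline; warn on regressions.
    with open(BASELINE_PATH) as f:
        baseline = json.load(f)
    current = _measure(compiled_fn, *args)
    for key, base in baseline.items():
        if base and current[key] > base * tolerance:
            print(f"canary warning: {key} regressed {base:.4g} -> {current[key]:.4g}")
```

Usage would be `store_baseline(model, x)` once in eager mode, then `check_against_baseline(torch.compile(model), x)` after compiling; a built-in mode could do this transparently.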

Benefits

  • For Users/Developers:
    • Simplifies the process of isolating and reporting compile errors.
    • Enhances reproducibility by automatically recording necessary inputs.
  • For Triagers/Compiler Team:
    • Provides clearer insights into the specific functions causing issues.
    • Facilitates quicker diagnosis and resolution of performance regressions.

/cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang

Alternatives

No response

Additional context

No response

@xmfan added the "feature" label May 20, 2024
@ezyang
Contributor

ezyang commented May 21, 2024

Performance canary mode is a good idea; I often want baseline-versus-optimized comparisons.

@bhack
Contributor Author

bhack commented May 21, 2024

What about the first point instead?

@ezyang
Contributor

ezyang commented May 21, 2024

Want to serialize MetaTensorDesc from fakeification; the logical place is in structured_trace. Also a good idea, not too difficult. Failed function should work already: we have user stacks, we just need to report them.

@bhack
Contributor Author

bhack commented May 21, 2024

> Failed function should work already: we have user stacks, we just need to report them.

But triagers often still require a minimal repro, and producing one is a lot of work, especially for intermediate/leaf functions.
So when you have decorated/compiled a high-level function and something fails somewhere in the chain of compiled functions, we need a quick way to report the issue without having to share everything for reproducibility.

An additional point: a compile-deactivation decorator, so that while we wait on a ticket we could disable compilation of just the failing function instead of binary-searching over the full function chain.
That way we could keep running partially compiled working code, or open new tickets for failures further down the compile backtrace (a rough sketch of such a decorator follows).
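
For what it's worth, recent PyTorch releases already ship something close to this: `torch.compiler.disable` skips compilation of the decorated function and falls back to eager there, inserting a graph break. A minimal sketch of that usage (the function names below are illustrative):

```python
import torch

@torch.compiler.disable  # skip compiling this function; run it eagerly
def flaky_leaf(x):
    return x.sin() + x.cos()

@torch.compile
def model(x):
    y = x * 2
    return flaky_leaf(y)  # graph breaks here; the rest stays compiled

print(model(torch.randn(4)))
```

Whether this covers the full request (e.g., discovering which function to disable without a binary search) is a separate question.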

@ezyang
Contributor

ezyang commented May 29, 2024

Yeah, agreed. There is definitely stuff here, holistically, that we can do better.

@ezyang ezyang reopened this May 31, 2024
@zou3519 added the "triaged" label and removed the "triage review" label Jun 4, 2024
petrex pushed a commit to petrex/pytorch that referenced this issue Jun 5, 2024
This adds dumps of MetaTensorDesc and MetaStorageDesc to structured logs
when they are triggered from Dynamo.  The logs look like this:

```
V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:195] {"describe_storage": {"id": 0, "describer_id": 0, "size": 32}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:220] {"describe_tensor": {"id": 0, "ndim": 1, "dtype": "torch.float32", "device": "device(type='cpu')", "size": [8], "is_leaf": true, "stride": [1], "storage": 0, "view_func": "<built-in method _view_func_unsafe of Tensor object at 0x7f882959e840>", "describer_id": 0}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
V0522 08:13:25.268000 140224882566144 torch/_subclasses/meta_utils.py:1594] {"describe_source": {"describer_id": 0, "id": 0, "source": "L['x']"}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
```

The `describer_id` is used to disambiguate ids.  We expect it to be
unique per frame id, but if there is a bug it may not be.  Note that you will get
redundant dumps when evaluation restarts.

tlparse can use this to give a visualization of input tensors to a
model; you could also use this to generate example inputs to run graphs
on (a rough sketch follows this commit message).

Some care is taken to avoid re-dumping the tensor metadata multiple
times, which would otherwise happen because AOTAutograd re-fakeifies
everything after Dynamo to deal with metadata mutation.

Partially fixes pytorch#126644

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: pytorch#126879
Approved by: https://github.com/jamesjwu
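
As a rough illustration of the "generate example inputs" idea above, one could parse a `describe_tensor` payload and materialize a matching tensor with `torch.empty_strided`. The `rebuild_input` helper below is hypothetical, and the string parsing assumes the exact JSON shape shown in the logs:

```python
import json

import torch

# One structured-trace payload, shaped like the describe_tensor lines above.
line = ('{"describe_tensor": {"id": 0, "ndim": 1, "dtype": "torch.float32", '
        '"device": "device(type=\'cpu\')", "size": [8], "is_leaf": true, '
        '"stride": [1], "storage": 0, "describer_id": 0}}')

def rebuild_input(payload):
    # Hypothetical helper: turn a describe_tensor record into a real tensor
    # with matching shape, stride, dtype, and device (values are uninitialized).
    desc = payload["describe_tensor"]
    dtype = getattr(torch, desc["dtype"].split(".")[-1])  # "torch.float32" -> torch.float32
    device = desc["device"].split("'")[1]                 # "device(type='cpu')" -> "cpu"
    return torch.empty_strided(desc["size"], desc["stride"], dtype=dtype, device=device)

x = rebuild_input(json.loads(line))
print(x.shape, x.stride(), x.dtype, x.device)
```

A real tool (e.g., tlparse) would presumably walk all `describe_storage`/`describe_tensor`/`describe_source` records to reconstruct aliasing and source names as well.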
@bhack
Contributor Author

bhack commented Jun 11, 2024

See also
#128134 (comment)
