
Don't override terminal observation when using AutoResetWrapper #69

Merged
9 commits merged into HumanCompatibleAI:master on Jan 20, 2023

Conversation

@PavelCz (Member) commented Dec 28, 2022

In general, the total number of observations in an episode is always one more than the number of transitions/actions, because there is always a final next_state observation that the agent does not act on. The AutoResetWrapper effectively combines several episodes of the underlying environment into a single continuing episode.
gym does not anticipate this use case: it returns only a single observation per transition, in addition to the very first observation, which is passed to the agent and generally obtained by calling .reset() after an episode is done. Consequently, if we combine n episodes, the question remains how to handle these n-1 extra observations. This PR modifies the AutoResetWrapper to provide two modes with the following behavior.

  • Ignore the terminal observation (the behavior of AutoResetWrapper prior to this PR):
    • When the environment is reset, we simply ignore the terminal observation, save it in the info dict for reference, and return the next observation, which is the first obs of the next episode (the obs returned by reset).
    • One consequence of this is that the final transition of an episode is one where the reset of the env is visible in the change from obs to next_obs. This might leak some information to a reward model, especially if the reward is generally provided at the end of the episode (as in coinrun), although prematurely ending an episode with no reward, or with a different one, should somewhat lessen this effect.
  • Pad the trajectory (the new behavior): append an additional transition that happens after the original terminal transition and is dedicated to switching from the end of the previous episode to the start of the next one.
    • To clarify what I mean by this: next_state is the value returned by step. Generally in gym, we override the final next_state when we call reset. With this behavior, we don't override anything; instead, we return the final obs as usual and then have an additional timestep that ends in returning the initial obs of the next episode.
    • For this added timestep I decided to simply return an empty info dict and a reward of 0.
    • This added transition will still noticeably contain observations with information about the reset of the env. However, this particular transition will never contain meaningful information about the reward of the wrapped environment.

I chose to make the latter behavior the default, since it presumably leaks less information.

This PR also has a test case for this new behavior.
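
To make the two modes concrete, here is a minimal sketch of the step logic, written against the classic gym 4-tuple step API. The class and argument names (AutoResetSketch, discard_terminal_observation, reset_reward) are illustrative placeholders, not the actual seals implementation.

```python
import gym


class AutoResetSketch(gym.Wrapper):
    """Illustrative sketch of the two auto-reset modes (not the seals code)."""

    def __init__(self, env, discard_terminal_observation=True, reset_reward=0.0):
        super().__init__(env)
        self.discard_terminal_observation = discard_terminal_observation
        self.reset_reward = reset_reward  # reward returned on the padding step
        self._previous_done = False  # did the previous step end an episode?

    def reset(self, **kwargs):
        self._previous_done = False
        return self.env.reset(**kwargs)

    def step(self, action):
        if self.discard_terminal_observation:
            # Old behavior: stash the terminal obs in info and reset immediately,
            # so the returned obs is already the first obs of the next episode.
            obs, rew, done, info = self.env.step(action)
            if done:
                info["terminal_observation"] = obs
                obs = self.env.reset()
            return obs, rew, False, info

        # New behavior: return the terminal obs unchanged, then spend one extra
        # padding step on the reset itself (the action is ignored on that step).
        if self._previous_done:
            self._previous_done = False
            return self.env.reset(), self.reset_reward, False, {}
        obs, rew, done, info = self.env.step(action)
        self._previous_done = done
        return obs, rew, False, info
```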

@PavelCz (Member, Author) commented Dec 28, 2022

Also pinging @dfilan, since he is using AutoResetWrapper and might have some thoughts.

@codecov bot commented Dec 28, 2022

Codecov Report

Merging #69 (b669b1f) into master (dff53ee) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master       #69   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           26        26           
  Lines         1047      1084   +37     
=========================================
+ Hits          1047      1084   +37     
Impacted Files Coverage Δ
src/seals/util.py 100.00% <100.00%> (ø)
tests/test_wrappers.py 100.00% <100.00%> (ø)


@PavelCz changed the title from "Keeping terminal observation in AutoResetWrapper" to "Don't override terminal observation when using AutoResetWrapper" on Dec 28, 2022
@dfilan requested a review from @Rocamonde on January 2, 2023
@dfilan (Collaborator) commented Jan 2, 2023

(Requested a review from a somewhat random person, sorry if I picked the wrong one)

@Rocamonde (Member) commented:

Since you're changing the default behavior, this is a breaking change in the API, so I'm thinking we might want to bump the version to 0.2.x. What do you think, @AdamGleave?

@Rocamonde (Member) commented:

The code does seem to do what you intend it to do; I'm just a bit confused about the mathematical reasoning behind this.

The default behavior of ignoring the terminal observation and replacing it with a reset seems bad, since it introduces a trajectory fragment in which the action that in reality leads to the terminal state appears to lead to a uniformly sampled initial state, which is also non-physical. But in your modification, ignoring the action the user takes conditioned on the terminal observation effectively does the same thing (introduces a fake transition), with the only difference that the terminal observation is still present in the outer trajectory data.

I guess an argument in favor of the modification is that the agent's behavior in transitions beyond termination should not be relied upon in any case, so it doesn't matter if the algorithm doesn't learn anything sensible beyond terminal observations; but for trajectory fragments that are part of the POMDP, you don't want to exclude important information.

You're also manually injecting a reward of zero, which seems fine for many cases, but it's probably not OK to assume that all reward functions are non-negative.

Please, do correct me if I'm misunderstanding anything!

@PavelCz (Member, Author) commented Jan 13, 2023

Yeah, that's right, and those problems exist; I'm just not sure there is any way around them.
In essence, if we use the AutoResetWrapper + TimeLimit wrapper, then since we override the done signal, the wrapped environment is an env with a different definition of what an 'episode' is. I think the wrapped and the unwrapped environment should basically be seen as two different environments.

As a workaround for the problem of always returning reward 0, I could add an optional field: the fixed reward that gets returned for the terminal observation, defaulting to 0.
We could even go so far as to have an optional alternative behavior, such as returning the previous reward in that situation instead of a fixed reward, though that seems a little like overkill.
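
For example, usage might look like the following; make_env and the keyword names are hypothetical placeholders, assuming the field becomes a constructor argument.

```python
# Hypothetical usage; make_env and the argument names are placeholders.
env = AutoResetWrapper(
    make_env(),
    discard_terminal_observation=False,
    reset_reward=-1.0,  # fixed reward returned on the padding step instead of 0
)
```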

@PavelCz (Member, Author) commented Jan 13, 2023

BTW, I somewhat prefer the default behavior suggested in this PR, but it would also be fine to change it so the default stays as it was before. Then we wouldn't have to make a breaking change now and could still introduce it at a later point if desired.

@Rocamonde (Member) commented:

> As a workaround for the problem of always returning reward 0, I could add an optional field: the fixed reward that gets returned for the terminal observation, defaulting to 0.

I think this should work.

> We could even go so far as to have an optional alternative behavior, such as returning the previous reward in that situation instead of a fixed reward, though that seems a little like overkill.

This would be a nice option in principle, but since I'm not using this feature myself, I'm not sure how useful it would be in practice. It might be worth adding if the change is simple to make and keeps the API clean, but otherwise your previous suggestion is probably fine.

> BTW, I somewhat prefer the default behavior suggested in this PR, but it would also be fine to change it so the default stays as it was before. Then we wouldn't have to make a breaking change now and could still introduce it at a later point if desired.

I think this would be ideal, so we can get this merged right away. Otherwise we'd have to run checks on imitation, etc., and make a separate decision.

@Rocamonde (Member) commented:

Let me know if you want help implementing this, even though I think it should be pretty straightforward. Happy to make a final review once that's done.

@Rocamonde (Member) left a review:

LGTM

@Rocamonde (Member) commented:

The linter is currently failing.

@PavelCz (Member, Author) commented Jan 18, 2023

Hm, code_checks (which includes make html) works for me locally. Also, I don't see what we would have changed since last week that would affect this.

@Rocamonde (Member) commented:

It could be a package issue. Could you check whether you have the same versions of everything installed?

@PavelCz (Member, Author) commented Jan 18, 2023

OK, re-running build_venv is not enough; I have to recreate it completely from scratch. Now I can reproduce this error.

Using export SPHINXOPTS="-v" I got this error (plus some other exception, but I think this is the important part); I'm still not sure what causes it.

reading sources... [ 28%] common/testing

Traceback (most recent call last):
  File "/home/pavel/code/chai/seals/venv/lib/python3.8/site-packages/sphinx/events.py", line 96, in emit
    results.append(listener.handler(self.app, *args))
  File "/home/pavel/code/chai/seals/venv/lib/python3.8/site-packages/sphinx_autodoc_typehints/__init__.py", line 539, in process_docstring
    _inject_types_to_docstring(type_hints, signature, original_obj, app, what, name, lines)
  File "/home/pavel/code/chai/seals/venv/lib/python3.8/site-packages/sphinx_autodoc_typehints/__init__.py", line 601, in _inject_types_to_docstring
    _inject_rtype(type_hints, original_obj, app, what, name, lines)
  File "/home/pavel/code/chai/seals/venv/lib/python3.8/site-packages/sphinx_autodoc_typehints/__init__.py", line 747, in _inject_rtype
    r = get_insert_index(app, lines)
  File "/home/pavel/code/chai/seals/venv/lib/python3.8/site-packages/sphinx_autodoc_typehints/__init__.py", line 725, in get_insert_index
    at = line_before_node(doc.children[idx])
  File "/home/pavel/code/chai/seals/venv/lib/python3.8/site-packages/sphinx_autodoc_typehints/__init__.py", line 664, in line_before_node
    assert line
AssertionError

Will look more into this tomorrow.

@PavelCz (Member, Author) commented Jan 20, 2023

OK, this seems to have been a bug introduced in sphinx-autodoc-typehints version 1.21.4 (see issue), which was fixed in version 1.21.5.
I have opted to pin the version to >= 1.21.5.
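
For reference, a sketch of how such a pin could look in the docs requirements of setup.py; the variable name and the other entries are assumptions, and only the version pin itself comes from the comment above.

```python
# Hypothetical excerpt from setup.py; only the version pin is taken from this PR.
DOCS_REQUIRE = [
    "sphinx",
    "sphinx-autodoc-typehints>=1.21.5",  # 1.21.4 raised AssertionError during docs builds
]
```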

@Rocamonde merged commit de29873 into HumanCompatibleAI:master on Jan 20, 2023