
Don't override terminal observation when using AutoResetWrapper #69

Merged
9 commits merged into HumanCompatibleAI:master on Jan 20, 2023

Conversation

@PavelCz (Member) commented Dec 28, 2022

In general, the total number of observations in an episode is always one more than the number of transitions/actions, because there is always a final next_state observation that the agent does not act on. The AutoResetWrapper effectively combines several episodes of the underlying environment into a single continuing episode.
gym does not anticipate this use case: it returns only a single observation per transition, in addition to the very first observation, which is passed to the agent and generally obtained by calling .reset() after an episode is done. Consequently, if we combine n episodes, the question remains how to handle these n-1 extra observations. This PR modifies the AutoResetWrapper to provide two modes with the following behavior.

  • Ignore the terminal observation (the behavior of AutoResetWrapper prior to this PR):
    • When the environment is reset, we simply ignore the terminal observation, save it in the info dict for reference, and return the next observation, which is the first obs of the next episode (the obs returned by reset).
    • One consequence of this is that the final transition of an episode is one where the reset of the env is visible in the change from obs to next_obs. This might leak some information to a reward model, especially if the reward is generally provided at the end of the episode (as in coinrun), although prematurely ending an episode with no reward, or with a different one, should somewhat lessen this effect.
  • Pad the trajectory (the new behavior): append an additional transition that happens after the original terminal transition and is dedicated to switching from the end of the previous episode to the start of the next one.
    • To clarify what I mean by this: next_state is the value returned by step. Generally in gym, we override the final next_state when we call reset. With this behavior, we don't override anything; instead, we return the final obs as usual and then have an additional timestep that ends in returning the initial obs of the next episode.
    • For this added timestep I decided to simply return an empty info dict and a reward of 0.
    • This added transition will still noticeably contain observations with information about the reset of the env. However, this particular transition will never contain meaningful information about the reward of the wrapped environment.

I chose to make the latter behavior the default, since it presumably leaks less information.

This PR also has a test case for this new behavior.
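
To make the two modes concrete, here is a minimal sketch of the step logic, written against the classic gym 4-tuple step API. The class and argument names (AutoResetSketch, discard_terminal_observation, reset_reward) are illustrative placeholders, not the actual seals implementation.

```python
import gym


class AutoResetSketch(gym.Wrapper):
    """Illustrative sketch of the two auto-reset modes (not the seals code)."""

    def __init__(self, env, discard_terminal_observation=True, reset_reward=0.0):
        super().__init__(env)
        self.discard_terminal_observation = discard_terminal_observation
        self.reset_reward = reset_reward  # reward returned on the padding step
        self._previous_done = False  # did the previous step end an episode?

    def reset(self, **kwargs):
        self._previous_done = False
        return self.env.reset(**kwargs)

    def step(self, action):
        if self.discard_terminal_observation:
            # Old behavior: stash the terminal obs in info and reset immediately,
            # so the returned obs is already the first obs of the next episode.
            obs, rew, done, info = self.env.step(action)
            if done:
                info["terminal_observation"] = obs
                obs = self.env.reset()
            return obs, rew, False, info

        # New behavior: return the terminal obs unchanged, then spend one extra
        # padding step on the reset itself (the action is ignored on that step).
        if self._previous_done:
            self._previous_done = False
            return self.env.reset(), self.reset_reward, False, {}
        obs, rew, done, info = self.env.step(action)
        self._previous_done = done
        return obs, rew, False, info
```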

@PavelCz (Member, Author) commented Dec 28, 2022

Also pinging @dfilan, since he is using AutoResetWrapper and might have some thoughts.

@codecov bot commented Dec 28, 2022

Codecov Report

Merging #69 (b669b1f) into master (dff53ee) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master       #69   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           26        26           
  Lines         1047      1084   +37     
=========================================
+ Hits          1047      1084   +37     
Impacted Files Coverage Δ
src/seals/util.py 100.00% <100.00%> (ø)
tests/test_wrappers.py 100.00% <100.00%> (ø)


@PavelCz changed the title from "Keeping terminal observation in AutoResetWrapper" to "Don't override terminal observation when using AutoResetWrapper" on Dec 28, 2022
@dfilan requested a review from @Rocamonde on January 2, 2023
@dfilan (Collaborator) commented Jan 2, 2023

(Requested a review from a somewhat random person, sorry if I picked the wrong one)

@Rocamonde (Member) commented:

Since you're changing the default behavior, this is a breaking change in the API, so I'm thinking we might want to bump the version to 0.2.x. What do you think, @AdamGleave?

@Rocamonde (Member) commented:

The code does seem to do what you intend it to do; I'm just a bit confused about the mathematical reasoning behind this.

The default behavior of ignoring the terminal observation and replacing it with a reset seems bad, since it introduces a trajectory fragment in which the action that in reality leads to the terminal state appears to lead to a uniformly sampled initial state, which is also non-physical. But in your modification, ignoring the action the user takes conditioned on the terminal observation effectively does the same thing (introduces a fake transition), with the only difference that the terminal observation is still present in the outer trajectory data.

I guess an argument in favor of the modification is that the agent's behavior in transitions beyond termination should not be relied upon in any case, so it doesn't matter if the algorithm doesn't learn anything sensible beyond terminal observations; but for trajectory fragments that are part of the POMDP, you don't want to exclude important information.

You're also manually injecting a reward of zero, which seems fine for many cases, but it's probably not OK to assume that all reward functions are non-negative.

Please, do correct me if I'm misunderstanding anything!

@PavelCz (Member, Author) commented Jan 13, 2023

Yeah, that's right, and those problems exist; I'm just not sure there is any way around them.
In essence, if we use the AutoResetWrapper + TimeLimit wrapper, then since we override the done signal, the wrapped environment is an env with a different definition of what an 'episode' is. I think the wrapped and the unwrapped environment should basically be seen as two different environments.

As a workaround for the problem of always returning reward 0, I could add an optional field: the fixed reward that gets returned for the terminal observation, defaulting to 0.
We could even go so far as to have an optional alternative behavior, such as returning the previous reward in that situation instead of a fixed reward, though that seems a little like overkill.
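
For example, usage might look like the following; make_env and the keyword names are hypothetical placeholders, assuming the field becomes a constructor argument.

```python
# Hypothetical usage; make_env and the argument names are placeholders.
env = AutoResetWrapper(
    make_env(),
    discard_terminal_observation=False,
    reset_reward=-1.0,  # fixed reward returned on the padding step instead of 0
)
```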

@PavelCz (Member, Author) commented Jan 13, 2023

BTW, I somewhat prefer the default behavior suggested in this PR, but it would also be fine to change it so the default stays as it was before. Then we wouldn't have to make a breaking change now and could still introduce it at a later point if desired.

@Rocamonde (Member) commented:

> As a workaround for the problem of always returning reward 0, I could add an optional field: the fixed reward that gets returned for the terminal observation, defaulting to 0.

I think this should work.

> We could even go so far as to have an optional alternative behavior, such as returning the previous reward in that situation instead of a fixed reward, though that seems a little like overkill.

This would be a nice option in principle, but since I'm not using this feature myself, I'm not sure how useful it would be in practice. It might be worth adding if the change is simple to make and keeps the API clean, but otherwise your previous suggestion is probably fine.

> BTW, I somewhat prefer the default behavior suggested in this PR, but it would also be fine to change it so the default stays as it was before. Then we wouldn't have to make a breaking change now and could still introduce it at a later point if desired.

I think this would be ideal, so we can get this merged right away. Otherwise we'd have to run checks on imitation, etc., and make a separate decision.

@Rocamonde (Member) commented:

Let me know if you want help implementing this, even though I think it should be pretty straightforward. Happy to make a final review once that's done.

@Rocamonde (Member) left a review:

LGTM

@Rocamonde (Member) commented:

The linter is currently failing.

@PavelCz (Member, Author) commented Jan 18, 2023

Hm, code_checks (which includes make html) works for me locally. Also, I don't see what we would have changed since last week that would affect this.

@Rocamonde (Member) commented:

It could be a package issue. Could you check whether you have the same versions of everything installed?

@PavelCz (Member, Author) commented Jan 18, 2023

OK, re-running build_venv is not enough; I have to recreate it completely from scratch. Now I can reproduce this error.

Using export SPHINXOPTS="-v" I got this error (plus some other exception, but I think this is the important part); I'm still not sure what causes it.

reading sources... [ 28%] common/testing

Traceback (most recent call last):
  File "/home/pavel/code/chai/seals/venv/lib/python3.8/site-packages/sphinx/events.py", line 96, in emit
    results.append(listener.handler(self.app, *args))
  File "/home/pavel/code/chai/seals/venv/lib/python3.8/site-packages/sphinx_autodoc_typehints/__init__.py", line 539, in process_docstring
    _inject_types_to_docstring(type_hints, signature, original_obj, app, what, name, lines)
  File "/home/pavel/code/chai/seals/venv/lib/python3.8/site-packages/sphinx_autodoc_typehints/__init__.py", line 601, in _inject_types_to_docstring
    _inject_rtype(type_hints, original_obj, app, what, name, lines)
  File "/home/pavel/code/chai/seals/venv/lib/python3.8/site-packages/sphinx_autodoc_typehints/__init__.py", line 747, in _inject_rtype
    r = get_insert_index(app, lines)
  File "/home/pavel/code/chai/seals/venv/lib/python3.8/site-packages/sphinx_autodoc_typehints/__init__.py", line 725, in get_insert_index
    at = line_before_node(doc.children[idx])
  File "/home/pavel/code/chai/seals/venv/lib/python3.8/site-packages/sphinx_autodoc_typehints/__init__.py", line 664, in line_before_node
    assert line
AssertionError

Will look more into this tomorrow.

@PavelCz (Member, Author) commented Jan 20, 2023

OK, this seems to have been a bug introduced in sphinx-autodoc-typehints version 1.21.4 (see issue), which was fixed in version 1.21.5.
I have opted to pin the version to >= 1.21.5.
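
For reference, a sketch of how such a pin could look in the docs requirements of setup.py; the variable name and the other entries are assumptions, and only the version pin itself comes from the comment above.

```python
# Hypothetical excerpt from setup.py; only the version pin is taken from this PR.
DOCS_REQUIRE = [
    "sphinx",
    "sphinx-autodoc-typehints>=1.21.5",  # 1.21.4 raised AssertionError during docs builds
]
```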

@Rocamonde merged commit de29873 into HumanCompatibleAI:master on Jan 20, 2023