Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added complete support for pathlib #1678

Closed
wants to merge 36 commits into from

Conversation

pioneerHitesh
Copy link
Contributor

@pioneerHitesh pioneerHitesh commented Oct 18, 2023

Added pathlib (path object) support for the entire dataset module

Closes #1616

@github-actions github-actions bot added the datasets Geospatial or benchmark datasets label Oct 18, 2023
@pioneerHitesh
Copy link
Contributor Author

@microsoft-github-policy-service agree

@microsoft-github-policy-service agree

torchgeo/datasets/agb_live_woody_density.py Outdated Show resolved Hide resolved
torchgeo/datasets/agb_live_woody_density.py Show resolved Hide resolved
torchgeo/datasets/geo.py Outdated Show resolved Hide resolved
@pioneerHitesh
Copy link
Contributor Author

pioneerHitesh commented Oct 18, 2023

@adamjstewart There is an urllib.error.HTTPError: HTTP Error 500: Internal Server Error occuring irrespective of path being str or os.PathLike . This issue occures when compiling from github and installing using pip3. It is occuring for AbovegroundLiveWoodyBiomassDensity which is a module from torchgeo.datasets.agb_live_woody_density

Here are the logs

Traceback (most recent call last):
  File "/home/hitesh/torchgeo/test.py", line 13, in <module>
    AgLWBD = AbovegroundLiveWoodyBiomassDensity(paths="/home/hitesh/test", download=True)
  File "/home/hitesh/torchgeo/torchgeo/datasets/agb_live_woody_density.py", line 91, in __init__
    self._verify()
  File "/home/hitesh/torchgeo/torchgeo/datasets/agb_live_woody_density.py", line 114, in _verify
    self._download()
  File "/home/hitesh/torchgeo/torchgeo/datasets/agb_live_woody_density.py", line 123, in _download
    download_url(self.url, self.paths, self.base_filename)
  File "/home/hitesh/torchgeo/env/lib/python3.10/site-packages/torchvision/datasets/utils.py", line 134, in download_url
    url = _get_redirect_url(url, max_hops=max_redirect_hops)
  File "/home/hitesh/torchgeo/env/lib/python3.10/site-packages/torchvision/datasets/utils.py", line 82, in _get_redirect_url
    with urllib.request.urlopen(urllib.request.Request(url, headers=headers)) as response:
  File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/usr/lib/python3.10/urllib/request.py", line 634, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.10/urllib/request.py", line 563, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: Internal Server Error

@adamjstewart
Copy link
Collaborator

@pioneerHitesh can you open a separate issue for the download problems? Maybe the URL changed.

@github-actions github-actions bot added the testing Continuous integration testing label Oct 18, 2023
torchgeo/datasets/agb_live_woody_density.py Outdated Show resolved Hide resolved
torchgeo/datasets/agb_live_woody_density.py Outdated Show resolved Hide resolved
torchgeo/datasets/geo.py Outdated Show resolved Hide resolved
torchgeo/datasets/utils.py Outdated Show resolved Hide resolved
torchgeo/datasets/utils.py Outdated Show resolved Hide resolved
@github-actions github-actions bot added the testing Continuous integration testing label Oct 21, 2023
@github-actions github-actions bot added the dependencies Packaging and dependencies label Oct 21, 2023
@pioneerHitesh
Copy link
Contributor Author

@adamjstewart any additional changes ?

Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, it seems like 99% of the time, we accept any Path as input, then immediately convert it to a str. I think this is still useful, but it's a lot of extra work to ensure valid types and cast everything. Is torchvision's download_url the only reason we need to cast to a str? I wonder if we could just update download_url to accept any Path as input. I'm willing to submit a PR to torchvision if you think that's worth it. We could even have a wrapper in TorchGeo that just converts the path to a str and calls the torchvision version. Let me know what you think of this idea.

Also, this PR is becoming massive. Is there any way we can split this up into multiple PRs to make it easier on both of us to update/review? Maybe 1 PR to remove os.path.expanduser, another PR to fix root={self.paths} to paths={self.paths!r}, another PR to add support to the datasets, and another PR to update the tests? Or would that be too much work for you?

assert isinstance(dataset.paths, str)
zipfile = os.path.join(dataset.paths, "eu_dem_v11_E30N10.zip")
shutil.unpack_archive(zipfile, dataset.paths, "zip")
assert check_instance_type(dataset.paths)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just assert that this is a pathlib.Path

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamjstewart We may not be able to do that because we have to assert for str,bytes , os.PathLike[str] and
os.PathLike[bytes].

tests/datasets/test_landcoverai.py Show resolved Hide resolved
tests/datasets/test_mapinwild.py Show resolved Hide resolved
torchgeo/datasets/agb_live_woody_density.py Show resolved Hide resolved
torchgeo/datasets/astergdem.py Show resolved Hide resolved
torchgeo/datasets/nlcd.py Show resolved Hide resolved
torchgeo/datasets/openbuildings.py Show resolved Hide resolved
torchgeo/datasets/utils.py Show resolved Hide resolved

import numpy as np
import rasterio
import torch
from torch import Tensor
from torchvision.datasets.utils import check_integrity, download_url
from torchvision.utils import draw_segmentation_masks
from typing_extensions import TypeAlias
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have a dep on typing_extensions right now. Let's just remove the TypeAlias usage, we don't actually need it.

@@ -737,3 +740,25 @@ def percentile_normalization(
(img - lower_percentile) / (upper_percentile - lower_percentile + 1e-5), 0, 1
)
return img_normalized


def check_instance_type(paths: Path | Iterable[Path]) -> bool:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this function:

  • The name is way too vague
  • The docstring doesn't tell me when it's True and when it's False
  • Do we really need this complex of a check?

It seems like we're really just checking if paths is a Path (True) or Iterable[Path] (False). Can't we just do:

return not isinstance(paths, Iterable)

? In that case, we wouldn't even need a function. Basically, if we can figure out a reasonable way to remove this function, I would be happier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamjstewart Unfortunately isinstance(paths, Iterable) will return True for bytes and str. We will have to find another way to eliminate this function. One way would be to have List[Path] instead of Iterable[Path].
Then we may use isinstance(paths,List).

@adamjstewart adamjstewart added this to the 0.6.0 milestone Oct 25, 2023
@pioneerHitesh
Copy link
Contributor Author

At the moment, it seems like 99% of the time, we accept any Path as input, then immediately convert it to a str. I think this is still useful, but it's a lot of extra work to ensure valid types and cast everything. Is torchvision's download_url the only reason we need to cast to a str? I wonder if we could just update download_url to accept any Path as input. I'm willing to submit a PR to torchvision if you think that's worth it. We could even have a wrapper in TorchGeo that just converts the path to a str and calls the torchvision version. Let me know what you think of this idea.

Also, this PR is becoming massive. Is there any way we can split this up into multiple PRs to make it easier on both of us to update/review? Maybe 1 PR to remove os.path.expanduser, another PR to fix root={self.paths} to paths={self.paths!r}, another PR to add support to the datasets, and another PR to update the tests? Or would that be too much work for you?

In addition to the torchvision's download_url , the os.path.join also needs str. If we can drop support for Path being of type bytes and pathlib.Path[bytes] then we do not need to cast to str. The problem with bytes or pathlib.Path[bytes] is if the second argument passed in the join method is a string it is unable to combine the paths. It also does not support merging of pathlib.Path[bytes] and bytes. Another issue i have discovered is pathlib.path does not support bytes. So we will have to drop support for atleast os.PathLike[bytes] and if feasible also bytes.

My suggestion would be that we add support for os.PathLike[str] in torchvision's download_url as that would benefit others as well and will eliminate the need for writing a wrapper function just to cast to str. I concur with you regarding the PR becoming unviable to update/review and will split the PR. However we cannot split the datasets and tests separately as that would lead to failing of tests instead we could split datasets and tests by limiting the number of datasets and tests files changed in a commit .

@adamjstewart
Copy link
Collaborator

However we cannot split the datasets and tests separately as that would lead to failing of tests instead we could split datasets and tests by limiting the number of datasets and tests files changed in a commit .

We could fix the datasets in one PR and then update the tests in a second PR after the first one is merged.

@pioneerHitesh
Copy link
Contributor Author

However we cannot split the datasets and tests separately as that would lead to failing of tests instead we could split datasets and tests by limiting the number of datasets and tests files changed in a commit .

We could fix the datasets in one PR and then update the tests in a second PR after the first one is merged.

Should I start implementing the changes?

@adamjstewart
Copy link
Collaborator

Yeah, let's start with PRs for os.path.expanduser and root={self.paths}, then once those are merged a PR for Path.

@pioneerHitesh
Copy link
Contributor Author

pioneerHitesh commented Oct 26, 2023

Creating a separate issue would be better? This PR already has 75 comments.

@adamjstewart
Copy link
Collaborator

We can either close this PR and start new PRs or mark this one as a draft and rebase once other PRs are merged. I have no preference.

@pioneerHitesh
Copy link
Contributor Author

We can either close this PR and start new PRs or mark this one as a draft and rebase once other PRs are merged. I have no preference.

I think we should convert this to draft.

@pioneerHitesh pioneerHitesh marked this pull request as draft October 26, 2023 18:09
@pioneerHitesh
Copy link
Contributor Author

Yeah, let's start with PRs for os.path.expanduser and root={self.paths}, then once those are merged a PR for Path.

should i rename root to path in __init__ methods?

@adamjstewart
Copy link
Collaborator

For NonGeoDatasets we want to keep using root. For GeoDatasets, we should use paths everywhere, including in error messages.

@pioneerHitesh
Copy link
Contributor Author

@adamjstewart please take a look at this PR.

@pioneerHitesh pioneerHitesh marked this pull request as ready for review October 27, 2023 10:59
@adamjstewart adamjstewart removed this from the 0.6.0 milestone Nov 18, 2023
@adamjstewart
Copy link
Collaborator

Closing in favor of #1731

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets dependencies Packaging and dependencies testing Continuous integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for pathlib
2 participants