Python 3 port uses Unicode to represent byte strings #537

Open
wgrant opened this issue Jun 17, 2015 · 5 comments

wgrant commented Jun 17, 2015

pygit2, when built for Python 3, treats paths as Unicode and will fail if a path isn't decodable as the filesystem encoding. But Git paths are byte strings, not Unicode strings. This includes refs, so repos with branch names containing non-UTF-8 sequences are completely unusable on most systems:

>>> repo.listall_references()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 14: invalid start byte

pygit2's behaviour under Python 2 is correct: listall_references and other APIs return byte strings, as they are in the Git model. I don't see why the behaviour should differ by Python version, since both types exist in both languages. It seems to me that the default low-level API should return byte strings, to match the underlying model and handle all cases, and that convenience wrappers returning Unicode strings could be added if people actually want them. As it stands, some perfectly valid Git repos are unusable except on Python 2.
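
The failure can be reproduced without pygit2. The following is a minimal sketch, assuming a hypothetical ref name `refs/heads/foo\x80` (not taken from any real repository). Strict UTF-8 decoding fails at the same byte offset as in the traceback above, while the byte string itself stays fully usable:

```python
# Hypothetical ref name containing a byte that is not valid UTF-8.
raw_ref = b"refs/heads/foo\x80"

try:
    raw_ref.decode("utf-8")  # strict decoding rejects the stray 0x80 byte
    error = None
except UnicodeDecodeError as exc:
    error = exc

# As bytes, the ref name still supports ordinary operations.
branch = raw_ref[len(b"refs/heads/"):]
```

Nothing is lost as long as the value stays a byte string; the error only appears when a text decoding is forced on it.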

wgrant commented Jun 18, 2015

The most reasonable solution I can think of that retains backward compatibility is to add encoding and errors properties to Repository, defaulting to the filesystem encoding and "strict" but overridable (even to None).
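
One way to picture this proposal is the sketch below. `RepoTextPolicy` and `to_python` are hypothetical names invented for illustration and are not part of pygit2. The class only models the suggested `encoding` and `errors` properties, with `None` meaning "return raw bytes":

```python
import sys


class RepoTextPolicy:
    """Hypothetical stand-in for the proposed Repository properties."""

    def __init__(self, encoding=sys.getfilesystemencoding(), errors="strict"):
        self.encoding = encoding  # None means "hand back raw bytes"
        self.errors = errors

    def to_python(self, raw: bytes):
        """Convert a byte string coming from Git according to the policy."""
        if self.encoding is None:
            return raw
        return raw.decode(self.encoding, self.errors)


raw_ref = b"refs/heads/foo\x80"  # hypothetical non-UTF-8 ref name

as_bytes = RepoTextPolicy(encoding=None).to_python(raw_ref)
as_text = RepoTextPolicy("utf-8", "surrogateescape").to_python(raw_ref)
```

With the defaults, existing callers would keep getting `str` and strict errors, so backward compatibility is preserved; callers that need to handle arbitrary repositories could opt in to bytes or to a lossless error handler.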

rralf commented Jul 27, 2018

Hi,

I can confirm this issue. This also happens for mis-encoded metadata in commits, like commit dcb71129841e5821c0cbbdd4017a6f202f180108 in the Linux kernel (look at the author name):

Reproduction:

import pygit2
repo = pygit2.Repository('.')
commit = repo['dcb71129841e5821c0cbbdd4017a6f202f180108']
commit.author.name

Raises:

In [7]: commit.author.name
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-7-e2f11dcffc49> in <module>()
----> 1 commit.author.name

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 10: invalid start byte

rralf commented Jul 27, 2018

My local workaround is to use the raw members of the classes and pass them through this helper:

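def fix_encoding(raw):
    """Decode raw bytes as UTF-8, falling back to ISO 8859-1."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # ISO 8859-1 maps every byte to a character, so this cannot fail
        return raw.decode('iso-8859-1')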
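
The same idea can be generalised to an ordered list of candidate encodings. This is only a sketch: `decode_with_fallbacks` is a hypothetical helper, the sample names are invented, and the raw member mentioned in the comment (`commit.author.raw_name`) is assumed to return `bytes`.

```python
def decode_with_fallbacks(raw: bytes, encodings=("utf-8", "latin-1")) -> str:
    """Try each encoding in order and return the first successful decode."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Reached only if the caller's list omits latin-1, which accepts any bytes.
    return raw.decode("utf-8", errors="replace")


# Hypothetical sample values; with pygit2 the input would come from a raw
# member such as commit.author.raw_name (assumed here to return bytes).
legacy_name = b"Bj\xf8rn"              # Latin-1 encoded
modern_name = "Bjørn".encode("utf-8")  # UTF-8 encoded

names = [decode_with_fallbacks(n) for n in (legacy_name, modern_name)]
```

Because Latin-1 never fails, a wrong guess produces mojibake silently rather than an error, so this is a display-oriented workaround and not a lossless conversion.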

futatuki commented Nov 6, 2022

At least for paths, PyUnicode_DecodeFSDefault() can be used to convert a C string into a Python str object, as in to_path() in src/utils.h (and I'd be happy if its error handler were "surrogateescape").

Also, it would be natural for the conversion of a path given as a Python bytes object into a Python str object, e.g. in Repository.__init__(), to be done with os.fsdecode().
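
For illustration, here is a small sketch of what `os.fsdecode()` and the "surrogateescape" handler buy on POSIX systems. The file name is hypothetical; the point is that undecodable bytes survive a round trip through `str`:

```python
import os

raw_path = b"docs/caf\xe9.txt"  # hypothetical Latin-1 file name, invalid as UTF-8

text_path = os.fsdecode(raw_path)    # bytes -> str using the filesystem encoding
round_trip = os.fsencode(text_path)  # str -> bytes, restoring the original bytes

# The same error handler can be named explicitly when the encoding is known.
explicit = raw_path.decode("utf-8", errors="surrogateescape")
```

On POSIX, `os.fsdecode()` applies "surrogateescape" itself, so no path is rejected; on Windows the handler differs, so this round trip should not be assumed there.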

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue May 19, 2024

# 1.15.0 (2024-05-18)

- Many deprecated features have been removed, see below

- Upgrade to libgit2 v1.8.1

- New `push_options` optional argument in `Repository.push(...)`
  [#1282](libgit2/pygit2#1282)

- New support for comparing `Oid` with text strings

- Fix `CheckoutNotify.IGNORED`
  [#1288](libgit2/pygit2#1288)

- Use default error handler when decoding/encoding paths
  [#537](libgit2/pygit2#537)

- Remove setuptools runtime dependency
  [#1281](libgit2/pygit2#1281)

- Coding style with ruff
  [#1280](libgit2/pygit2#1280)

- Add wheels for ppc64le
  [#1279](libgit2/pygit2#1279)

- Fix tests on EPEL8 builds for s390x
  [#1283](libgit2/pygit2#1283)

Deprecations:

- Deprecate `IndexEntry.hex`, use `str(IndexEntry.id)`

Breaking changes:

- Remove deprecated `oid.hex`, use `str(oid)`
- Remove deprecated `object.hex`, use `str(object.id)`
- Remove deprecated `object.oid`, use `object.id`

- Remove deprecated `Repository.add_submodule(...)`, use `Repository.submodules.add(...)`
- Remove deprecated `Repository.lookup_submodule(...)`, use `Repository.submodules[...]`
- Remove deprecated `Repository.init_submodules(...)`, use `Repository.submodules.init(...)`
- Remove deprecated `Repository.update_submodule(...)`, use `Repository.submodules.update(...)`

- Remove deprecated constants `GIT_OBJ_XXX`, use `ObjectType`
- Remove deprecated constants `GIT_REVPARSE_XXX`, use `RevSpecFlag`
- Remove deprecated constants `GIT_REF_XXX`, use `ReferenceType`
- Remove deprecated `ReferenceType.OID`, use instead `ReferenceType.DIRECT`
- Remove deprecated `ReferenceType.LISTALL`, use instead `ReferenceType.ALL`

- Remove deprecated support for passing dicts to repository's `merge(...)`,
  `merge_commits(...)` and `merge_trees(...)`. Instead pass `MergeFlag` for `flags`, and
  `MergeFileFlag` for `file_flags`.

- Remove deprecated support for passing a string for the favor argument to repository's
  `merge(...)`, `merge_commits(...)` and `merge_trees(...)`. Instead pass `MergeFavor`.

jdavid (Member) commented May 29, 2024

In the latest release we're using PyUnicode_DecodeFSDefault() and PyUnicode_EncodeFSDefault().
What remains is to use os.fsdecode() (PRs welcome).
