Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot infer top-level imports from packages installed by PDM w/install.cache=true (aka. symlinked packages) #452

Closed
jherland opened this issue Apr 19, 2023 · 10 comments

Comments

@jherland
Copy link
Contributor

PDM has an alternative to install packages (enabled with pdm config install.cache true) where the package is first installed into a local cache (typically under $HOME/.cache/pdm/packages/), and then the package is installed into the target virtualenv, but with symlinks pointing back to the package in the cache. Here is an example:

$ pdm config install.cache true
$ pdm init
[...]
$ pdm add colorama
[...]
$ ls -al .venv/lib/python3.10/site-packages/colorama*
lrwxrwxrwx 1 jherland users   79 Apr 19 03:03 .venv/lib/python3.10/site-packages/colorama -> /home/jherland/.cache/pdm/packages/colorama-0.4.6-py2.py3-none-any/lib/colorama

.venv/lib/python3.10/site-packages/colorama-0.4.6.dist-info:
total 44
drwxr-xr-x 3 jherland users  4096 Apr 19 03:03 .
drwxr-xr-x 4 jherland users  4096 Apr 19 03:03 ..
drwxr-xr-x 2 jherland users  4096 Apr 19 03:03 licenses
-rw-r--r-- 1 jherland users 17158 Apr 19 03:03 METADATA
-rw-r--r-- 1 jherland users   412 Apr 19 03:03 RECORD
-rw-r--r-- 1 jherland users    66 Apr 19 03:03 REFER_TO
-rw-r--r-- 1 jherland users   105 Apr 19 03:03 WHEEL
$ cat .venv/lib/python3.10/site-packages/colorama-0.4.6.dist-info/RECORD
colorama-0.4.6.dist-info/METADATA,sha256=e67SnrUMOym9sz_4TjF3vxvAV4T3aF7NyqRHHH3YEMw,17158
colorama-0.4.6.dist-info/WHEEL,sha256=cdcF4Fbd0FPtw2EMIOwH-3rSOTUdTCeOSXRMD1iLUb8,105
colorama-0.4.6.dist-info/licenses/LICENSE.txt,sha256=ysNcAmhuXQSlpxQL-zs25zrtSWZW6JEQLkKIhteTAxg,1491
colorama-0.4.6.dist-info/REFER_TO,sha256=b2lWg3Osknh6HODITiPmZW55TlJ8okVt-8k6rf3kdpc,66
colorama-0.4.6.dist-info/RECORD,,
colorama,,
$ cat .venv/lib/python3.10/site-packages/colorama-0.4.6.dist-info/REFER_TO
/home/jherland/.cache/pdm/packages/colorama-0.4.6-py2.py3-none-any

Specifically:

  • There is no top-level.txt file, so top-level imports must be inferred from RECORD.
  • The RECORD file does not mention any Python files under colorama/, only the colorama entry itself is listed. This is reflected in importlib_metadata.files("colorama"), and as a result _top_level_inferred(colorama) ends up finding nothing, and packages_distributions() incorrectly presents zero imports for this packages.
  • Inside the site-packages directory, colorama is a symlink to the corresponding colorama directory inside the PDM cache, so Python is indeed able to import colorama and find modules within this package.

This was first reported via tweag/FawltyDeps#307, and (as a FawltyDeps maintainer and importlib_metadata user) I'm trying to figure out if this is an issue with importlib_metadata not properly handling PDM's weird (but techinically valid) installation method, or if PDM is doing something strictly incorrect here.

@jaraco
Copy link
Member

jaraco commented Apr 22, 2023

The packaging spec is weak about RECORD and top-level.txt. For that matter, it's probably also weak about the relationship between installed assets and linked assets.

My first instinct is that PDM should supply a top-level.txt, so that importlib metadata doesn't have to infer it. Is that feasible?

I also wonder if the inference shouldn't be updated to honor just a directory. That is, the name colorama in the RECORD does suggest a package named colorama, even if it doesn't contain any additional files, but only if it's a directory. Unfortunately, RECORD doesn't indicate if it's a directory.

I'm reluctant to go further into the inference by inspecting the environment because that process doesn't generalize well. Although PackagePath.locate and thus Distribution.locate_file return paths on the file system, they're not generally the canonical location of the path (which could be virtual, such as in-memory or in the cloud). I want to be cautious about adding more logic that relies on the assumption that a package is found on the local disk.

@jherland
Copy link
Contributor Author

My first instinct is that PDM should supply a top-level.txt, so that importlib metadata doesn't have to infer it. Is that feasible?

I don't disagree, but since I'm developing a tool that is run in the context of a user's project that happens to be PDM-managed, I'm not really in a position to change how the user manages their project. Also, Isn't the top_level.txt generated at wheel build time? As such, I'm not sure PDM can be expected to manufacture a top_level.txt when installing a wheel that was built elsewhere and downloaded via PyPI?

Unfortunately, RECORD doesn't indicate if it's a directory.

True, although if it were a regular file, it should have an associated hash (and size). Quoting https://packaging.python.org/en/latest/specifications/binary-distribution-format/#the-dist-info-directory:
"every file except RECORD, which cannot contain a hash of itself, must include its hash".

Is it too far-fetched to deduce from this that a RECORD entry without a hash (and size) (except RECORD itself) therefore must point to a directory (or more generally: something that might turn out to be importable)?

I want to be cautious about adding more logic that relies on the assumption that a package is found on the local disk.

Understood.

Do you consider PDM's behavior here a bug, and would it be appropriate to raise an issue with PDM about this? I suspect it may be hard to argue this given that Python itself apparently has no problem importing packages that are installed this way.

@jaraco
Copy link
Member

jaraco commented Jun 18, 2023

I tried replicating the environment, but I'm running up against an issue with pdm. It seems to get confused if there's more than one Python interpreter on the system:

 draft $ docker build --no-cache .
[+] Building 5.6s (7/7) FINISHED                                                                                                                
 => [internal] load .dockerignore                                                                                                          0.0s
 => => transferring context: 2B                                                                                                            0.0s
 => [internal] load build definition from Dockerfile                                                                                       0.0s
 => => transferring dockerfile: 134B                                                                                                       0.0s
 => [internal] load metadata for docker.io/jaraco/multipy-tox:latest                                                                       0.0s
 => CACHED [1/4] FROM docker.io/jaraco/multipy-tox                                                                                         0.0s
 => [2/4] RUN pipx install pdm                                                                                                             3.4s
 => [3/4] RUN pdm init --non-interactive                                                                                                   1.5s 
 => ERROR [4/4] RUN pdm add colorama                                                                                                       0.6s 
------                                                                                                                                          
 > [4/4] RUN pdm add colorama:                                                                                                                  
#0 0.370 Adding group default to lockfile                                                                                                       
#0 0.376 Adding packages to default dependencies: colorama                                                                                      
#0 0.392 The saved Python interpreter doesn't match the project's requirement. Trying to find another one.
#0 0.407 python.use_venv is on, creating a virtualenv for this project...
#0 0.600 [VirtualenvCreateError]: Can't resolve python interpreter
------
Dockerfile:4
--------------------
   2 |     run pipx install pdm
   3 |     run pdm init --non-interactive
   4 | >>> run pdm add colorama
   5 |     
--------------------
ERROR: failed to solve: process "/bin/sh -c pdm add colorama" did not complete successfully: exit code: 1

Although pdm is installed in a virtualenv (using pipx) on Python 3.11, the virtual env it creates in ./.venv is for Python 3.12, even though py and python and python3 all refer to Python 3.11.

Aha. It does appear that if I pass --python 3.11 to pdm init, the errors go away.

from jaraco/multipy-tox
run pipx install pdm
run pdm init --non-interactive --python 3.11
run pdm add colorama

@jaraco
Copy link
Member

jaraco commented Jun 18, 2023

Unfortunately, even after attempting to replicate the indicated failure, I cannot. The symlinks and record are not being created:

 draft $ docker run -it @$(docker build -q .) cat .venv/lib/python3.11/site-packages/colorama-0.4.6.dist-info/RECORD
colorama/__init__.py,sha256=wePQA4U20tKgYARySLEC047ucNX-g8pRLpYBuiHlLb8,266
colorama/ansi.py,sha256=Top4EeEuaQdBWdteKMEcGOTeKeF19Q-Wo_6_Cj5kOzQ,2522
colorama/ansitowin32.py,sha256=vPNYa3OZbxjbuFyaVo0Tmhmy1FZ1lKMWCnT7odXpItk,11128
colorama/initialise.py,sha256=-hIny86ClXo39ixh5iSCfUIa2f_h_bgKRDW7gqs-KLU,3325
colorama/win32.py,sha256=YQOKwMTwtGBbsY4dL5HYTvwTeP9wIQra5MvPNddpxZs,6181
colorama/winterm.py,sha256=XCQFDHjPi6AHYNdZwy0tA02H-Jh48Jp-HvCjeLeLp3U,7134
colorama/tests/__init__.py,sha256=MkgPAEzGQd-Rq0w0PZXSX2LadRWhUECcisJY8lSrm4Q,75
colorama/tests/ansi_test.py,sha256=FeViDrUINIZcr505PAxvU4AjXz1asEiALs9GXMhwRaE,2839
colorama/tests/ansitowin32_test.py,sha256=RN7AIhMJ5EqDsYaCjVo-o4u8JzDD4ukJbmevWKS70rY,10678
colorama/tests/initialise_test.py,sha256=BbPy-XfyHwJ6zKozuQOvNvQZzsx9vdb_0bYXn7hsBTc,6741
colorama/tests/isatty_test.py,sha256=Pg26LRpv0yQDB5Ac-sxgVXG7hsA1NYvapFgApZfYzZg,1866
colorama/tests/utils.py,sha256=1IIRylG39z5-dzq09R_ngufxyPZxgldNbrxKxUGwGKE,1079
colorama/tests/winterm_test.py,sha256=qoWFPEjym5gm2RuMwpf3pOis3a5r_PJZFCzK254JL8A,3709
colorama-0.4.6.dist-info/METADATA,sha256=e67SnrUMOym9sz_4TjF3vxvAV4T3aF7NyqRHHH3YEMw,17158
colorama-0.4.6.dist-info/WHEEL,sha256=cdcF4Fbd0FPtw2EMIOwH-3rSOTUdTCeOSXRMD1iLUb8,105
colorama-0.4.6.dist-info/licenses/LICENSE.txt,sha256=ysNcAmhuXQSlpxQL-zs25zrtSWZW6JEQLkKIhteTAxg,1491
colorama-0.4.6.dist-info/RECORD,,

@jaraco
Copy link
Member

jaraco commented Jun 18, 2023

Found my mistake. Through all my troubleshooting, I'd missed the config setting. After adding the setting to the Dockerfile, I'm able to replicate the issue:

from jaraco/multipy-tox
run pipx install pdm
run pdm config install.cache true
run pdm init --non-interactive --python 3.11
run pdm add colorama
 draft $ docker run -it @$(docker build -q .) cat .venv/lib/python3.11/site-packages/colorama-0.4.6.dist-info/RECORD
colorama-0.4.6.dist-info/METADATA,sha256=e67SnrUMOym9sz_4TjF3vxvAV4T3aF7NyqRHHH3YEMw,17158
colorama-0.4.6.dist-info/WHEEL,sha256=cdcF4Fbd0FPtw2EMIOwH-3rSOTUdTCeOSXRMD1iLUb8,105
colorama-0.4.6.dist-info/licenses/LICENSE.txt,sha256=ysNcAmhuXQSlpxQL-zs25zrtSWZW6JEQLkKIhteTAxg,1491
colorama-0.4.6.dist-info/REFER_TO,sha256=78po-M2GCc5fiAv64DwmnzCULroRpLQBU3QvCI4MXGI,57
colorama-0.4.6.dist-info/RECORD,,
colorama,,

@jaraco
Copy link
Member

jaraco commented Jun 18, 2023

I notice that even the cached wheel doesn't have a top_level.txt in its metadata:

<user>@1012b0756cc9 / # ls ~/.cache/pdm/packages/colorama-0.4.6-py2.py3-none-any/lib/colorama-0.4.6.dist-info/
METADATA  RECORD  WHEEL  licenses

And even if I install colorama without install.cache, it doesn't have a top_level.txt.

@jaraco
Copy link
Member

jaraco commented Jun 18, 2023

Same with pip install of colorama - no top_level.txt.

Now I understand better. colorama isn't built with the top_level.txt metadata, so importlib metadata falls back to inferring from the RECORD file, which doesn't work in a install.cache install by PDM.

@jaraco
Copy link
Member

jaraco commented Jun 18, 2023

I have confirmed that by adding a top_level.txt does allow importlib metadata to resolve the package:

<user>@7be185f19e0f / # cat > .venv/lib/python3.11/site-packages/colorama-0.4.6.dist-info/top_level.txt
colorama
<user>@7be185f19e0f / # py -c "import importlib.metadata as md; print(md.packages_distributions()['colorama'])"
['colorama']

So, pdm could support this case by generating that top_level.txt from the linked package's RECORD (similar to what importlib metadata does.

I've also observed that pdm does in fact include the top_level.txt in the installed package's metadata if it was present in the wheel.

I see three options:

  • importlib metadata doesn't support this form of installation at all and it would be up to PDM to generate the top_level.txt for packages that don't supply it.
  • importlib metadata could infer that any symlink in RECORD is also importable and thus infer those top-level names as it does for regular files and directories.
  • importlib metadata could add first-class support for PDM's technique of installation and use the RECORD from the linked package for resolving files (and thus top-level names).

@jaraco
Copy link
Member

jaraco commented Jun 18, 2023

Isn't the top_level.txt generated at wheel build time? As such, I'm not sure PDM can be expected to manufacture a top_level.txt when installing a wheel that was built elsewhere and downloaded via PyPI?

I've confirmed that file is generated by setuptools (and possibly other tools) at build time as egg-info and (probably) carried over to the wheel and dist-info by wheel.

I don't think that precludes PDM from generating one, though it does beg the question - does this failure indicate that maybe these packages should be building with tools that provide this useful metadata? I'm not sure we want to add a fallback to a fallback to support these packages. Maybe PDM should warn when installing a package without top_level.txt into an environment with install.cache=true so that the user can report to that package that it isn't supplying the top_level.txt file.

Unfortunately, RECORD doesn't indicate if it's a directory.

True, although if it were a regular file, it should have an associated hash (and size). Quoting https://packaging.python.org/en/latest/specifications/binary-distribution-format/#the-dist-info-directory: "every file except RECORD, which cannot contain a hash of itself, must include its hash".

Is it too far-fetched to deduce from this that a RECORD entry without a hash (and size) (except RECORD itself) therefore must point to a directory (or more generally: something that might turn out to be importable)?

That seems reasonable, and from my perspective a modest tweak to the existing fallback, so more palatable.

Do you consider PDM's behavior here a bug, and would it be appropriate to raise an issue with PDM about this? I suspect it may be hard to argue this given that Python itself apparently has no problem importing packages that are installed this way.

Not precisely a bug, but definitely a missed expectation caused by challenging assumptions about how packages are installed. By creating a new abstraction layer between the metadata in site-packages and the package's original metadata, it's taken on the burden of owning that abstraction. That is, it's written a new RECORD to represent the virtual package and creating symlinks of all top-level names to the original package. That means importlib metadata no longer has a clear model to follow for "files supplied by this package" - should it supply the files actually installed in site-packages (the virtual package and its links) or should it supply files represented by the original package (in the cache)?

Both PDM and importlib metadata are operating outside the bounds of specifications here, providing best-effort solutions that nominally work, but which make assumptions and attempt to provide a reasonable experience in an ill-defined space.

@jaraco
Copy link
Member

jaraco commented Jun 18, 2023

Fixed in #453 and released as v6.7.0.

@jaraco jaraco closed this as completed Jun 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants