
feat(uv): parse the dist-manifest.json to not hardcode sha256 in rules_python #2578

Merged · 51 commits · Mar 11, 2025

Conversation

@aignas (Collaborator) commented Jan 25, 2025

Finalize the uv extension interface, employing a builder pattern so
that users can specify the exact version that needs to be registered.
This also moves the registration of the actual toolchain into rules_python
itself and ensures that an incompatible noop toolchain is registered if
nothing is configured, so that `register_toolchains("@uv//:all")`
never fails.

If the url/sha256 values are not specified, we fall back to
using the dist-manifest.json on the GH releases page, so that
we can get the expected sha256 value of each available file and
download all of the usable archives. This means that rules_python no
longer needs to be updated for uv version bumps.

The remaining bits for closing the ticket:

  • Finalize the lock interface.
  • Add the locking target to the pip.parse hub repo if pyproject.toml
    is passed in.
  • Add a rule/target for venv creation.

Work towards #1975.
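For context, a minimal usage sketch of the interface being finalized here (a hypothetical MODULE.bazel snippet; the load path and tag names are assumptions based on the discussion below, not necessarily the final API):

```starlark
uv = use_extension("@rules_python//python/uv:extensions.bzl", "uv")

# Pin the uv version; url/sha256 are optional. When they are omitted, the
# extension falls back to the release's dist-manifest.json for checksums.
uv.config(version = "0.5.24")
use_repo(uv, "uv")

# Safe even when nothing is configured: a noop toolchain is registered.
register_toolchains("@uv//:all")
```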


@aignas aignas requested a review from rickeylev as a code owner January 25, 2025 05:28
@aignas aignas requested a review from groodt January 25, 2025 05:31
@aignas (Collaborator, Author) commented Jan 25, 2025

Not fully sure why it failed.

@aignas aignas marked this pull request as draft January 25, 2025 06:19
@aignas (Collaborator, Author) commented Jan 25, 2025

I'll do the moving of files around in a separate PR so that the functionality can stay in this PR.

aignas added a commit to aignas/rules_python that referenced this pull request Jan 25, 2025
This PR starts establishing a structure that will eventually become a
part of our API. This is a prerequisite for bazel-contrib#2578 which removes the
versions.bzl file in favour of a more dynamic configuration of the
extension. We also remove the `defs.bzl` to establish a one symbol per
file convention.

Things that I wish we could change are `//python/uv:extensions.bzl` and
the fact that we have `extensions` in the load path. I think it cannot
be removed, because that may break the BCR test. On the other hand,
maybe we could remove it and do an alpha release to verify this
assumption.

Work towards bazel-contrib#1975
@aignas aignas force-pushed the feat/improve-uv-mgmt branch from f94d74c to 087fb57 Compare January 25, 2025 07:52
@aignas aignas changed the title feat: improve the uv mgmt and the structure feat(uv): parse the dist-manifest.json to not hardcode sha256 in rules_python Jan 25, 2025
github-merge-queue bot pushed a commit that referenced this pull request Jan 26, 2025

@rickeylev (Collaborator) left a comment


Only had an extra minute free, so didn't get far in review.

I see some new tag classes added. Can you give a quick overview of the thinking?

```starlark
sha256 values for each binary.
"""
dist_manifest = module_ctx.path(_DIST_MANIFEST_JSON)
module_ctx.download(base_url + "/" + _DIST_MANIFEST_JSON, output = dist_manifest)
```
Collaborator:

We probably want a sha for the dist manifest download?

Using untrusted shas to verify a download is equivalent to not using shas

Collaborator (Author) replied:

Regarding the sha256: I am downloading a manifest that only has links, and I don't see how an extra sha256 value here would be beneficial.

This has been inspired by https://github.com/bazel-contrib/rules_go/blob/7f6a9bf5870f2b5ffbba1615658676dcabf9edc7/go/private/sdk.bzl#L84 where rules_go is downloading things from the manifest if the user does not specify anything.

Maybe we should instead have an optional sha256s dict with platform labels as keys and sha256 values as values? We could then print a warning with a buildozer command for adding the said sha256s values to their MODULE.bazel file if they want things to be deterministic.
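The dict idea floated here might look like this (purely illustrative; the `sha256s` attribute is a proposal from this comment, not an implemented API, and the checksum values are placeholders):

```starlark
uv.config(
    version = "0.5.24",
    # Hypothetical attribute: pin per-platform checksums so that the
    # dist-manifest.json fetch no longer has to be trusted.
    sha256s = {
        "linux_x86_64": "0000000000000000000000000000000000000000000000000000000000000000",
        "darwin_aarch64": "0000000000000000000000000000000000000000000000000000000000000000",
    },
)
```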

Collaborator replied:

Ah, I think I see your point now. Hm.

On the one hand, not using shas means, if someone MITM the manifest, then they can control what actual uv stuff is downloaded and then used.

On the other hand, using a sha means we are tied to a particular manifest and don't benefit from the automatic-ness of using the manifest. Which is pretty appealing behavior.

Hm, not sure how, or if, we can split the difference. These seem at odds.

Also:

  • Does the sha arg of download() allow bazel to better cache it?
  • Does the sha arg feed into MODULE.lock? I would expect so, since this is module-phase behavior.
  • Without the sha arg set, does Bazel print a warning? (http_archive does, not sure if download does)

@aignas (Collaborator, Author) commented Jan 27, 2025

The main idea would be to leverage the dist-manifest.json files to download things. rules_python can set the defaults and eventually register the toolchains (if nothing is specified, we just register a noop toolchain).

The users can use the following tag_classes to interact:

  • uv.config - set the base_url for where to download things from.
  • uv.toolchain - set the version and the name of the toolchains repository. Though the name setting is probably not needed if rules_python is planning to register the toolchains.
  • uv.platform - add the platform definitions for the toolchain binaries.

After having written this, maybe we should just have:

  • uv.platform - as above
  • uv.config - set the version and the base_url of the toolchain, with the name being hardcoded.

My thinking was that any user may want to customize the uv usage in the locking rule we provide. All of the above could be done by:

```starlark
uv.config(version = "0.5.24")  # override the version
uv.config(base_url = "my_internal")  # override the base URL
uv.platform(...)  # add an extra platform to support locking on
```

And if they wish, they can just ignore this and register their own uv toolchain.
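The manifest-driven fallback described above can be sketched roughly like this (a simplified illustration, not the merged implementation; the `artifacts` key and `kind` field are assumptions based on cargo-dist style manifests):

```starlark
_DIST_MANIFEST_JSON = "dist-manifest.json"

def _uv_archives(module_ctx, base_url):
    """Fetch the release manifest and list the downloadable archives."""
    out = module_ctx.path(_DIST_MANIFEST_JSON)
    module_ctx.download(base_url + "/" + _DIST_MANIFEST_JSON, output = out)
    manifest = json.decode(module_ctx.read(out))

    # cargo-dist style manifests describe each file under "artifacts";
    # keep only the platform archives that we know how to extract.
    return [
        name
        for name, artifact in manifest["artifacts"].items()
        if artifact.get("kind") == "executable-zip"
    ]
```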

@aignas aignas force-pushed the feat/improve-uv-mgmt branch from 9144ce8 to b8fa595 Compare January 27, 2025 10:13

@aignas aignas marked this pull request as ready for review January 27, 2025 10:14
@aignas (Collaborator, Author) commented Jan 27, 2025

Marking as ready for review even though this is not yet ready to be merged; I would like feedback on the APIs.

@rickeylev (Collaborator) commented:

+1 on uv.config.

I get uv.toolchain. It's telling what version of uv we want to use.

I don't fully grok uv.platform. Is the idea that a uv toolchain is registered for every platform() call? Ah -- it's carrying the constraint labels that get put onto the toolchain. Plus also telling the repo name to use. Is that it? Because putting those on uv.toolchain() is tricky; it would require passing e.g. {"linux_86": [//os:linux, //cpu:x86]}. Similar to requirements_by_platform. Is this what you meant in the maintainers meeting, when you mentioned "we define the platforms and their constraints" ?

I think I like it. The constraints for a platform are going to be the same, regardless of e.g. uv version.

Is part of the idea here that rules_python calls uv.platform() for the common ones? And if a user has a special platform, e.g. there's a musl build of uv, they would do: uv.platform("musl-link", constraints=[os:linux, glibc:musl])?

```starlark
    },
)

platform = tag_class(
```
Collaborator:

Something that isn't clear from the doc here is how the word "for" is operating.

Does this define the platforms uv will run on, or the platform that uv will generate a lockfile for? I'm assuming the former.

@aignas (Collaborator, Author) commented Jan 29, 2025

> +1 on uv.config.
>
> I get uv.toolchain. It's telling what version of uv we want to use.

Maybe we should just have uv.config instead of the separate tag classes that have barely any API?

> I don't fully grok uv.platform. Is the idea that a uv toolchain is registered for every platform() call? Ah -- it's carrying the constraint labels that get put onto the toolchain. Plus also telling the repo name to use. Is that it? Because putting those on uv.toolchain() is tricky; it would require passing e.g. {"linux_86": [//os:linux, //cpu:x86]}. Similar to requirements_by_platform. Is this what you meant in the maintainers meeting, when you mentioned "we define the platforms and their constraints"?

Correct. For uv we have to have some labels that we can put into target_compatible_with on the toolchain, so I am doing a very similar thing to what the python extension is doing under the hood.

> I think I like it. The constraints for a platform are going to be the same, regardless of e.g. uv version.

Yes, that is the idea.

> Is part of the idea here that rules_python calls uv.platform() for the common ones? And if a user has a special platform, e.g. there's a musl build of uv, they would do: uv.platform("musl-link", constraints=[os:linux, glibc:musl])?

Also correct.

@rickeylev (Collaborator) commented Jan 30, 2025

See below -- thoughts?

> Maybe we should just have uv.config instead of the separate tag classes that have barely any API?

Hm. Let's see what some pseudo code looks like and what it might imply.

```starlark
uv.config(base_url="A") # L1
uv.config(version="0.1.0") # L2
uv.config(version="0.2.0", base_url="B") # L3
uv.config(platform="darwin_x86", constraints=[os:darwin, cpu:x86]) # L4
uv.config(platform="linux_arm", constraints=[os:linux, cpu_arm]) # L5
uv.config(version="0.3.0", platform="linux_musl_arm", constraints=[os:linux, cpu:arm, glibc:musl]) # L6
uv.config(version="0.4.0", constraints=[//user:flag]) # L7
```

I just merged all 3 tag classes and threw it in a blender like a code fuzzer, for exploratory purposes. I'm also throwing in complications like we see from the pip/python extensions just for fun.

Aesthetically, that doesn't look too bad. It seems fairly clear to understand? The docs for the config "function" will be a bit unwieldy, but that seems a minor loss.

Four observations:


First: Recall that tag_class calls preserve their order. So, one way to interpret that pseudo code is:

```
By default, download from url A (L1)
By default, use version 0.1.0 (L2)
By default, every registration will attempt to register
  darwin_x86 (L4)
  linux_arm (L5)
  with their respective constraints.

Also register version 0.2.0 and download it from B. (L3)
  implicit: for platforms darwin_x86, linux_arm
Also register version 0.3.0 (L6)
  explicit: but don't use the default platforms. Instead just "linux_musl_arm"
  implicit: download it from A
  i.e. toolchain(name="0.3.0_linux_musl_arm", version=0.3.0,
                 constraints=[linux, arm, musl])
Also register version 0.4.0 (L7)
  explicit: add constraint //user:flag
  implicit: for platforms darwin_x86, linux_arm, download from A
  i.e. toolchain(name="0.4.0_{darwin,linux}", version=0.4.0,
                 target_settings=[//user:flag], ...)
```

It's kind of neat I was able to be so specific.

An interesting implication of respecting the order of calls is I can easily insert toolchains earlier in the ordering. This allows me to define a toolchain with more specific criteria (linux, arm, musl) so that it matches before one with more broad criteria (linux, arm).

Hm, though, things that affect "defaults" seem worthy of looking more distinct (different "function call"), e.g. L1,2,4,5. The order of those doesn't matter. With the sort-of exception of L1 ("first defined version is default").

(edit: inspired aside: maybe add e.g. python.rules_python_internal_set_defaults(...). It'd remove the "if module == rules_python" block that gets interjected into the normal code flow and allow it to be handled specially in a more first-class manner.)


Second: as I wrote the pseudo code, I was reminded of builder patterns, in particular Java protos. These are pretty good at giving fine-grained control in pretty clear and understandable ways:

```java
config = ConfigProto.newBuilder()

tc = config.addToolchain().setVersion("0.1.0").setUrl("B")
  .setConstraints([...])
tc.addPlatform().setName("darwin_x86").setConstraints(...)
tc.addPlatform()...etc...
```

My point being, were it available, we'd probably use a builder-style API. The accumulative nature of tag_class calls seems akin to that.


Third: we have multiple dimensions of interaction: version x platform constraints x url x arbitrary constraints. What we're trying to do is give users enough control over those variables and their expansion that they get what they need without too much of what they don't need. This is a pretty tall ask of any API.


Fourth: I'm reminded of the recent PR that reads the is_default setting from a file. Another approach would be to offload everything into a separate config file altogether. This seems like overkill, but it did pop into my head.

@aignas (Collaborator, Author) commented Feb 1, 2025

I actually like the expressivity of the API you came up with for the platform definitions. Maybe it would be an interesting thing to have two things:

```starlark
uv.defaults()  # used by `rules_python` to set the default platforms, version, etc.
uv.config()    # used by the users to customize things, similar to what you suggested above
```

I think that would be also interesting in the pip.parse case for:

```starlark
pip.parse(hub_name = "foo", requirements_lock = "default_requirements_lock.txt")
pip.parse(hub_name = "foo", requirements_lock = "requirements.linux_x86_64.txt", constraint_values = ["@platforms//os:linux", "@platforms//cpu:x86_64"])
pip.parse(hub_name = "foo", requirements_lock = "requirements.gpu.linux_x86_64.txt", constraint_values = ["@platforms//os:linux", "@platforms//cpu:x86_64"], flag_values = {"@//:my_gpu_flag": "enabled"})
```

This is related to #2548 and could potentially be a good way to allow users to modify things.

However, the pip.parse and the uv.config implications/complications are a little bit worrying:

  • How do you validate things correctly? The user might mistakenly omit some of the arguments, and we would interpret that as "the user must want to append to the previous declaration".
  • Bazel is quite often thought to be too complex for "regular human beings" who are not working at FAANG, and I wonder whether we are raising the entry barrier here.

Maybe we could solve the concerns above by adding another attribute called add, which users would have to use to get this different behaviour, or we could have a separate tag_class that does either building or defining everything at once. Thus, users cannot get into a situation where they are making mistakes.

I am thinking that the following would be unambiguous:

```starlark
uv.config() # the builder pattern
...
uv.toolchain() # non-builder pattern

# likewise for pip
pip.parse() # non-builder pattern
pip.config() # builder pattern
```
I'll think about this more, and when I have time to finish the implementation I might go with the builder/non-builder patterns present in the same bzlmod extension.

@rickeylev (Collaborator) commented:

> have an add attr/tag_class

Yeah, +1 to something like this. That config() really means "merge these values into the current state" doesn't jibe well with the noun "config", and the merge semantics seem out of place for the verb.

add_config(), merge_config(), mixin_config() ?

I like uv.defaults() for setting the defaults.

@aignas (Collaborator, Author) commented Feb 2, 2025

Other ideas for naming: `append_config`, `build_config`, `append_last_config`, `merge_last_config`, `add_or_merge_config`.

@aignas aignas requested a review from rickeylev March 7, 2025 15:11
@aignas (Collaborator, Author) commented Mar 9, 2025

OK, PTAL. I think this is somewhat ready to be merged, if the current implementation of how the toolchains are configured (i.e. last one wins) is what we want.

Hmm. I realized that the root module is always going to be the first one to be processed, so we may want first-one-wins in the toolchain hub. I'll change the implementation tomorrow to do that, as I have run out of time today.

@aignas (Collaborator, Author) commented Mar 10, 2025

OK, I fixed the toolchain definition macro to have the correct behaviour and added tests to ensure that rules_python never overrides anything that the root module configures.
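The first-one-wins rule those tests pin down can be sketched as follows (an illustrative reduction with made-up helper and tag names, not the merged code; it relies on bzlmod iterating `module_ctx.modules` root-first):

```starlark
def _merged_config(module_ctx):
    config = {}

    # The root module comes first in module_ctx.modules, so its values
    # land in `config` before rules_python's defaults and are never
    # overridden by modules further down the dependency graph.
    for mod in module_ctx.modules:
        for tag in mod.tags.config:
            for key, value in (("version", tag.version), ("base_url", tag.base_url)):
                if value and key not in config:
                    config[key] = value
    return config
```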

@rickeylev (Collaborator) commented:

Looks good, great job!

And the whole implementation is actually pretty simple? And yet seems very powerful. I'm liking it so far.

aignas added 2 commits March 11, 2025 07:23
@aignas aignas enabled auto-merge March 11, 2025 07:24
@aignas aignas added this pull request to the merge queue Mar 11, 2025
Merged via the queue into bazel-contrib:main with commit 4cb8412 Mar 11, 2025
4 checks passed