Validate that host_whitelist is not a string #349

Merged: 6 commits merged on Jan 4, 2024

Conversation

timmc
Contributor

@timmc timmc commented Aug 30, 2022

An attacker can use `https:///evil.com` to construct a malformed "hostless" URL
with `netloc == ''`, and the empty string is `in` any string. Strings
are not documented as allowed values for this config variable anyway, so just
raise a `TypeError` if someone passes in a string by accident.

(This is a breaking change for people who didn't follow the documented
types, but it shouldn't affect anyone else.)

The new test fails on current master.
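
A minimal sketch of the problem described above, using only the standard library. The variable names are illustrative and not taken from the patch; the point is that `in` is a substring test on a string but a membership test on a list.

```python
from urllib.parse import urlsplit

# A URL with an empty authority component parses to an empty netloc.
netloc = urlsplit("https:///evil.com").netloc

# Misconfigured whitelist: a bare string instead of a collection of hosts.
string_whitelist = "example.com"
# Correctly typed whitelist: a collection of host strings.
list_whitelist = ["example.com"]

# Substring test: the empty string is contained in every string.
allowed_by_string = netloc in string_whitelist
# Membership test: '' is not an element of the list.
allowed_by_list = netloc in list_whitelist

print(repr(netloc), allowed_by_string, allowed_by_list)
```

With a string whitelist the hostless URL passes the check; with a list it does not.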

@timmc timmc force-pushed the timmc/typecheck-host-whitelist branch from 813e921 to 5b2de2f Compare August 30, 2022 00:23
- host_whitelist = ()
+ host_whitelist = []
Member

I don't think we need to change this.

Contributor Author

It's not necessary to change, but I saw a lot of code out there passing in an empty tuple. Passing an empty list would make it less likely that someone accidentally passes in `("example.com")`, which is a plain string rather than a one-element tuple, and is confused about the error. But the guard later on should catch it anyhow!

Contributor Author

Anyway, up to you! I'm fine backing it out if you think it should stay a tuple.
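
To illustrate the tuple pitfall mentioned in this thread, here is a small sketch (variable names are illustrative only): parentheses alone do not create a tuple, so a single host written in parentheses is still a string.

```python
single_string = ("example.com")   # parentheses only group; this is a str
one_tuple = ("example.com",)      # the trailing comma makes it a tuple
one_list = ["example.com"]        # a list literal needs no trailing comma

print(type(single_string).__name__,
      type(one_tuple).__name__,
      type(one_list).__name__)
```

A list default nudges callers toward the list literal, where this mistake cannot happen.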

Comment on lines 246 to 247
if not isinstance(self.host_whitelist, collection_types):
    raise TypeError("host_whitelist must be a collection type")
Member

I think it would be nice to always turn the value into a frozenset (if non-empty).

Contributor Author

As defensive copying, and for speed of lookups?

Member

Both, yes. We should generally own config collections that the user passed into the constructor, and not let future modifications leak in. I'm aware that that wasn't done before.
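
A rough sketch of the defensive-copy idea, not the code that was merged. `HostFilter` and `allows` are hypothetical names, and the contents of `collection_types` are an assumption since the excerpt does not show its definition. For simplicity the sketch always converts to a `frozenset`, ignoring the "if non-empty" qualifier.

```python
# Assumption: the excerpt does not show how collection_types is defined.
collection_types = (list, tuple, set, frozenset)


class HostFilter:
    """Hypothetical stand-in for the class under review."""

    def __init__(self, host_whitelist=()):
        # Reject strings and other non-collections up front.
        if not isinstance(host_whitelist, collection_types):
            raise TypeError("host_whitelist must be a collection type")
        # Own a private, immutable copy; set lookups are also fast.
        self.host_whitelist = frozenset(host_whitelist)

    def allows(self, netloc):
        # Membership test against the owned copy.
        return netloc in self.host_whitelist


hosts = ["example.com"]
f = HostFilter(hosts)
hosts.append("evil.com")  # later mutation by the caller does not leak in
```

Because the constructor copies the collection, appending to `hosts` afterwards has no effect on what `f` allows.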

Contributor Author

Hmm, just noticed that you said non-empty -- why so?

Contributor Author

What do you think -- any further changes needed?

@scoder
Member

scoder commented Oct 11, 2022 via email

src/lxml/html/clean.py (outdated, resolved)
@scoder scoder merged commit 5fa0cd5 into lxml:master Jan 4, 2024
50 of 51 checks passed
@scoder
Member

scoder commented Jan 4, 2024

Thanks

@timmc timmc deleted the timmc/typecheck-host-whitelist branch January 5, 2024 00:04