[FEATURE] Optimization of linkcheck builder's workers (follow-up #11324) #11346

Open
picnixz opened this issue Apr 21, 2023 · 7 comments

@picnixz
Member

picnixz commented Apr 21, 2023

This is a follow-up to #11324 (comment), formalizing #11324 (comment).

The linkcheck builder extracts all the links that should be checked and puts them in a queue so that a HyperlinkAvailabilityCheckWorker can check each link for availability. The number of workers (say n) is fixed at the start of the build and does not change. Since we are slowly moving to using session objects instead of standalone requests, at any point in time there should be at most n sessions running, one for each worker.

Assume that there are two workers and three links to check, say bar.org/x, bar.org/y and foo.org. The queue may look like Q = ['bar.org/x', 'foo.org', 'bar.org/y']. Then, Worker 1 and Worker 2 would check Q[0] and Q[1] simultaneously. If Worker 2 finishes before Worker 1, then Worker 2 checks Q[2]. Ideally, it would have been better to let Worker 1 check that link, since its session already has an open connection to bar.org and no new one would have to be opened.
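
For reference, a minimal sketch of that current shape (illustrative only, not Sphinx's actual code): a fixed pool of worker threads consuming a shared FIFO queue, each worker owning exactly one requests.Session.

```python
# Illustrative sketch of the current design: n workers consume URIs from a
# shared FIFO queue, and each worker owns one requests.Session (i.e. one
# connection pool).
import queue
import threading

import requests

NUM_WORKERS = 5          # the predefined "n"
STOP = None              # sentinel telling a worker to shut down


def worker(links: queue.Queue, results: queue.Queue) -> None:
    session = requests.Session()
    try:
        while (uri := links.get()) is not STOP:
            try:
                response = session.head(uri, allow_redirects=True, timeout=10)
                results.put((uri, response.ok))
            except requests.RequestException as exc:
                results.put((uri, exc))
    finally:
        session.close()


links: queue.Queue = queue.Queue()
results: queue.Queue = queue.Queue()
for uri in ['https://bar.org/x', 'https://foo.org', 'https://bar.org/y']:
    links.put(uri)
for _ in range(NUM_WORKERS):
    links.put(STOP)
threads = [threading.Thread(target=worker, args=(links, results))
           for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```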

Since the links to check are known before checking anything, one can first pre-process and reorganize them as follows:

  • Each link is grouped by its domain so that a single session can check multiple links within the same domain without having to re-open a TCP connection each time.
  • Links that are "standalone" (the only link for their domain) are checked at the end (or at the beginning), or by a dedicated worker that is only responsible for them. When the other workers are done and have no more "bulk" links to check, they help that standalone worker.
  • Workers 1 to n-1 process the links of the first domain together. If one worker finishes before the others, it helps the standalone worker with the standalone links; otherwise, the workers move on to the other domains and their links.
  • Links within a domain should be dispatched to multiple workers, as long as we don't end up with n-1 workers each processing a single link. For example, if I have 5 workers and 5 links of the same domain to process, it is probably more efficient to use only 2 workers instead of 5, since we would then open only 2 TCP connections instead of 5. I think we should investigate how to properly split the chunks and delegate them to the workers (and that is the hardest part of this feature IMHO); a sketch of one possible heuristic follows this list.
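
A hedged sketch of what the pre-processing could look like. The heuristic ("about three links per session, at most two sessions per domain") is invented purely for illustration:

```python
# Group the collected links by network location, then split each group into a
# small number of chunks; each chunk would be handled by one worker/session.
# The numbers (3 links per session, at most 2 sessions per domain) are made up.
import math
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_domain(uris: list[str]) -> dict[str, list[str]]:
    groups: dict[str, list[str]] = defaultdict(list)
    for uri in uris:
        groups[urlsplit(uri).netloc].append(uri)
    return dict(groups)


def split_into_chunks(uris: list[str],
                      links_per_session: int = 3,
                      max_sessions_per_domain: int = 2) -> list[list[str]]:
    n = min(max_sessions_per_domain,
            math.ceil(len(uris) / links_per_session)) or 1
    return [uris[i::n] for i in range(n)]


plan = {domain: split_into_chunks(links)
        for domain, links in group_by_domain(
            ['https://bar.org/x', 'https://foo.org', 'https://bar.org/y']
        ).items()}
# {'bar.org': [['https://bar.org/x', 'https://bar.org/y']],
#  'foo.org': [['https://foo.org']]}
```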

Alternatively, we could assume that there is only one worker responsible for processing a given domain, and that the others only help if they have nothing else to do. If there are 15 links for domain 1 and 5 links for domain 2, the flow is as follows:

  • t=0 --- Worker 1 processes links 1 to 5 of domain 1 and Worker 2 processes all links of domain 2.
  • t=1 --- Worker 1 processes links 6 to 10 of domain 1 and Worker 2 processes links 11 to 15 of domain 1.

The idea is to balance the blocks as much as possible, so that the number of TCP connections to open is as small as possible and there is no waiting time between two checks.
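
A speculative sketch of one way the balancing step could be done, here as a greedy "largest domain to the least-loaded worker" assignment; the real dispatching strategy is precisely what would need a PoC:

```python
# Greedy balancing: assign each domain (largest first) to the currently
# least-loaded worker. With 2 workers, 15 links on domain 1 and 5 on domain 2,
# worker 1 gets domain 1 and worker 2 gets domain 2; the idle time left on
# worker 2 is what the "helping" / work-stealing step would then absorb.
import heapq


def balance(domains: dict[str, list[str]], num_workers: int) -> list[list[str]]:
    heap = [(0, w) for w in range(num_workers)]   # (current load, worker index)
    heapq.heapify(heap)
    assignment: list[list[str]] = [[] for _ in range(num_workers)]
    for _domain, links in sorted(domains.items(), key=lambda kv: -len(kv[1])):
        load, w = heapq.heappop(heap)
        assignment[w].extend(links)
        heapq.heappush(heap, (load + len(links), w))
    return assignment


plan = balance(
    {'domain1': [f'https://domain1.example/{i}' for i in range(15)],
     'domain2': [f'https://domain2.example/{i}' for i in range(5)]},
    num_workers=2,
)
# plan[0] holds the 15 domain1 links, plan[1] the 5 domain2 links.
```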


This feature probably has a very low priority because the implementation will not be trivial, and we might need a PoC to know whether it is really worth implementing. For projects with many links across different domains, the current implementation is suitable, but for projects using sphinx.ext.intersphinx this may be of interest. Still, it is good to have an open issue where we can discuss it.


Related:

@francoisfreitag
Contributor

Also see the related suggestion for improvement, #4303.

@jayaddison
Contributor

There's an idea in #9592 that could be worth considering: the linkchecker could keep a record of hosts where TLS failures have occurred, and avoid re-checking those.

(I think the key for such a lookup table would be the fully-qualified domain name plus the TCP port number, but please correct me if I'm wrong. Also, this wouldn't have to be implemented at the same time as any other optimizations, but if it seems worthwhile, then designing with the possibility of adding it later would be good.)
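
For illustration, a minimal sketch of such a lookup table, keyed by (hostname, port) as suggested above; the helper names are hypothetical:

```python
# Record hosts where a TLS failure occurred, keyed by (FQDN, TCP port), and
# consult the table before issuing further requests to the same endpoint.
from urllib.parse import urlsplit

import requests

tls_failures: set[tuple[str, int]] = set()


def tls_key(uri: str) -> tuple[str, int]:
    parts = urlsplit(uri)
    return (parts.hostname or '', parts.port or 443)


def should_skip(uri: str) -> bool:
    return tls_key(uri) in tls_failures


def record_failure(uri: str, exc: Exception) -> None:
    if isinstance(exc, requests.exceptions.SSLError):
        tls_failures.add(tls_key(uri))
```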

@picnixz
Member Author

picnixz commented May 22, 2023

Depending on the (target) server configuration, some links may be accessible without triggering a TLS failure while others are not, so discarding all of them may be a bit too much; however, such a setup is quite unlikely, so I would agree to blacklist the whole domain.

I think the key for such a lookup table would be fully-qualified domain name plus TCP port number

If we blacklist by DNS A or AAAA records, we may blacklist fallback IPs that could have been fine. I'd prefer not to, but I'm not sure how to actually force DNS resolution to switch to another IP when one fails, except by using an ANY query like dig URL any directly. So we would need to write our own custom HTTP adapter in order to query the correct address when retrying.
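
A rough skeleton of where such a custom adapter could hook in; the class name is made up and the DNS re-resolution itself is deliberately left out:

```python
# A requests transport adapter is the natural place for retry logic that could
# pick a different resolved address; the resolver part is omitted here.
import requests
from requests.adapters import HTTPAdapter


class FallbackAddressAdapter(HTTPAdapter):   # hypothetical name
    def send(self, request, **kwargs):
        try:
            return super().send(request, **kwargs)
        except requests.exceptions.SSLError:
            # Here one could re-resolve the host and retry against another
            # A/AAAA record before giving up; this needs a custom resolver.
            raise


session = requests.Session()
session.mount('https://', FallbackAddressAdapter())
```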

@jayaddison
Contributor

Do you think that invalidating the session if an SSLError occurs (in other words: skip all of the rest of the URLs for the session domain) would be reasonable in an initial implementation?
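
For illustration, a minimal sketch of that idea (the helper and the result labels are made up):

```python
# Check a domain's URLs with one session; on the first SSLError, invalidate
# the session and report the remaining URLs for that domain as skipped.
import requests


def check_domain(session: requests.Session,
                 uris: list[str]) -> list[tuple[str, str]]:
    results = []
    for i, uri in enumerate(uris):
        try:
            response = session.head(uri, allow_redirects=True, timeout=10)
            results.append((uri, 'working' if response.ok else 'broken'))
        except requests.exceptions.SSLError:
            session.close()  # invalidate the session
            # the failing URL and everything after it for this domain
            results.extend((skipped, 'skipped: TLS failure for this domain')
                           for skipped in uris[i:])
            break
        except requests.RequestException:
            results.append((uri, 'broken'))
    return results
```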

@picnixz
Member Author

picnixz commented May 22, 2023

By invalidating the session, do you mean closing it naturally? If so, I would say "yes", because I don't think it's good to have pending resources. It might be better to push failing links to the end of the process instead of trying to check them again and again. One reason is that it may help us deal in advance with links that would fail because of a blacklisted domain, instead of letting them be processed by another worker.

One challenge is to have a shared object containing the blacklisted domains, but with some manager it should be easy to do (plus ensuring that operations are atomic).
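
For instance, assuming the workers stay thread-based, a simple lock-guarded set could serve as that shared object (a sketch, names made up):

```python
# A shared, lock-guarded set of blacklisted domains; add() and the membership
# test are each atomic with respect to the other worker threads.
import threading


class DomainBlacklist:
    def __init__(self) -> None:
        self._domains: set[str] = set()
        self._lock = threading.Lock()

    def add(self, domain: str) -> None:
        with self._lock:
            self._domains.add(domain)

    def __contains__(self, domain: str) -> bool:
        with self._lock:
            return domain in self._domains
```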

@jayaddison
Contributor

By invalidating the session, do you mean closing it naturally?

I think so, yes. My idea was probably slightly imprecise, even to me, but my goal is: let's find an approach that implements most of the behaviour and optimization that we want, while also being straightforward to implement and understand.

If we imagine the domain-grouped hyperlinks as stacks on pallets that arrive via the entrance of a virtual factory work floor, then I think what I'm suggesting is that if any individual pallet's stack doesn't pass an inspection check (TLS connection setup), then in v1 of the workflow we simply skip the rest of the workflow for that pallet's contents (all the items for that domain) and redirect it to one of the exits.

That would avoid the need to share and/or communicate any context across the factory workflows (shared objects across threads). Maybe not optimal long-term, but a simple initial implementation (and hopefully easier to review).

@picnixz
Member Author

picnixz commented May 22, 2023

Implementation-wise, it makes sense. The thing is, we could abstract your "stack of pallets" so that we get an elegant flow:

  • All links are split into stacks of pallets by some algorithm. This algorithm could even be custom!
  • We can go for a visitor pattern that validates different flavours of pallets, or have one generic filter method that validates a single pallet object (same idea as a logging Filter).
  • When a pallet passes this filter, it goes on to the actual validation. Otherwise, it is handled by some error callback.

The splitting phase should be done by the main thread. The pallet validation can be done in a multi-threaded way because each pallet should be independent. Then, the actual link checking would also be multi-threaded. The error callback would aggregate the bad pallets, deciding whether to resend them for another validation phase, possibly after changing them (pruning some parts, splitting them even further; anything can actually be done), and we repeat the process.
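
To make that flow concrete, a very rough sketch of the pallet abstraction; all names are hypothetical:

```python
# Pallets are groups of links; a filter decides whether a pallet enters the
# checking phase, and an error callback decides what to do with rejected
# pallets (drop them, reshape them, or resubmit them for another round).
from collections.abc import Callable, Iterable
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class Pallet:
    domain: str
    links: list[str] = field(default_factory=list)


def split_into_pallets(uris: Iterable[str]) -> list[Pallet]:
    # Done by the main thread; the splitting algorithm could be customizable.
    pallets: dict[str, Pallet] = {}
    for uri in uris:
        domain = urlsplit(uri).netloc
        pallets.setdefault(domain, Pallet(domain)).links.append(uri)
    return list(pallets.values())


def process(pallets: list[Pallet],
            accept: Callable[[Pallet], bool],
            check: Callable[[Pallet], None],
            on_rejected: Callable[[list[Pallet]], list[Pallet]]) -> None:
    while pallets:
        rejected = []
        for pallet in pallets:      # could be dispatched to worker threads
            if accept(pallet):      # e.g. blacklist lookup, TLS pre-check
                check(pallet)
            else:
                rejected.append(pallet)
        # The error callback may prune, split, or resubmit rejected pallets;
        # returning an empty list ends the loop.
        pallets = on_rejected(rejected)
```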

Using a pallet approach should allow us to plug in a bunch of additional steps. Sharing state can be done separately, and this would allow adding dependencies between pallets if needed. But for a v1, we don't necessarily need it.

What I fear the most is that, while this may look really good on paper, I am unsure how much improvement we will gain (maybe we won't gain any!).
