
fix: Simplified RequestQueueV2 implementation #2775

Merged: 31 commits into master, Feb 11, 2025
Conversation

@janbuchar (Contributor) commented Dec 17, 2024

@janbuchar added the t-tooling label on Dec 17, 2024
@github-actions bot added this to the "105th sprint - Tooling team" milestone on Dec 17, 2024
@janbuchar marked this pull request as draft on December 17, 2024 14:59
@drobnikj (Member) left a comment

Nice 💪

I would do some testing myself, but first: what about some unit tests, did you consider adding some? There are none so far -> https://github.com/apify/crawlee/blob/03951bdba8fb34f6bed00d1b68240ff7cd0bacbf/test/core/storages/request_queue.test.ts
Honestly, we have been dealing with various bugs over time and we still do not have any tests for these features.

@drobnikj (Member) commented Jan 6, 2025

The build did not finish, can you check @janbuchar?
I would like to test it in some Actors.

@janbuchar (Contributor Author):

> The build did not finish, can you check @janbuchar? I would like to test it in some Actors.

I can, but only later this week - I have different stuff to finish first.

@github-actions bot added the tested label on Jan 22, 2025
@janbuchar (Contributor Author):

@drobnikj the unit tests are now passing, so you should be able to build. I'm still working on some e2e tests; if you have any ideas for scenarios to test (e2e, unit, doesn't matter), I'd love to hear them.

@drobnikj self-requested a review on January 30, 2025 10:41
@drobnikj (Member) left a comment

Looks good, I did not find any issues, even during testing.
I have a few more comments, can you check them please, @janbuchar?

@@ -361,7 +430,8 @@ export class RequestQueue extends RequestProvider {

Member:

I cannot comment on it below, but during code review I noticed that we are removing locks one by one in _clearPossibleLocks, see:

while ((requestId = this.queueHeadIds.removeFirst()) !== null) {

There is a 200 rps rate limit. I would remove the locks in batches of maybe 10 to speed it up.

Contributor Author:

What do you mean? I don't think there is a batch unlock endpoint. Launching those requests in parallel surely won't help against rate limiting either.

Member:

I mean unlocking them in batches, something like:

protected async _clearPossibleLocks() {
    this.queuePausedForMigration = true;
    let requestId: string | null;
    const batchSize = 10;
    const deleteRequests: Promise<void>[] = [];

    // eslint-disable-next-line no-cond-assign
    while ((requestId = this.queueHeadIds.removeFirst()) !== null) {
        deleteRequests.push(
            this.client.deleteRequestLock(requestId).catch(() => {
                // We don't have the lock, or the request was never locked. Either way it's fine
            })
        );

        if (deleteRequests.length >= batchSize) {
            // Process the batch of 10
            await Promise.all(deleteRequests);
            deleteRequests.length = 0; // Reset the array for the next batch
        }
    }

    // Process any remaining requests that didn't form a full batch
    if (deleteRequests.length > 0) {
        await Promise.all(deleteRequests);
    }
}

Contributor Author:

I see. However, I still doubt that there will be any measurable benefit - this code is only executed on migration and there shouldn't be more than ~25 requests in the queue head.

@janbuchar marked this pull request as ready for review on February 4, 2025 22:49
@janbuchar requested review from barjin and drobnikj on February 4, 2025 22:49
@janbuchar (Contributor Author):

@barjin I gave the forefront handling a makeover. If you could check that out, I'd be super grateful.

@barjin (Contributor) commented Feb 5, 2025

Looking good to me 👍🏽 I remember reversing the forefront array somewhere already (likely memory-storage?), but as long as those tests are passing, this part is IMO good to go.

@drobnikj (Member) left a comment

Looks like almost all of my notes were addressed, and I commented on the rest.


Co-authored-by: Vlad Frangu <me@vladfrangu.dev>
@janbuchar requested a review from vladfrangu on February 10, 2025 17:51
@vladfrangu (Member) left a comment

lgtm once the format is fixed (woops, sorryy ;w;)

@B4nan (Member) left a comment

A few comments from my end, nothing really blocking, so approving.

@@ -673,7 +673,7 @@ export class BasicCrawler<Context extends CrawlingContext = BasicCrawlingContext
         this.requestQueue.internalTimeoutMillis = this.internalTimeoutMillis;
         // for request queue v2, we want to lock requests by the timeout that would also account for internals (plus 5 seconds padding), but
         // with a minimum of a minute
-        this.requestQueue.requestLockSecs = Math.max(this.internalTimeoutMillis / 1000 + 5, 60);
+        this.requestQueue.requestLockSecs = Math.max(this.requestHandlerTimeoutMillis / 1000 + 5, 60);
Member:

the comment still mentions the internal timeout

Contributor Author:

I'm honestly not sure what it was trying to say, so I reworded it.
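
For illustration, a minimal sketch of the lock-duration calculation from the diff above; the standalone helper name is hypothetical and not part of Crawlee:

// Hypothetical helper mirroring the changed line: lock requests for the
// request handler timeout plus 5 seconds of padding, but never for less
// than a minute.
function computeRequestLockSecs(requestHandlerTimeoutMillis: number): number {
    return Math.max(requestHandlerTimeoutMillis / 1000 + 5, 60);
}

computeRequestLockSecs(30_000); // 30 s handler timeout -> 60 s lock (the minimum applies)
computeRequestLockSecs(120_000); // 120 s handler timeout -> 125 s lock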

Comment on lines 414 to 417

this.inProgressRequestBatches.push(promise);
void promise.finally(() => {
    this.inProgressRequestBatches = this.inProgressRequestBatches.filter((it) => it !== promise);
});
Member:

How many items do we expect in that array in a high-concurrency run? This solution is not the best one, but if the size won't be large, we can keep it.

How is this different from a simple integer counter? That would be the most performant approach - just increment instead of push, and decrement in the finally block.

Contributor Author:

The integer counter was in fact the previous implementation. However, it could not work with multiple clients, and we cannot reliably detect that - the queueHadMultipleClients flag is set even if the other client was a pre-migration instance of the same run, if that makes sense.

You are right that each forefront request might make us lock 25 more requests, and that could unbalance parallel instances quite a bit. Maybe we should give up "excess" requests after we're done checking for forefront requests.
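
A rough sketch of the "give up excess requests" idea from the comment above, reusing the deleteRequestLock client call shown earlier in this thread; the helper name and shape are hypothetical, not something the PR implements:

// Illustrative only - release locks on requests we fetched but do not intend
// to process, so that parallel clients can pick them up instead.
protected async releaseExcessLocks(lockedIds: string[], keep: number): Promise<void> {
    const excess = lockedIds.slice(keep);
    await Promise.all(
        excess.map((id) =>
            this.client.deleteRequestLock(id).catch(() => {
                // Losing the lock here is fine - another client may already hold it.
            }),
        ),
    );
}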

Member:

Hmm, I'm not sure I follow why the counter wouldn't be enough - how is this better? Each client will have its own local cache (this new variable). You store values in an array and remove them based on identity, but the promises are not really used anywhere. My suggestion does the same, just without the memory/perf overhead.

Just to be sure, this is what I meant; it still uses promise.finally:

this.inProgressRequestBatches++;
void promise.finally(() => {
    this.inProgressRequestBatches--;
});

Contributor Author:

Oh damn, I'm sorry. I thought you were commenting on a different part of the code - the one that handles forefront requests. If it's any help, you pushed me to tie up a loose end that I forgot about.

Regarding the batches, you're probably right 😁

Comment on lines +193 to +195

if (this.queueHeadIds.length() > 0) {
    return false;
}
Member:

I guess the duplication (same check 5 lines later) here is for performance reasons?

@janbuchar (Contributor Author) commented Feb 11, 2025

Yup. If queueHeadIds is non-empty, we return immediately; otherwise we try to fetch something from the upstream queue, which may take time. I'll add a comment.
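
To make that concrete, a rough sketch of the check being described; only queueHeadIds comes from the snippet above, the method and helper names are assumed for illustration:

// Sketch only - not the actual implementation.
protected async isQueueHeadEmpty(): Promise<boolean> {
    // Fast path: if we already have cached request IDs, the head is
    // definitely not empty, so skip the potentially slow fetch below.
    if (this.queueHeadIds.length() > 0) {
        return false;
    }

    // Otherwise try to refill the cache from the upstream queue...
    await this.fetchNextBatchFromUpstream();

    // ...and repeat the check, since the fetch may have populated the cache.
    return this.queueHeadIds.length() === 0;
}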

@janbuchar merged commit d1a094a into master on Feb 11, 2025
9 checks passed
@janbuchar deleted the simplify-rq-v2 branch on February 11, 2025 14:33
janbuchar added a commit to apify/crawlee-python that referenced this pull request on Mar 11, 2025:

This PR ports over the changes from apify/crawlee#2775.

Key changes:

- tracking of "locked" or "in progress" requests was moved from `storages.RequestQueue` to the request storage client implementations
- the queue head cache gets invalidated after we enqueue a new forefront request (before that, it would only be processed after the current head cache was consumed); see the sketch below
- the `RequestQueue.is_finished` function has been rewritten to avoid race conditions
- I tried running SDK integration tests with these changes and they passed
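
As a rough illustration of the forefront cache invalidation mentioned above (a TypeScript sketch; the class and member names are invented for this example and are not Crawlee APIs):

// Illustrative only - a tiny cache that drops its contents whenever a
// forefront request is enqueued, so the new request is picked up on the
// next fetch instead of only after the cached head is exhausted.
class QueueHeadCache {
    private ids: string[] = [];

    /** Cache a freshly fetched queue head. */
    set(ids: string[]): void {
        this.ids = [...ids];
    }

    /** Called after a request is enqueued. */
    onRequestAdded(forefront: boolean): void {
        if (forefront) {
            this.ids = [];
        }
    }

    /** Next cached request ID, if any. */
    nextId(): string | undefined {
        return this.ids.shift();
    }
}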
Labels: t-tooling (Issues with this label are in the ownership of the tooling team), tested (Temporary label used only programmatically for some analytics)

Successfully merging this pull request may close these issues: Utilize queueHasLockedRequests to simplify RequestQueue v2

5 participants