Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of the reverse dependencies endpoint #3655

Open
pietroalbini opened this issue May 27, 2021 · 4 comments
Open

Improve performance of the reverse dependencies endpoint #3655

pietroalbini opened this issue May 27, 2021 · 4 comments
Labels
A-backend ⚙️ C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear

Comments

@pietroalbini
Copy link
Member

Me and Justin spent some time a couple days ago trying to figure out why the reverse dependencies endpoint is so slow for crates with a large number of reverse dependencies. This is the EXPLAIN ANALYZE of that query.

The goal of the query is to return all crates whose last version depends on the current crate. As the latest version is not stored anywhere the query is doing a subquery that sorts all the versions of each crate by their semver, which is really slow.

We should denormalize the latest version ID (semver-wise) in the crates table, and change the query to remove the subquery. After removing the subquery the query should be small enough to be converted to Diesel. Ideally denormalizing the latest version ID and updating the query should be done in separate PRs.

@pietroalbini
Copy link
Member Author

Note that this is practically reversing what we did in #592, but this is now starting to actually impact our performance.

@Turbo87 Turbo87 added the C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear label Jun 25, 2021
@jyn514
Copy link
Member

jyn514 commented Jun 27, 2021

Is it possible to use an index instead of a column? That would avoid issues with the column getting out of sync.

@pietroalbini
Copy link
Member Author

I'm pretty sure that query is too complex to go on an index. A materialized view might work depending on how long it takes for it to update.

@pietroalbini
Copy link
Member Author

Yeah, played around with it and indexes don't seem to help much. This query though takes around 3.2s (explain) on the primary db:

SELECT DISTINCT ON (crate_id) crate_id, id
FROM versions
WHERE NOT yanked
ORDER BY crate_id, to_semver_no_prerelease(num) DESC NULLS LAST;

We could turn it into a materialized view we refresh every time a new crate is published/yanked/unyanked/deleted, but I'm worried its speed will continue to decrease as time goes by. Another option to investigate would be to store the parsed semver in the database, without calling to_semver_no_prerelease in the database which is just slow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-backend ⚙️ C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear
Projects
None yet
Development

No branches or pull requests

3 participants