
Only update the ChordCounter.count field when saving #347

Merged
1 commit merged into celery:master on Oct 21, 2022

Conversation

orf (Contributor) commented Oct 20, 2022

With the current implementation, the following SQL query is executed for every task in the chord:

UPDATE "django_celery_results_chordcounter" 
SET "group_id" = 'f81ac317-eb61-42a6-a750-7cc1feb78b8c', 
"sub_tasks" = '[[["6a2f07d8-6b6f-4286-9d71-dbead46e9b1c", null], null], ..', 
"count" = 122 
WHERE "django_celery_results_chordcounter"."id" = 118

If you are using a large number of sub-tasks, in the thousands or tens of thousands, then re-sending the full sub_tasks payload in every update can be very expensive.

If we use update_fields, we can skip this.
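
A minimal sketch of the idea (update_fields is the standard Django Model.save() API; on_subtask_done is a hypothetical call site for illustration, not the actual code path touched by this PR):

from django_celery_results.models import ChordCounter

def on_subtask_done(counter: ChordCounter) -> None:
    # Hypothetical call site: one sub-task of the chord has completed.
    counter.count -= 1
    # Before this change, a plain counter.save() re-sent every column,
    # including the large serialized sub_tasks JSON (the UPDATE above).
    # With update_fields, Django emits only:
    #   UPDATE "django_celery_results_chordcounter"
    #   SET "count" = ... WHERE "id" = ...
    counter.save(update_fields=["count"])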

orf (Contributor, Author) commented Oct 20, 2022

FYI: we deployed this change and saw an immediate drop in our write throughput:

[Screenshot: write throughput graph, 2022-10-20 15:15]

network traffic:

[Screenshot: network traffic graph, 2022-10-20 15:16]

and write IOPs.

auvipy (Member) commented Oct 20, 2022

can you confirm this won't create any regression?

orf (Contributor, Author) commented Oct 20, 2022

Is it ever possible to confirm that it won’t cause any regression?

All I can say is that the current code doesn’t need to update anything other than the count column, and that I’ve validated this on complex real-world workflows and it works as expected.

One of our larger chords results in a ~7 MB JSON string being updated every time save() is called, which may be multiple times per second. Our internal monitoring showed Postgres performing roughly 4,700 internal row updates and deletes to complete this, which puts significant stress on the various IO subsystems within PG.

Throughout the execution of all of these queries, the JSON string remained the same.
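
For anyone wanting to observe this kind of churn themselves, a generic sketch against Postgres's standard pg_stat_user_tables statistics view (this is not the internal monitoring referenced above, which isn't described in this thread):

from django.db import connection

# Cumulative row-update and dead-tuple counters for the chord counter table.
with connection.cursor() as cursor:
    cursor.execute(
        "SELECT n_tup_upd, n_tup_hot_upd, n_dead_tup "
        "FROM pg_stat_user_tables "
        "WHERE relname = 'django_celery_results_chordcounter'"
    )
    row = cursor.fetchone()
    if row:
        n_tup_upd, n_tup_hot_upd, n_dead_tup = row
        print(f"updates={n_tup_upd} hot={n_tup_hot_upd} dead={n_dead_tup}")

Sampling these counters before and after the update_fields change gives a rough picture of row-update churn; note that the heaviest cost of rewriting a ~7 MB value sits in TOAST and WAL traffic, which these per-table counters only partially capture.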

auvipy merged commit 945c009 into celery:master on Oct 21, 2022

auvipy (Member) commented Oct 21, 2022

yup legit, makes sense. thanks a lot

orf deleted the patch-1 branch on October 21, 2022 at 17:18
orf restored the patch-1 branch on October 21, 2022 at 17:18