Sentry is missing some of cron check-ins when used with sentry-python #2617
Thanks for creating the issue @IevgeniiB!
We have plans to add proper support for Airflow to our Crons product soon. The goal is that you will not need to create check-ins by hand; the Airflow integration will handle this.
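Until that lands, check-ins have to be created by hand around the job. A minimal, SDK-agnostic sketch of that lifecycle is below; the `send` callable and the monitor slug are placeholders, not real sentry-sdk API (the real sender would wrap `sentry_sdk.crons.capture_checkin` or the Crons HTTP endpoint):

```python
import time
import uuid
from contextlib import contextmanager


@contextmanager
def monitor_checkin(send, monitor_slug):
    # Send an "in_progress" check-in, run the body, then report ok/error.
    # `send` is a hypothetical callable: (slug, check_in_id, status, duration).
    check_in_id = uuid.uuid4().hex
    start = time.monotonic()
    send(monitor_slug, check_in_id, "in_progress", None)
    try:
        yield check_in_id
    except Exception:
        send(monitor_slug, check_in_id, "error", time.monotonic() - start)
        raise
    else:
        send(monitor_slug, check_in_id, "ok", time.monotonic() - start)


# Example with a recording sender instead of a network call:
sent = []
with monitor_checkin(lambda slug, cid, status, duration: sent.append(status), "nightly-job"):
    pass  # job body goes here
```

The same shape works whether the sender talks to sentry-python or to the HTTP API directly.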
@antonpirker This is great news! I noticed issues like that outside of Airflow too; do you think it may be related to Airflow?
I am not sure if it is related to Airflow. The tasks finish without errors, and just the check-ins are not sent? Could you turn on debug=True?
Anton, thank you for the suggestion. I've tried it and saw the following:
The logs before disabling the integrations, init check-in:
The logs before disabling the integrations, success check-in:
The logs after disabling the integrations, init check-in:
The logs after disabling the integrations, success check-in:
This all looks good. So the init check-in (the one setting the status) is being sent as expected. Everything is correct:
Is there something else that I should try? Maybe use the HTTP API instead of sentry-python?
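For reference, Sentry does document an HTTP check-in endpoint for Crons that takes the project ID, monitor slug, and DSN public key in the URL. A sketch of building such a URL follows; all the values here are placeholders, and the exact shape should be double-checked against the current Sentry docs:

```python
from urllib.parse import urlencode


def checkin_url(ingest_host, project_id, monitor_slug, public_key, status):
    # URL shape based on Sentry's "check-ins via HTTP" docs; every value
    # passed in here is a made-up placeholder, not a real DSN or project.
    base = f"https://{ingest_host}/api/{project_id}/cron/{monitor_slug}/{public_key}/"
    return base + "?" + urlencode({"status": status})


url = checkin_url("o123.ingest.sentry.io", "4504", "airflow-sync", "examplepublickey", "ok")
# A real check-in would then be a request to this URL (urllib.request, requests, or curl).
```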
Hey @IevgeniiB, thanks for all the info. I asked our server-side folks if they have an idea what could cause this.
Hi @antonpirker, is it possible that this is related to the fix in #2598? I don't have enough knowledge about Sentry to say so myself, but I noticed that apparently there hasn't been a release that contains this fix yet.
Hey @SoerenWeber, I don't think this is related to #2598, since if the type was wrong, I'd expect no check-ins at all. But the issue here appears to be intermittent.
Thanks, @sentrivana, that's a good point. I made this connection because I see similar behavior on my end. In my case, the behavior seems to be correlated with the amount of time the job takes to finish: if it's less than the grace period, it's counted as successful in Sentry; otherwise a missed check-in is reported. With varying workload, this also looks like intermittent missing check-ins.
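The correlation described above can be made concrete with a toy model of the missed-check-in rule (this is an illustration of the behavior as described in this thread, not Sentry's actual server-side logic):

```python
from datetime import datetime, timedelta


def is_missed(expected_at, checkin_at, grace_period):
    # Toy model: a check-in landing after expected time + grace period
    # gets flagged as missed, even though the job eventually succeeded.
    return checkin_at > expected_at + grace_period


expected = datetime(2024, 1, 1, 12, 0)
grace = timedelta(minutes=5)
fast_job = is_missed(expected, expected + timedelta(minutes=4), grace)  # within grace
slow_job = is_missed(expected, expected + timedelta(minutes=6), grace)  # past grace
```

Under this model, a job whose duration hovers around the grace period produces exactly the intermittent pattern described.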
Hey @IevgeniiB and @SoerenWeber, thanks for reaching out. Could you please share the affected monitor URLs with us so we can investigate further on our end? You can email them directly to us at crons-feedback@sentry.io. Thanks.
@gaprl thank you for looking into it, I've sent the URL now. Please let me know if this is not the URL you're looking for or if you need something else.
Thanks for following up @IevgeniiB -- I don't have any additional suggestions from the SDK side of things, but our crons folks are taking a look.
Thank you, @sentrivana! Please let me know if I can provide more information to help debug this issue.
Just a quick note: on my side, the behavior stopped after upgrading to version 1.39.2. (Bear in mind that we might be experiencing different issues here.)
Thanks for following up @SoerenWeber, glad that upgrading has fixed the issue for you! We might be dealing with two different issues, but in any case it's worth a try -- @IevgeniiB, could you see if upgrading to 1.39.2 changes anything?
@sentrivana hi! Unfortunately, it didn't improve the stability of my monitor... Thank you for the suggestion!
We've also been experiencing this since we started using cron monitors. All our monitors are Celery/Beat/Django auto-instrumented. There are no other indications that the tasks are actually failing; the error appears to be only in the monitoring itself. I have mostly been ignoring this and have muted all alerts, but since the trial period is over I figured I should report some details and see if it can help you with fixing this. As it is right now, the crons service isn't providing much value for us since it cries wolf all the time, and I might be tempted to disable it to not pay for that, but before that I should try to help you debug if I can. Let me know if I can provide you with more details.

System: The app lives in a k8s cluster with the webapp and workers as separate pods that can scale and redeploy independently. I did upgrade

Screenshots:
@mathiasose Thanks for helping us debug, this is much appreciated! The failure rates do look bad -- how many of those are expected failures and how many are the buggy timeouts? I don't need specific numbers, just a rough idea of the scope of the issue. What schedules and cron durations are we looking at in the two screenshots? Do you see any difference in the failure rates based on the schedule/duration, or does it look roughly the same? Additionally, could you enable debug=True?
If this is a network issue (which seems like the most likely explanation to me atm), adjusting the socket options might make a difference. Could you try upgrading to at least 1.41.0 and trying out the second snippet from the release notes?
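For anyone landing here later: the release-notes snippet in question configures TCP keep-alive on the transport's sockets. A sketch of assembling such a list with the stdlib follows; the timeout values are illustrative, not necessarily the ones from the release notes, and the `TCP_*` constants are guarded because they are platform-specific:

```python
import socket

# TCP keep-alive socket options in the spirit of the 1.41.0 release-notes
# snippet. TCP_KEEPIDLE and friends are guarded with hasattr() because
# they are not defined on every platform (e.g. TCP_KEEPIDLE is absent on macOS).
keep_alive_options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
for name, value in (("TCP_KEEPIDLE", 45), ("TCP_KEEPINTVL", 10), ("TCP_KEEPCNT", 6)):
    if hasattr(socket, name):
        keep_alive_options.append((socket.IPPROTO_TCP, getattr(socket, name), value))

# This list would then be passed to sentry_sdk.init(..., socket_options=keep_alive_options),
# per the release notes; check those notes for the exact recommended values.
```

Keep-alives matter here because a long-idle pooled connection can be silently dropped by intermediate network gear, making the next check-in send fail.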
In our case I would expect more or less 100% green everywhere -- I could see a few tasks being missed in cases where they are scheduled to run exactly when we deploy a new revision of the app or something, but mostly all the tasks seem to be running as they should, and cron monitoring is just mistaken in reporting errors. I created a debug task today and deployed it to our dev environment only. It runs once per minute, sleeps for 10 seconds during execution, then returns. I deployed it with debug=True on sentry-sdk 1.40 first, and we had three misses that you can see in the screenshot. Then I deployed the bump to 1.41 and the socket options, and it seems maybe to have improved, but we did get another miss later. Each miss seems to be accompanied by an exception during the send_envelope request. I notice there is some NewRelic instrumentation that is affecting the HTTP requests; I will have to dig a little to see if that could have anything to do with this.
Update: It does seem much more stable in the dev environment after the 1.41 upgrade with the socket options, so I will be making that upgrade to our staging and prod envs as well. The only task that had errors in that environment post-bump was the debug task that I added, so if not for that I would have read this as the problem being 100% solved. I guess the error happens much less often with these connection options, and the high frequency of the debug task is what reveals it, but it would eventually happen to other tasks as well.
Thanks @mathiasose. I see two follow-ups for us here: making it easier to enable the alternative connection options (just having a single option instead of the full socket-options list).
Now that we've had the new settings applied for a few days, I wanted to confirm here that things seem to work much better now 🙏 There have been a couple of hiccups that I still don't completely understand (especially the longish period of red in dev and production about 6 days ago), but as long as this stays pretty rare, Crons is a much more helpful tool for us now 👍 These errors might be legitimate application errors that we need to investigate, and now they're not drowning in false alerts. The last row in the screenshot, for example, was red because of a database configuration issue, and Crons reported the failure accurately, which pointed us towards the issue; we made a fix and got the monitor back to green.
Awesome @mathiasose, thanks for following up. Starting with 1.43.0 you can swap the socket options for a single setting. If you find out more about the remaining hiccups, please let us know whether they also look like SDK/network issues.
How do you use Sentry?
Sentry Saas (sentry.io)
Version
1.39.1
Steps to Reproduce
I'm using this task to send check-ins from Airflow:
I use this task in my Airflow DAG; it's set up to run on every execution, before the main logic and after.
I used to have the Celery and Django integrations included because I needed them in the past. Removing them improved the results.
Expected Result
All check-ins to Sentry are visible in the Crons tab in the Sentry UI. There are no missed check-ins when the tasks are working as expected.
The job runs every 10 minutes; I expect evenly spaced successful check-ins.
Actual Result
Both initial check-ins and completion check-ins may be missing from time to time.
With the Celery and Django integrations, the history of check-ins looks like this:
Without the Celery and Django integrations: