[fix #1726] Use boto3 for SQS async requests #1759
Conversation
can you please check the build failures?
Yeah, I did notice the lint broke; I will fix it. Meanwhile, I would still appreciate feedback on the code. Thanks.
it is not the lint that broke :) it's a test failure:

=================================== FAILURES ===================================
self = <t.unit.transport.test_SQS.test_Channel object at 0x7f7f958ff610>
E       AssertionError: assert 0 == 1
Ah, ok, sorry I missed that. I ran tests locally, but I had to scope them to only some of the files, as I started getting some Azure-related errors. @auvipy , what is the easiest way to run all tests locally?
@auvipy , I fixed the build issues. Could you please take a look?
yes, thanks! I am going to review it
the code looks good. can you also check the docs to see if anything needs adjustment? thanks for your effort here
Sure, will do.
Hi @auvipy . I checked the documentation under the doc/ folder and I didn't see anything related to the methods I removed, so we should be good. I also rebased my branch and fixed a couple of issues the linter was complaining about. Finally, 3 members of my team also reviewed the change and they think it looks good. Please let me know if there is anything else that I can help with.

One important thing I would like to mention is regarding the methods I removed. I confirmed that those methods are not being used within Celery, but I cannot confirm they are not referenced outside of it. However, since those methods are not documented, and they are effectively internal methods, they are not supposed to be used by others; if they are, then the using code should be updated accordingly, much like how we are updating kombu to avoid using botocore's internals. Please let me know your thoughts on this.
I agree with your viewpoints. I am going to review it and merge when I feel confident. thanks for taking this on
Thanks a lot Rafid!
we got a report of a performance regression here: #1783 (comment). would be thankful to have this verified
we might need to revert the change #1799 |
@auvipy , I don't think this is a good idea. Reverting the change is a ticking bomb that will render kombu completely unusable when botocore is updated. Performance degradation seems the lesser of the two evils, assuming it is indeed caused by my PR. What is the timeline for the new release? I can try to find some time to debug the issue and see if I can root-cause the performance degradation.
reverting is the last resort. you can take your time to re-investigate the issue in the coming weeks.
@auvipy , what is the time frame we are talking about? I have an important deadline on October 18. If that's too far, I will try to find someone else on my team to take a look at this.
That is a little bit too late. We want to cut a release this month for bug fixes. But since it is a big change, we can hold off on reverting for now and wait till 30 Oct 2023 to find a solution.
We think there is indeed an issue with the pull request, and my colleague and I are actively working on root-causing and fixing it. We will keep you updated.
…)" This reverts commit 862d0bc.
TL;DR - The use of boto3 in celery#1759 resulted in relying on blocking (synchronous) HTTP requests, which caused the performance issue reported in celery#1783.

`kombu` previously used to craft AWS requests manually, as explained in detail in celery#1726, which resulted in an outage when botocore temporarily changed the default protocol to JSON (before rolling back due to the impact on celery and airflow). To fix the issue, I submitted celery#1759, which changes `kombu` to use `boto3` instead of manually crafting AWS requests. This way, when boto3 changes the default protocol, kombu won't be impacted.

While working on celery#1759, I did extensive debugging to understand the multi-threading nature of kombu. What I discovered is that there isn't actual multi-threading in the true sense of the word, but an event loop that runs on the same thread and process and orchestrates the communication with SQS. As such, it didn't appear to me that there was anything in my change to worry about, and the testing I did didn't uncover any issue. However, it turns out that while kombu's event loop doesn't involve actual multi-threading, its [reliance on pycurl](https://github.com/celery/kombu/blob/main/kombu/asynchronous/http/curl.py#L48) (and thus libcurl) meant that the requests to AWS were being made asynchronously. boto3 requests, on the other hand, are always made synchronously, i.e. they are blocking requests.

This meant that my change introduced blocking on kombu's event loop. That is fine in most cases, since requests to SQS are pretty fast. However, when long polling is used, a call to SQS's ReceiveMessage can last up to 20 seconds (depending on the user configuration).

To solve this problem, I rolled back my earlier changes and, instead, to address the issue reported in celery#1726, borrowed the implementation of `get_response` from botocore and changed the code such that the protocol is hard-coded to `query`. This way, when botocore changes the default protocol of SQS to JSON, kombu won't be impacted, since it crafts its own requests and, after my change, it parses responses with a hard-coded protocol matching those crafted requests.

This shouldn't be the final solution; it is more of a workaround that does the job for now. There are two problems with this approach:

1. It doesn't address the fundamental problem discussed in celery#1726, which is that `kombu` relies on parts of botocore that are internal, namely the `StreamingBody` class.
2. It still makes an assumption, namely that the communication protocol is the `query` protocol. While this is true, and likely to remain true for some time, in theory nothing stops SQS (the backend, not the client) from changing the default protocol, rendering the hard-coding of the protocol to `query` in the new `get_response` method problematic.

As such, the long-term solution should be to rely completely on boto3 for all communication with AWS, while ensuring that all requests are async in nature (non-blocking). This, however, is a fundamental change that requires a lot of testing, in particular performance testing.
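To make the blocking problem concrete, here is a minimal sketch (not kombu's actual hub implementation): a single-threaded event loop runs callbacks one at a time, so a synchronous receive call, standing in for boto3's blocking `receive_message` with long polling enabled, stalls every callback queued behind it. The loop, the callback names, and the 0.5-second sleep standing in for a long poll are all hypothetical:

```python
import time

def run_loop(callbacks):
    # Toy single-threaded event loop: like kombu's hub, it runs
    # callbacks one after another on the same thread and process.
    for cb in callbacks:
        cb()

def blocking_receive():
    # Stand-in for a synchronous boto3 receive_message call with long
    # polling; a real call can block for up to WaitTimeSeconds (max 20s).
    time.sleep(0.5)
    return []

start = time.monotonic()
timestamps = []

def heartbeat():
    # Records when the loop actually got around to running this callback.
    timestamps.append(time.monotonic() - start)

run_loop([heartbeat, blocking_receive, heartbeat])
print(timestamps)  # the second heartbeat fires ~0.5s after the first
```

With an asynchronous transport (as with pycurl/libcurl), the receive would instead be registered with the loop and the second heartbeat would run almost immediately; that difference is exactly the regression described above.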
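The assumption called out in point 2 can also be illustrated with a small stdlib-only sketch (this is not kombu's or botocore's actual parsing code): a query-protocol SQS response is XML with a fixed shape, and a parser hard-coded to that shape would break the moment the service started answering in JSON. The sample response and the `parse_receive_message` helper below are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical query-protocol (XML) ReceiveMessage response body.
SAMPLE = """<ReceiveMessageResponse xmlns="http://queue.amazonaws.com/doc/2012-11-05/">
  <ReceiveMessageResult>
    <Message>
      <MessageId>msg-1</MessageId>
      <Body>hello</Body>
    </Message>
  </ReceiveMessageResult>
</ReceiveMessageResponse>"""

NS = "{http://queue.amazonaws.com/doc/2012-11-05/}"

def parse_receive_message(xml_text):
    # Hard-codes the query protocol's XML shape, mirroring the
    # assumption the commit message flags as risk #2: a JSON-protocol
    # response would not parse at all.
    root = ET.fromstring(xml_text)
    messages = []
    for msg in root.iter(f"{NS}Message"):
        messages.append({
            "MessageId": msg.findtext(f"{NS}MessageId"),
            "Body": msg.findtext(f"{NS}Body"),
        })
    return messages

print(parse_receive_message(SAMPLE))  # [{'MessageId': 'msg-1', 'Body': 'hello'}]
```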
TL;DR - The use of boto3 in celery#1759 resulted in relying on blocking (synchronous) HTTP requests, which caused the performance issue reported in celery#1783. `kombu` previously used to craft AWS requests manually as explained in detail in celery#1726, which resulted in an outage when botocore temporarily changed the default protocol to JSON (before rolling back due to the impact on celery and airflow.) To fix the issue, I submitted celery#1759, which changes `kombu` to use `boto3` instead of manually crafting AWS requests. This way when boto3 changes the default protocol, kombu won't be impacted. While working on celery#1759, I did extensive debugging to understand the multi-threading nature of kombu. What I discovered is that there isn't an actual multi-threading in the true sense of the word, but an event loop that runs on the same thread and process and orchestrate the communication with SQS. As such, it didn't appear to me that there is anything to worry about my change, and the testing I did didn't discover any issue. However, it turns out that while kombu's event loop doesn't have actual multi-threading, its [reliance on pycurl](https://github.com/celery/kombu/blob/main/kombu/asynchronous/http/curl.py#L48) (and thus libcurl) meant that the requests to AWS were being done asynchronously. On the other hand, boto3 requests are always done synchronously, i.e. they are blocking requests. The above meant that my change introduced blocking on the event loop of kombu. This is fine in most of the cases, since the requests to SQS are pretty fast. However, in the case of using long-polling, a call to SQS's ReceiveMessage can last up to 20 seconds (depending on the user configuration). To solve this problem, I rolled back my earlier changes and, instead, to address the issue reported in celery#1726, I now borrowed the implementation of `get_response` from botocore and changed the code such that the protocol is hard-coded to `query`. 
This way, when botocore changes the default protocol of SQS to JSON, kombu won't be impacted, since it crafts its own request and, after my change, it uses a hard-coded protocol based on the crafted requests. This solution shouldn't be the final solution, and it is more of a workaround that does the job for now. There are two problems with this approach: 1. It doesn't address the fundamental problem discussed in celery#1726, which is that `kombu` is using some stuff that are kind of internal to botocore, namely the `StreamingBody` class. 2. It is still making an assumption, namely that the protocol of communication is the `query` protocol. While this is true, and likely going to be true for some time, in theory nothing stops SQS (the backend, not client) from changing the default protocol, rendering the hard-coding of protocol in the new `get_response` method to `query` to be problematic. As such, the long term solution should be to completely rely on boto3 for any communication with AWS, and ensuring that all requests all async in nature (non-blocking.) This, however, is a fundamental change that requires a lot of testing, in particular performance testing.
TL;DR - The use of boto3 in celery#1759 resulted in relying on blocking (synchronous) HTTP requests, which caused the performance issue reported in celery#1783. `kombu` previously used to craft AWS requests manually as explained in detail in celery#1726, which resulted in an outage when botocore temporarily changed the default protocol to JSON (before rolling back due to the impact on celery and airflow.) To fix the issue, I submitted celery#1759, which changes `kombu` to use `boto3` instead of manually crafting AWS requests. This way when boto3 changes the default protocol, kombu won't be impacted. While working on celery#1759, I did extensive debugging to understand the multi-threading nature of kombu. What I discovered is that there isn't an actual multi-threading in the true sense of the word, but an event loop that runs on the same thread and process and orchestrate the communication with SQS. As such, it didn't appear to me that there is anything to worry about my change, and the testing I did didn't discover any issue. However, it turns out that while kombu's event loop doesn't have actual multi-threading, its [reliance on pycurl](https://github.com/celery/kombu/blob/main/kombu/asynchronous/http/curl.py#L48) (and thus libcurl) meant that the requests to AWS were being done asynchronously. On the other hand, boto3 requests are always done synchronously, i.e. they are blocking requests. The above meant that my change introduced blocking on the event loop of kombu. This is fine in most of the cases, since the requests to SQS are pretty fast. However, in the case of using long-polling, a call to SQS's ReceiveMessage can last up to 20 seconds (depending on the user configuration). To solve this problem, I rolled back my earlier changes and, instead, to address the issue reported in celery#1726, I now borrowed the implementation of `get_response` from botocore and changed the code such that the protocol is hard-coded to `query`. 
This way, when botocore changes the default protocol of SQS to JSON, kombu won't be impacted, since it crafts its own request and, after my change, it uses a hard-coded protocol based on the crafted requests. This solution shouldn't be the final solution, and it is more of a workaround that does the job for now. There are two problems with this approach: 1. It doesn't address the fundamental problem discussed in celery#1726, which is that `kombu` is using some stuff that are kind of internal to botocore, namely the `StreamingBody` class. 2. It is still making an assumption, namely that the protocol of communication is the `query` protocol. While this is true, and likely going to be true for some time, in theory nothing stops SQS (the backend, not client) from changing the default protocol, rendering the hard-coding of protocol in the new `get_response` method to `query` to be problematic. As such, the long term solution should be to completely rely on boto3 for any communication with AWS, and ensuring that all requests all async in nature (non-blocking.) This, however, is a fundamental change that requires a lot of testing, in particular performance testing.
TL;DR - The use of boto3 in celery#1759 resulted in relying on blocking (synchronous) HTTP requests, which caused the performance issue reported in celery#1783. `kombu` previously used to craft AWS requests manually as explained in detail in celery#1726, which resulted in an outage when botocore temporarily changed the default protocol to JSON (before rolling back due to the impact on celery and airflow.) To fix the issue, I submitted celery#1759, which changes `kombu` to use `boto3` instead of manually crafting AWS requests. This way when boto3 changes the default protocol, kombu won't be impacted. While working on celery#1759, I did extensive debugging to understand the multi-threading nature of kombu. What I discovered is that there isn't an actual multi-threading in the true sense of the word, but an event loop that runs on the same thread and process and orchestrate the communication with SQS. As such, it didn't appear to me that there is anything to worry about my change, and the testing I did didn't discover any issue. However, it turns out that while kombu's event loop doesn't have actual multi-threading, its [reliance on pycurl](https://github.com/celery/kombu/blob/main/kombu/asynchronous/http/curl.py#L48) (and thus libcurl) meant that the requests to AWS were being done asynchronously. On the other hand, boto3 requests are always done synchronously, i.e. they are blocking requests. The above meant that my change introduced blocking on the event loop of kombu. This is fine in most of the cases, since the requests to SQS are pretty fast. However, in the case of using long-polling, a call to SQS's ReceiveMessage can last up to 20 seconds (depending on the user configuration). To solve this problem, I rolled back my earlier changes and, instead, to address the issue reported in celery#1726, I now borrowed the implementation of `get_response` from botocore and changed the code such that the protocol is hard-coded to `query`. 
This way, when botocore changes the default protocol of SQS to JSON, kombu won't be impacted, since it crafts its own request and, after my change, it uses a hard-coded protocol based on the crafted requests. This solution shouldn't be the final solution, and it is more of a workaround that does the job for now. There are two problems with this approach: 1. It doesn't address the fundamental problem discussed in celery#1726, which is that `kombu` is using some stuff that are kind of internal to botocore, namely the `StreamingBody` class. 2. It is still making an assumption, namely that the protocol of communication is the `query` protocol. While this is true, and likely going to be true for some time, in theory nothing stops SQS (the backend, not client) from changing the default protocol, rendering the hard-coding of protocol in the new `get_response` method to `query` to be problematic. As such, the long term solution should be to completely rely on boto3 for any communication with AWS, and ensuring that all requests all async in nature (non-blocking.) This, however, is a fundamental change that requires a lot of testing, in particular performance testing.
TL;DR - The use of boto3 in celery#1759 resulted in relying on blocking (synchronous) HTTP requests, which caused the performance issue reported in celery#1783. `kombu` previously used to craft AWS requests manually as explained in detail in celery#1726, which resulted in an outage when botocore temporarily changed the default protocol to JSON (before rolling back due to the impact on celery and airflow.) To fix the issue, I submitted celery#1759, which changes `kombu` to use `boto3` instead of manually crafting AWS requests. This way when boto3 changes the default protocol, kombu won't be impacted. While working on celery#1759, I did extensive debugging to understand the multi-threading nature of kombu. What I discovered is that there isn't an actual multi-threading in the true sense of the word, but an event loop that runs on the same thread and process and orchestrate the communication with SQS. As such, it didn't appear to me that there is anything to worry about my change, and the testing I did didn't discover any issue. However, it turns out that while kombu's event loop doesn't have actual multi-threading, its [reliance on pycurl](https://github.com/celery/kombu/blob/main/kombu/asynchronous/http/curl.py#L48) (and thus libcurl) meant that the requests to AWS were being done asynchronously. On the other hand, boto3 requests are always done synchronously, i.e. they are blocking requests. The above meant that my change introduced blocking on the event loop of kombu. This is fine in most of the cases, since the requests to SQS are pretty fast. However, in the case of using long-polling, a call to SQS's ReceiveMessage can last up to 20 seconds (depending on the user configuration). To solve this problem, I rolled back my earlier changes and, instead, to address the issue reported in celery#1726, I now borrowed the implementation of `get_response` from botocore and changed the code such that the protocol is hard-coded to `query`. 
This way, when botocore changes the default protocol of SQS to JSON, kombu won't be impacted, since it crafts its own request and, after my change, it uses a hard-coded protocol based on the crafted requests. This solution shouldn't be the final solution, and it is more of a workaround that does the job for now. There are two problems with this approach: 1. It doesn't address the fundamental problem discussed in celery#1726, which is that `kombu` is using some stuff that are kind of internal to botocore, namely the `StreamingBody` class. 2. It is still making an assumption, namely that the protocol of communication is the `query` protocol. While this is true, and likely going to be true for some time, in theory nothing stops SQS (the backend, not client) from changing the default protocol, rendering the hard-coding of protocol in the new `get_response` method to `query` to be problematic. As such, the long term solution should be to completely rely on boto3 for any communication with AWS, and ensuring that all requests all async in nature (non-blocking.) This, however, is a fundamental change that requires a lot of testing, in particular performance testing.
TL;DR - The use of boto3 in celery#1759 resulted in relying on blocking (synchronous) HTTP requests, which caused the performance issue reported in celery#1783. `kombu` previously used to craft AWS requests manually as explained in detail in celery#1726, which resulted in an outage when botocore temporarily changed the default protocol to JSON (before rolling back due to the impact on celery and airflow.) To fix the issue, I submitted celery#1759, which changes `kombu` to use `boto3` instead of manually crafting AWS requests. This way when boto3 changes the default protocol, kombu won't be impacted. While working on celery#1759, I did extensive debugging to understand the multi-threading nature of kombu. What I discovered is that there isn't an actual multi-threading in the true sense of the word, but an event loop that runs on the same thread and process and orchestrate the communication with SQS. As such, it didn't appear to me that there is anything to worry about my change, and the testing I did didn't discover any issue. However, it turns out that while kombu's event loop doesn't have actual multi-threading, its [reliance on pycurl](https://github.com/celery/kombu/blob/main/kombu/asynchronous/http/curl.py#L48) (and thus libcurl) meant that the requests to AWS were being done asynchronously. On the other hand, boto3 requests are always done synchronously, i.e. they are blocking requests. The above meant that my change introduced blocking on the event loop of kombu. This is fine in most of the cases, since the requests to SQS are pretty fast. However, in the case of using long-polling, a call to SQS's ReceiveMessage can last up to 20 seconds (depending on the user configuration). To solve this problem, I rolled back my earlier changes and, instead, to address the issue reported in celery#1726, I now borrowed the implementation of `get_response` from botocore and changed the code such that the protocol is hard-coded to `query`. 
This way, when botocore changes the default protocol of SQS to JSON, kombu won't be impacted, since it crafts its own request and, after my change, it uses a hard-coded protocol based on the crafted requests. This solution shouldn't be the final solution, and it is more of a workaround that does the job for now. There are two problems with this approach: 1. It doesn't address the fundamental problem discussed in celery#1726, which is that `kombu` is using some stuff that are kind of internal to botocore, namely the `StreamingBody` class. 2. It is still making an assumption, namely that the protocol of communication is the `query` protocol. While this is true, and likely going to be true for some time, in theory nothing stops SQS (the backend, not client) from changing the default protocol, rendering the hard-coding of protocol in the new `get_response` method to `query` to be problematic. As such, the long term solution should be to completely rely on boto3 for any communication with AWS, and ensuring that all requests all async in nature (non-blocking.) This, however, is a fundamental change that requires a lot of testing, in particular performance testing.
TL;DR - The use of boto3 in celery#1759 resulted in relying on blocking (synchronous) HTTP requests, which caused the performance issue reported in celery#1783. `kombu` previously used to craft AWS requests manually as explained in detail in celery#1726, which resulted in an outage when botocore temporarily changed the default protocol to JSON (before rolling back due to the impact on celery and airflow.) To fix the issue, I submitted celery#1759, which changes `kombu` to use `boto3` instead of manually crafting AWS requests. This way when boto3 changes the default protocol, kombu won't be impacted. While working on celery#1759, I did extensive debugging to understand the multi-threading nature of kombu. What I discovered is that there isn't an actual multi-threading in the true sense of the word, but an event loop that runs on the same thread and process and orchestrate the communication with SQS. As such, it didn't appear to me that there is anything to worry about my change, and the testing I did didn't discover any issue. However, it turns out that while kombu's event loop doesn't have actual multi-threading, its [reliance on pycurl](https://github.com/celery/kombu/blob/main/kombu/asynchronous/http/curl.py#L48) (and thus libcurl) meant that the requests to AWS were being done asynchronously. On the other hand, boto3 requests are always done synchronously, i.e. they are blocking requests. The above meant that my change introduced blocking on the event loop of kombu. This is fine in most of the cases, since the requests to SQS are pretty fast. However, in the case of using long-polling, a call to SQS's ReceiveMessage can last up to 20 seconds (depending on the user configuration). To solve this problem, I rolled back my earlier changes and, instead, to address the issue reported in celery#1726, I now borrowed the implementation of `get_response` from botocore and changed the code such that the protocol is hard-coded to `query`. 
This way, when botocore changes the default protocol of SQS to JSON, kombu won't be impacted, since it crafts its own request and, after my change, it uses a hard-coded protocol based on the crafted requests. This solution shouldn't be the final solution, and it is more of a workaround that does the job for now. There are two problems with this approach: 1. It doesn't address the fundamental problem discussed in celery#1726, which is that `kombu` is using some stuff that are kind of internal to botocore, namely the `StreamingBody` class. 2. It is still making an assumption, namely that the protocol of communication is the `query` protocol. While this is true, and likely going to be true for some time, in theory nothing stops SQS (the backend, not client) from changing the default protocol, rendering the hard-coding of protocol in the new `get_response` method to `query` to be problematic. As such, the long term solution should be to completely rely on boto3 for any communication with AWS, and ensuring that all requests all async in nature (non-blocking.) This, however, is a fundamental change that requires a lot of testing, in particular performance testing.
TL;DR - The use of boto3 in celery#1759 resulted in relying on blocking (synchronous) HTTP requests, which caused the performance issue reported in celery#1783. `kombu` previously used to craft AWS requests manually as explained in detail in celery#1726, which resulted in an outage when botocore temporarily changed the default protocol to JSON (before rolling back due to the impact on celery and airflow.) To fix the issue, I submitted celery#1759, which changes `kombu` to use `boto3` instead of manually crafting AWS requests. This way when boto3 changes the default protocol, kombu won't be impacted. While working on celery#1759, I did extensive debugging to understand the multi-threading nature of kombu. What I discovered is that there isn't an actual multi-threading in the true sense of the word, but an event loop that runs on the same thread and process and orchestrate the communication with SQS. As such, it didn't appear to me that there is anything to worry about my change, and the testing I did didn't discover any issue. However, it turns out that while kombu's event loop doesn't have actual multi-threading, its [reliance on pycurl](https://github.com/celery/kombu/blob/main/kombu/asynchronous/http/curl.py#L48) (and thus libcurl) meant that the requests to AWS were being done asynchronously. On the other hand, boto3 requests are always done synchronously, i.e. they are blocking requests. The above meant that my change introduced blocking on the event loop of kombu. This is fine in most of the cases, since the requests to SQS are pretty fast. However, in the case of using long-polling, a call to SQS's ReceiveMessage can last up to 20 seconds (depending on the user configuration). To solve this problem, I rolled back my earlier changes and, instead, to address the issue reported in celery#1726, I now borrowed the implementation of `get_response` from botocore and changed the code such that the protocol is hard-coded to `query`. 
This way, when botocore changes the default protocol of SQS to JSON, kombu won't be impacted, since it crafts its own request and, after my change, it uses a hard-coded protocol based on the crafted requests. This solution shouldn't be the final solution, and it is more of a workaround that does the job for now. There are two problems with this approach: 1. It doesn't address the fundamental problem discussed in celery#1726, which is that `kombu` is using some stuff that are kind of internal to botocore, namely the `StreamingBody` class. 2. It is still making an assumption, namely that the protocol of communication is the `query` protocol. While this is true, and likely going to be true for some time, in theory nothing stops SQS (the backend, not client) from changing the default protocol, rendering the hard-coding of protocol in the new `get_response` method to `query` to be problematic. As such, the long term solution should be to completely rely on boto3 for any communication with AWS, and ensuring that all requests all async in nature (non-blocking.) This, however, is a fundamental change that requires a lot of testing, in particular performance testing.
TL;DR - The use of boto3 in celery#1759 resulted in relying on blocking (synchronous) HTTP requests, which caused the performance issue reported in celery#1783. `kombu` previously used to craft AWS requests manually as explained in detail in celery#1726, which resulted in an outage when botocore temporarily changed the default protocol to JSON (before rolling back due to the impact on celery and airflow.) To fix the issue, I submitted celery#1759, which changes `kombu` to use `boto3` instead of manually crafting AWS requests. This way when boto3 changes the default protocol, kombu won't be impacted. While working on celery#1759, I did extensive debugging to understand the multi-threading nature of kombu. What I discovered is that there isn't an actual multi-threading in the true sense of the word, but an event loop that runs on the same thread and process and orchestrate the communication with SQS. As such, it didn't appear to me that there is anything to worry about my change, and the testing I did didn't discover any issue. However, it turns out that while kombu's event loop doesn't have actual multi-threading, its [reliance on pycurl](https://github.com/celery/kombu/blob/main/kombu/asynchronous/http/curl.py#L48) (and thus libcurl) meant that the requests to AWS were being done asynchronously. On the other hand, boto3 requests are always done synchronously, i.e. they are blocking requests. The above meant that my change introduced blocking on the event loop of kombu. This is fine in most of the cases, since the requests to SQS are pretty fast. However, in the case of using long-polling, a call to SQS's ReceiveMessage can last up to 20 seconds (depending on the user configuration). To solve this problem, I rolled back my earlier changes and, instead, to address the issue reported in celery#1726, I now borrowed the implementation of `get_response` from botocore and changed the code such that the protocol is hard-coded to `query`. 
This way, when botocore changes the default protocol of SQS to JSON, kombu won't be impacted, since it crafts its own request and, after my change, it uses a hard-coded protocol based on the crafted requests. This solution shouldn't be the final solution, and it is more of a workaround that does the job for now. There are two problems with this approach: 1. It doesn't address the fundamental problem discussed in celery#1726, which is that `kombu` is using some stuff that are kind of internal to botocore, namely the `StreamingBody` class. 2. It is still making an assumption, namely that the protocol of communication is the `query` protocol. While this is true, and likely going to be true for some time, in theory nothing stops SQS (the backend, not client) from changing the default protocol, rendering the hard-coding of protocol in the new `get_response` method to `query` to be problematic. As such, the long term solution should be to completely rely on boto3 for any communication with AWS, and ensuring that all requests all async in nature (non-blocking.) This, however, is a fundamental change that requires a lot of testing, in particular performance testing.
Update: rather than hard-coding the protocol to `query`, I changed the `AsyncSQSConnection` class such that it crafts either a `query` or a `json` request depending on the protocol used by the SQS client. Thus, when botocore changes the default protocol of SQS to JSON, kombu won't be impacted, since it crafts its own request using whichever protocol the SQS client is configured with. This still shouldn't be the final solution, and it is more of a workaround that does the job for now. The final solution should be to rely completely on boto3 for any communication with AWS, while ensuring that all requests are async in nature (non-blocking.) This, however, is a fundamental change that requires a lot of testing, in particular performance testing.
Overview
Context on this can be found in #1726, but briefly: AWS SQS is planning a change that makes JSON, instead of XML, the default communication protocol for its boto3 client. Currently, `kombu` manually crafts the AWS request, which results in an XML response when it hits SQS, yet uses boto3 to parse the response. So, when the latter changes to JSON, it will break Celery when used with SQS (and, consequently, break Airflow when used with Celery and SQS.) This PR fixes the issue by using boto3 all the way, instead of just for parsing responses. Thus, when AWS SQS eventually changes the communication protocol for its boto3 client from XML to JSON, the requests will ask for JSON and the responses will be parsed as JSON, avoiding the problem altogether.
Testing
To reproduce the issue, I created a simple Celery application consisting of two files: `worker.py` and `scheduler.py`.
Then I specifically installed botocore version `1.29.127`, which switched to JSON formatting (before it was rolled back in the next version), and confirmed the issue was happening. After my fix in this PR, the issue disappeared.