
[fix #1726] Use boto3 for SQS async requests #1759

Merged: 2 commits into celery:main on Jun 27, 2023

Conversation

@rafidka (Contributor) commented Jun 22, 2023

Overview

Context on this can be found in #1726, but briefly: AWS SQS is planning to make JSON, instead of XML, the default communication protocol of its boto3 client library. Currently, kombu manually crafts the AWS request, which yields an XML response when it hits SQS, yet uses boto3 to parse that response. So, when boto3 switches to JSON, it will break Celery when used with SQS (and, consequently, break Airflow when used with Celery and SQS).

This PR fixes the issue by using boto3 all the way, instead of only for parsing responses. Thus, when AWS SQS eventually changes the communication protocol of its boto3 client from XML to JSON, the requests will ask for JSON and the responses will be parsed as JSON, avoiding the problem altogether.
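
For illustration, here is a minimal sketch (not code from this PR) of why going through boto3 end to end avoids the mismatch: the wire protocol is read from botocore's bundled service model, so request serialization and response parsing always agree with each other. The region below is taken from the test configuration further down.

import boto3

# Illustrative only. With boto3, the wire protocol comes from botocore's
# bundled service model, so requests and response parsing stay consistent.
sqs = boto3.client('sqs', region_name='us-west-2')

# Prints 'query' today; would print 'json' once botocore flips the default.
print(sqs.meta.service_model.protocol)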

Testing

To reproduce the issue, I created this simple Celery application:

worker.py

from celery import Celery

app = Celery('worker', broker='sqs://')

app.conf.update(
    broker_transport_options={
        'region': 'us-west-2',
        'visibility_timeout': 300,
        'polling_interval': 1,
        'predefined_queues': {
            'celery': {
                'url': 'https://sqs.us-west-2.amazonaws.com/<account ID>/celery-test',
            }
        }
    },
)


@app.task(acks_late=False)
def add(x, y):
    print(x + y)
    return x + y

scheduler.py

import time

from worker import add

while True:
    print("Scheduling task...")
    add.delay(4, 4)
    time.sleep(3)

Then I specifically installed botocore version 1.29.127, which switched to JSON formatting (before it was rolled back in the next version), and confirmed the issue occurs. With the fix in this PR, the issue disappeared.
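
(As a quick, illustrative sanity check when reproducing — this snippet is not part of the PR — you can confirm the pinned version:)

import botocore

# botocore 1.29.127 briefly defaulted SQS to the JSON protocol before the
# next release rolled it back; expect '1.29.127' when reproducing.
print(botocore.__version__)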

@auvipy (Member) left a comment

can you please check the build failures?

@rafidka (Contributor, Author) commented Jun 22, 2023

can you please check the build failures?

Yeah, I did notice the lint break. I will fix it. Meanwhile, I would still appreciate feedback on the code. Thanks.

@auvipy (Member) commented Jun 22, 2023

it is not a lint break :) it's a test failure

=================================== FAILURES ===================================
_________________________ test_Channel.test_get_async __________________________

self = <t.unit.transport.test_SQS.test_Channel object at 0x7f7f958ff610>

    @pytest.mark.usefixtures('hub')
    def test_get_async(self):
        """Basic coverage of async code typically used via:
        basic_consume > _loop1 > _schedule_queue > _get_bulk_async"""
        # Prepare
        for i in range(3):
            message = 'message: %s' % i
            self.producer.publish(message)

        # SQS.Channel.asynsqs constructs AsyncSQSConnection using self.sqs
        # which is already a mock thanks to `setup` above, we just need to
        # mock the async-specific methods (as test_AsyncSQSConnection does)
        async_sqs_conn = self.channel.asynsqs(self.queue_name)
        async_sqs_conn.get_list = Mock(name='X.get_list')

        # Call key method
        self.channel._get_bulk_async(self.queue_name)
>       assert async_sqs_conn.get_list.call_count == 1
E       AssertionError: assert 0 == 1
E        +  where 0 = <Mock name='X.get_list'>.call_count
E        +    where <Mock name='X.get_list'> = <kombu.asynchronous.aws.sqs.connection.AsyncSQSConnection object at 0x7f7f82b4a090>.get_list

@rafidka (Contributor, Author) commented Jun 22, 2023

it is not a lint break :) it's a test failure

Ah, OK, sorry I missed that. I ran the tests locally, but I had to scope them to only some of the files, as I started getting some Azure-related errors. @auvipy, what is the easiest way to run all tests locally?

@rafidka (Contributor, Author) commented Jun 23, 2023

@auvipy, I fixed the build issues. Could you please take a look?

@auvipy (Member) commented Jun 23, 2023

yes, thanks! I am going to review it

@auvipy (Member) left a comment

The code looks good. Can you also check the docs to see if anything needs adjustment? Thanks for your effort here.

@rafidka (Contributor, Author) commented Jun 23, 2023

The code looks good. Can you also check the docs to see if anything needs adjustment? Thanks for your effort here.

Sure, will do.

@auvipy auvipy added this to the 5.3.x milestone Jun 26, 2023
@rafidka rafidka force-pushed the fix-1726 branch 2 times, most recently from e117893 to f85d2c7 on June 26, 2023 13:13
@rafidka (Contributor, Author) commented Jun 26, 2023

The code looks good. Can you also check the docs to see if anything needs adjustment? Thanks for your effort here.

Hi @auvipy. I checked the documentation under the doc/ folder and didn't see anything related to the methods I removed, so we should be good. I also rebased my branch and fixed a couple of issues the linter was complaining about. Finally, three members of my team also reviewed the change and they think it looks good. Please let me know if there is anything else I can help with.

One important thing I would like to mention regards the methods I removed. I confirmed that those methods are not used within Celery, but I cannot confirm they are not referenced outside of it. However, since those methods are not documented and are effectively internal, they are not supposed to be used by others; if they are, the calling code should be updated accordingly, much like how we are updating kombu to avoid using botocore's internals. Please let me know your thoughts on this.

@rafidka rafidka requested a review from auvipy June 26, 2023 13:55
@auvipy (Member) commented Jun 27, 2023

I agree with your viewpoints. I am going to review it and merge when I feel confident. Thanks for taking this on.

@auvipy auvipy merged commit 862d0bc into celery:main Jun 27, 2023
14 checks passed
@auvipy (Member) commented Jun 27, 2023

Thanks a lot Rafid!

@auvipy (Member) commented Sep 18, 2023

We got a report of a performance regression here: #1783 (comment). Would be thankful to have this verified.

auvipy added a commit that referenced this pull request Sep 26, 2023
@auvipy (Member) commented Sep 26, 2023

We might need to revert the change: #1799.

@rafidka (Contributor, Author) commented Sep 26, 2023

@auvipy, I don't think this is a good idea. Reverting the change is a ticking bomb that will render kombu completely unusable when botocore is updated. Performance degradation seems the lesser of the two evils, assuming it is indeed caused by my PR. What is the timeline for the new release? I can try to find some time to debug and root-cause the performance degradation.

@auvipy (Member) commented Sep 27, 2023

@auvipy, I don't think this is a good idea. Reverting the change is a ticking bomb that will render kombu completely unusable when botocore is updated. Performance degradation seems the lesser of the two evils, assuming it is indeed caused by my PR. What is the timeline for the new release? I can try to find some time to debug and root-cause the performance degradation.

Reverting is the last resort. You can take your time to re-investigate the issue in the coming weeks.

@rafidka (Contributor, Author) commented Sep 27, 2023

@auvipy, I don't think this is a good idea. Reverting the change is a ticking bomb that will render kombu completely unusable when botocore is updated. Performance degradation seems the lesser of the two evils, assuming it is indeed caused by my PR. What is the timeline for the new release? I can try to find some time to debug and root-cause the performance degradation.

Reverting is the last resort. You can take your time to re-investigate the issue in the coming weeks.

@auvipy, what time frame are we talking about? I have an important deadline on October 18. If that's too far out, I will try to find someone else on my team to take a look at this.

@auvipy (Member) commented Sep 27, 2023

That is a little bit too late. We want to cut a release this month for bug fixes. But since it is a big change, we can hold off on reverting for now and wait until October 30, 2023 to find a solution.

@rafidka (Contributor, Author) commented Sep 29, 2023

We think there is indeed an issue with this pull request, and my colleague and I are actively working on root-causing and fixing it. We will keep you updated.

@rafidka (Contributor, Author) commented Oct 4, 2023

@auvipy, I found the root cause of the performance degradation. Please see my comment here.

rafidka added a commit to rafidka/kombu that referenced this pull request Oct 6, 2023
rafidka added a commit to rafidka/kombu that referenced this pull request Oct 9, 2023
TL;DR - The use of boto3 in celery#1759 resulted in relying on blocking
(synchronous) HTTP requests, which caused the performance issue reported
in celery#1783.

`kombu` previously crafted AWS requests manually, as explained in detail
in celery#1726, which resulted in an outage when botocore temporarily
changed the default protocol to JSON (before rolling back due to the
impact on Celery and Airflow). To fix the issue, I submitted celery#1759,
which changes `kombu` to use `boto3` instead of manually crafting AWS
requests. This way, when boto3 changes the default protocol, kombu won't
be impacted.

While working on celery#1759, I did extensive debugging to understand the
multi-threading nature of kombu. What I discovered is that there isn't
actual multi-threading in the true sense of the word, but an event loop
that runs on the same thread and process and orchestrates the
communication with SQS. As such, it didn't appear to me that there was
anything to worry about in my change, and the testing I did didn't
uncover any issue. However, it turns out that while kombu's event loop
doesn't use actual multi-threading, its [reliance on
pycurl](https://github.com/celery/kombu/blob/main/kombu/asynchronous/http/curl.py#L48)
(and thus libcurl) meant that the requests to AWS were being made
asynchronously. boto3 requests, on the other hand, are always made
synchronously, i.e. they are blocking requests.

The above meant that my change introduced blocking on kombu's event
loop. This is fine in most cases, since the requests to SQS are pretty
fast. However, when long polling is used, a call to SQS's ReceiveMessage
can last up to 20 seconds (depending on the user configuration).

To solve this problem, I rolled back my earlier changes and, instead, to
address the issue reported in celery#1726, I borrowed the implementation
of `get_response` from botocore and changed the code such that the
protocol is hard-coded to `query`. This way, when botocore changes the
default protocol of SQS to JSON, kombu won't be impacted, since it
crafts its own requests and parses responses with a protocol that is
hard-coded to match those requests.

This shouldn't be the final solution; it is more of a workaround that
does the job for now. There are two problems with this approach:

1. It doesn't address the fundamental problem discussed in celery#1726,
   which is that `kombu` relies on some botocore internals, namely the
   `StreamingBody` class.
2. It still makes an assumption, namely that the communication protocol
   is the `query` protocol. While this is true, and likely to remain
   true for some time, in theory nothing stops SQS (the backend, not
   the client) from changing its default protocol, rendering the
   hard-coding of the protocol to `query` in the new `get_response`
   method problematic.

As such, the long-term solution should be to rely completely on boto3
for any communication with AWS, while ensuring that all requests are
async in nature (non-blocking). This, however, is a fundamental change
that requires a lot of testing, in particular performance testing.
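
To make the blocking concern above concrete, here is a toy, illustrative event loop (not kombu's actual hub) showing how a single synchronous call starves everything else scheduled on the loop:

import time

def run_event_loop(callbacks):
    # Single-threaded loop: work only interleaves between callbacks, so any
    # callback that blocks stalls the entire loop.
    while callbacks:
        callbacks.pop(0)()

def blocking_receive():
    # Stands in for a synchronous boto3 ReceiveMessage long poll, which can
    # wait up to 20 seconds before returning.
    time.sleep(20)
    print("message batch received")

def heartbeat():
    print("heartbeat")  # starved until blocking_receive returns

run_event_loop([blocking_receive, heartbeat, heartbeat])
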
rafidka added a commit to rafidka/kombu that referenced this pull request Oct 9, 2023
rafidka added a commit to rafidka/kombu that referenced this pull request Oct 9, 2023
rafidka added a commit to rafidka/kombu that referenced this pull request Oct 9, 2023
rafidka added a commit to rafidka/kombu that referenced this pull request Oct 9, 2023
auvipy added a commit that referenced this pull request Oct 10, 2023
rafidka added a commit to rafidka/kombu that referenced this pull request Oct 11, 2023
rafidka added a commit to rafidka/kombu that referenced this pull request Oct 14, 2023
rafidka added a commit to rafidka/kombu that referenced this pull request Oct 14, 2023
rafidka added a commit to rafidka/kombu that referenced this pull request Oct 14, 2023
rafidka added a commit to rafidka/kombu that referenced this pull request Oct 15, 2023
rafidka added a commit to rafidka/kombu that referenced this pull request Oct 17, 2023
TL;DR - The use of boto3 in celery#1759 resulted in relying on blocking
(synchronous) HTTP requests, which caused the performance issue reported
in celery#1783.

`kombu` previously crafted AWS requests manually, as explained in detail
in celery#1726, which resulted in an outage when botocore temporarily
changed the default protocol to JSON (before rolling back due to the
impact on Celery and Airflow). To fix the issue, I submitted celery#1759,
which changes `kombu` to use `boto3` instead of manually crafting AWS
requests. This way, when boto3 changes the default protocol, kombu won't
be impacted.

While working on celery#1759, I did extensive debugging to understand the
multi-threading nature of kombu. What I discovered is that there isn't
actual multi-threading in the true sense of the word, but an event loop
that runs on the same thread and process and orchestrates the
communication with SQS. As such, it didn't appear to me that there was
anything to worry about in my change, and the testing I did didn't
uncover any issue. However, it turns out that while kombu's event loop
doesn't use actual multi-threading, its [reliance on
pycurl](https://github.com/celery/kombu/blob/main/kombu/asynchronous/http/curl.py#L48)
(and thus libcurl) meant that the requests to AWS were being made
asynchronously. boto3 requests, on the other hand, are always made
synchronously, i.e. they are blocking requests.

The above meant that my change introduced blocking on kombu's event
loop. This is fine in most cases, since the requests to SQS are pretty
fast. However, when long polling is used, a call to SQS's ReceiveMessage
can last up to 20 seconds (depending on the user configuration).

To solve this problem, I rolled back my earlier changes and, instead, to
address the issue reported in celery#1726, I changed the
`AsyncSQSConnection` class such that it crafts either a `query` or a
`json` request depending on the protocol used by the SQS client. Thus,
when botocore changes the default protocol of SQS to JSON, kombu won't
be impacted, since it crafts its own requests using a protocol that
matches the one the SQS client uses.

This shouldn't be the final solution; it is more of a workaround that
does the job for now. The final solution should be to rely completely on
boto3 for any communication with AWS, while ensuring that all requests
are async in nature (non-blocking). This, however, is a fundamental
change that requires a lot of testing, in particular performance
testing.
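
A minimal sketch of the protocol-aware request crafting described above; the function and parameter names here are hypothetical, not the actual `AsyncSQSConnection` implementation:

def craft_request(sqs_client, action, params):
    # Hypothetical sketch: pick the wire format that matches the protocol
    # the SQS client itself uses, per botocore's service model.
    protocol = sqs_client.meta.service_model.protocol
    if protocol == 'json':
        # JSON protocol: JSON body, operation named in the X-Amz-Target header.
        headers = {
            'Content-Type': 'application/x-amz-json-1.0',
            'X-Amz-Target': f'AmazonSQS.{action}',
        }
        return headers, params
    # Legacy 'query' protocol: form-encoded Action/Version parameters.
    return {}, {'Action': action, 'Version': '2012-11-05', **params}
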
rafidka added a commit to rafidka/kombu that referenced this pull request Oct 23, 2023
rafidka added a commit to rafidka/kombu that referenced this pull request Oct 23, 2023
auvipy added a commit that referenced this pull request Nov 16, 2023
* Use the correct protocol for SQS requests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update kombu/asynchronous/aws/sqs/connection.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Asif Saif Uddin <auvipy@gmail.com>