connection.reset don't check again DNS #558

Closed
gagalago opened this issue Feb 26, 2024 · 6 comments · Fixed by #559

Comments

@gagalago

gagalago commented Feb 26, 2024

When using AWS Aurora, the database can keep the same hostname while the IP address behind it changes. In that case it would be useful if resetting the connection resolved the hostname again, so that it connects to the current IP for the originally provided host instead of reusing the IP that was resolved when the connection was first created.

This article explains the problem in more detail and shows how it affects a Rails application running on top of AWS Aurora: https://blog.50projects.com/2023/04/fixing-rails-stickiness.html

larskanis added a commit to larskanis/ruby-pg that referenced this issue Feb 28, 2024
libpq resolves the host by DNS during PQreset, but we don't.
This is because we explicitly set the hostaddr connection parameter when the connection is first established.
This prevents a new DNS resolution when running PQresetStart.

This patch adds DNS resolution to conn.reset.
Since we cannot change the connection parameters after the connection has started, the underlying PGconn pointer is exchanged in reset_start2.
This is done by a PQfinish() + PQconnectStart() sequence.
That way the hostaddr parameter is updated and a new connection is established with it.
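
The effect of a pinned hostaddr can be illustrated without a database. The following is only a minimal sketch under stated assumptions: `params_with_hostaddr`, `DEFAULT_RESOLVER` and the simulated `dns` table are hypothetical names made up for this illustration and are not part of ruby-pg or libpq, and using only the first resolved address is a simplification. It shows why parameters that were fixed at connect time keep pointing at the old IP, while rebuilding them at reset time picks up the new one.

```ruby
require "socket"

# Hypothetical helper, not part of ruby-pg: resolve a host name to IP strings.
# The resolver is injectable so the logic can be exercised without real DNS.
DEFAULT_RESOLVER = lambda do |host|
  Addrinfo.getaddrinfo(host, nil, nil, :STREAM).map(&:ip_address).uniq
end

# Build connection parameters with a pinned hostaddr (assumption: only the
# first resolved address is used, which is a simplification).
def params_with_hostaddr(params, resolver: DEFAULT_RESOLVER)
  addrs = resolver.call(params.fetch(:host))
  params.merge(hostaddr: addrs.first)
end

# Simulated DNS table whose answer changes between connect and reset.
dns = { "abcd" => ["::1"] }
resolver = ->(host) { dns.fetch(host) }

# First connect: hostaddr is resolved once and stored with the parameters.
initial = params_with_hostaddr({ host: "abcd", port: 5432 }, resolver: resolver)

# The IP behind the host name changes, e.g. after a failover.
dns["abcd"] = ["127.0.0.1"]

# Reusing the stored parameters on reset keeps the old address.
stale = initial
# Rebuilding the parameters on reset resolves the host name again.
fresh = params_with_hostaddr({ host: "abcd", port: 5432 }, resolver: resolver)

puts stale[:hostaddr] # prints "::1"
puts fresh[:hostaddr] # prints "127.0.0.1"
```

Injecting the resolver keeps the sketch independent of `/etc/hosts` and `sudo`, which the real verification script further down in this message relies on.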

Unfortunately there's no simple way to test the new behavior.
But I verified that it works with the following code:

```ruby
require "pg"

puts "pg version: #{PG::VERSION}"

system "sudo sed -i 's/.* abcd/::1 abcd/g' /etc/hosts"
conn = PG.connect host: "abcd", password: "l"
conn.exec("select 1")
p conn.conninfo_hash.slice(:host, :hostaddr, :port)

system "sudo sed -i 's/.* abcd/127.0.0.1 abcd/g' /etc/hosts"
conn.reset
conn.exec("select 1")
p conn.conninfo_hash.slice(:host, :hostaddr, :port)

system "sudo sed -i 's/.* abcd/::2 abcd/g' /etc/hosts"
conn.reset
conn.exec("select 1")
p conn.conninfo_hash.slice(:host, :hostaddr, :port)
```

This gives the following output, showing that the IP address is updated:
```
pg version: 1.5.5
{:host=>"abcd", :hostaddr=>"::1", :port=>"5432"}
{:host=>"abcd", :hostaddr=>"127.0.0.1", :port=>"5432"}
ruby-pg/lib/pg/connection.rb:573:in `reset_start2': connection to server at "::2", port 5432 failed: Network is unreachable (PG::ConnectionBad)
	Is the server running on that host and accepting TCP/IP connections?
```

libpq, used with `async_api=false`, resolves similarly:

```
pg version: 1.5.5
{:host=>"abcd", :hostaddr=>nil, :port=>"5432"}
{:host=>"abcd", :hostaddr=>nil, :port=>"5432"}
test-reset-dns.rb:18:in `sync_exec': no connection to the server (PG::UnableToSend)
```

Fixes ged#558
larskanis added a commit to larskanis/ruby-pg that referenced this issue Feb 28, 2024
larskanis added a commit to larskanis/ruby-pg that referenced this issue Feb 29, 2024
…his is because we explicitly set the `hostaddr` connection parameter when the connection is first established. This prevents a new DNS resolution when running PQresetStart.

This patch adds DNS resolution to `conn.reset`.
Since we cannot change the connection parameters after the connection has started, the underlying PGconn pointer is exchanged in reset_start2. This is done by a PQfinish() + PQconnectStart() sequence. That way the `hostaddr` parameter is updated and a new connection is established with it.

There is an `/etc/hosts` and `sudo` based test in the specs.
The behavior of libpq is slightly different from that of ruby-pg.
It can be verified with the following code:

```ruby
require "pg"

puts "pg version: #{PG::VERSION}"

system "sudo sed -i 's/.* abcd/::1 abcd/g' /etc/hosts"
conn = PG.connect host: "abcd", password: "l"
conn.exec("select 1")
p conn.conninfo_hash.slice(:host, :hostaddr, :port)

system "sudo sed -i 's/.* abcd/127.0.0.1 abcd/g' /etc/hosts"
conn.reset
conn.exec("select 1")
p conn.conninfo_hash.slice(:host, :hostaddr, :port)

system "sudo sed -i 's/.* abcd/::2 abcd/g' /etc/hosts"
conn.reset
conn.exec("select 1")
p conn.conninfo_hash.slice(:host, :hostaddr, :port)
```

This gives the following output, showing that the IP address is updated:
```
pg version: 1.5.5
{:host=>"abcd", :hostaddr=>"::1", :port=>"5432"}
{:host=>"abcd", :hostaddr=>"127.0.0.1", :port=>"5432"}
ruby-pg/lib/pg/connection.rb:573:in `reset_start2': connection to server at "::2", port 5432 failed: Network is unreachable (PG::ConnectionBad)
	Is the server running on that host and accepting TCP/IP connections?
```

libpq, used with `async_api=false`, resolves similarly, but it raises the error in the subsequent `conn.exec` rather than in `conn.reset`:

```
pg version: 1.5.5
{:host=>"abcd", :hostaddr=>nil, :port=>"5432"}
{:host=>"abcd", :hostaddr=>nil, :port=>"5432"}
test-reset-dns.rb:18:in `sync_exec': no connection to the server (PG::UnableToSend)
```

Fixes ged#558
@larskanis
Collaborator

The blog post was almost one year old before this issue was raised here in the issue tracker. 😄
Nevertheless, I have proposed a fix in #559. Would you mind testing it?

@bnferguson

Oh, how timely is this?! I've been seeing a similar, related issue with a few setups (e.g. Nutanix, Aurora, EDB): after a failover where the server doesn't go away (it just disconnects clients), or goes away and comes back, clients reconnect to the old IP. This has caused a few outages, since all other services (mostly Golang) seem to pick up the change, while the Ruby-based ones stay stuck on the old IP.

I had a patch that's basically a simplified version of what that blog post lays out, but while looking into upstream fixes I found this issue and the PR! 🎉

@larskanis I tested your PR with my test setup I'd been using to dig into this problem and it fixes the issue completely. I'll comment over on the PR with more test details!

@larskanis
Collaborator

pg-1.5.6 is released, fixing this issue. 🎉
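
For anyone who wants to check at runtime whether an installed pg gem already contains this fix, a small version comparison may be enough. This is only a sketch: `reset_resolves_dns?` and `FIXED_IN` are hypothetical names for illustration, and the only fact taken from this thread is that pg-1.5.6 is the release that fixes the issue.

```ruby
require "rubygems"

# First release that contains the fix, according to this thread.
FIXED_IN = Gem::Version.new("1.5.6")

# Hypothetical helper: true if the given pg version string is new enough
# for conn.reset to resolve the host name again.
def reset_resolves_dns?(pg_version)
  Gem::Version.new(pg_version) >= FIXED_IN
end

puts reset_resolves_dns?("1.5.5") # prints "false"
puts reset_resolves_dns?("1.5.6") # prints "true"
```

In an application the argument would typically be `PG::VERSION`, which is the constant the test script in this issue prints.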

@bnferguson

Oh wow! That was so fast! Even got in before my release deadline today! 😂

Thank you so much!

@larskanis
Collaborator

Doesn't happen that often, but in this case I was just waiting for a confirmation in order to make a release.

@gagalago
Author

gagalago commented Mar 4, 2024

thank you all

and sorry for the late reply
