Hitting "Timed out fetching a new connection from the connection pool" error #22252

arnmishra · 2023-12-05T00:04:26Z

Question

I've read pretty much every Prisma doc, GitHub issue, Q&A discussion, Stack Overflow thread, etc. on this, but I haven't been able to work out what is causing our issue.

We have been seeing the following error: "Timed out fetching a new connection from the connection pool. More info: http://pris.ly/d/connection-pool (Current connection pool timeout: 10, connection limit: 5)". This is happening in a process running in a Google Cloud Run container. I understand that this means we are running out of available connections. We are using Prisma Client 5 with a Postgres db in Cloud SQL.

We have a set of 5 cloud run instances that all speak to our postgres database. All 5 have their own prisma client that has a connection pool with a default limit of 5 (2 cores). The failures all have been happening so far in the same cloud run service that runs async jobs for us. This service runs a small typescript app that triggers jobs using BullMQ (https://docs.bullmq.io/).
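
For reference, the default limit of 5 on a 2-core container matches the formula given in the Prisma connection pool documentation, num_physical_cpus * 2 + 1, and the 10 in the error message is the default pool_timeout in seconds. A minimal sketch of that arithmetic follows; the helper and constant names are made up for illustration and are not part of Prisma's API.

```typescript
// Hypothetical helper (not a Prisma API): default pool size as documented,
// num_physical_cpus * 2 + 1.
function defaultPoolSize(physicalCpus: number): number {
  if (!Number.isInteger(physicalCpus) || physicalCpus < 1) {
    throw new RangeError("physicalCpus must be a positive integer");
  }
  return physicalCpus * 2 + 1;
}

// Default pool_timeout in seconds (the "10" reported in the error message).
const DEFAULT_POOL_TIMEOUT_SECONDS = 10;

console.log(defaultPoolSize(2)); // 5, matching "connection limit: 5"
console.log(DEFAULT_POOL_TIMEOUT_SECONDS);
```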

In the course of 17 attempts to trigger a job in the last 3 days:

13 had no issues
3 failed but worked on 1 retry
1 failed on all 3 retries and only worked after a 4th manual try
(all failures were the same prisma client timeout)

Every time, the query that fails is the very first query made to the db in that container. At the time of the failure, there is no other activity in that container, and only a very light load, if any, on the other 4 cloud run services (which all have their own prisma connection pool). Therefore, I can't imagine what would cause a timeout. I thought it might have to do with cold starts somehow, but we've configured all our cloud run containers with a minimum of 1 instance always running.

How to reproduce (optional)

No response

Expected behavior (optional)

No response

Information about Prisma Schema, Client Queries and Environment (optional)

datasource db {
  provider = "postgresql"
  url      = env("POSTGRES_URL")
}

generator client {
  provider        = "prisma-client-js"
}

The query it's failing on is a simple findUniqueOrThrow.

Database: PostgreSQL
Node.js version: 18.16
Run prisma -v to see your Prisma version and paste it
5.2.0

arnmishra (Author) · 2023-12-05T00:59:15Z

More info: I added logging of the prisma client metrics after each prisma query.

prismaClient.$on('query', (e) => prismaClient.$metrics.json().then((metrics) => console.dir(metrics, { depth: Infinity })))

Then I ran the job again. It failed on attempt 1 and passed on attempt 2 (exponential backoff delay of 1 second). On the first attempt, no queries succeeded, so nothing was logged (it timed out on the very first query attempt). On the 2nd attempt, all the queries were made. Here are the metrics after the 1st query of the successful run:

first query:

{
  counters: [
    {
      key: 'prisma_client_queries_total',
      labels: {},
      value: 1,
      description: 'Total number of Prisma Client queries executed',
    },
    {
      key: 'prisma_datasource_queries_total',
      labels: {},
      value: 1,
      description: 'Total number of Datasource Queries executed',
    },
    {
      key: 'prisma_pool_connections_open',
      labels: {},
      value: 1,
      description: 'Number of currently open Pool Connections',
    },
  ],
  gauges: [
    {
      key: 'prisma_client_queries_active',
      labels: {},
      value: 0,
      description: 'Number of currently active Prisma Client queries',
    },
    {
      key: 'prisma_client_queries_wait',
      labels: {},
      value: 0,
      description: 'Number of queries currently waiting for a connection',
    },
    {
      key: 'prisma_pool_connections_busy',
      labels: {},
      value: 1,
      description: 'Number of currently busy Pool Connections (executing a database query)',
    },
    {
      key: 'prisma_pool_connections_idle',
      labels: {},
      value: 4,
      description:
        'Number of currently unused Pool Connections (waiting for the next pool query to run)',
    },
    {
      key: 'prisma_pool_connections_opened_total',
      labels: {},
      value: 1,
      description: 'Total number of Pool Connections opened',
    },
  ],
  histograms: [
    {
      key: 'prisma_client_queries_wait_histogram_ms',
      labels: {},
      value: {
        buckets: [[0, 0], [1, 2], [5, 0], [10, 0], [50, 0], [100, 0], [500, 0], [1000, 0], [5000, 0], [50000, 0]],
        sum: 0.002211,
        count: 2,
      },
      description: 'Histogram of the wait time of all queries in ms',
    },
    {
      key: 'prisma_datasource_queries_duration_histogram_ms',
      labels: {},
      value: {
        buckets: [[0, 0], [1, 0], [5, 0], [10, 0], [50, 1], [100, 0], [500, 0], [1000, 0], [5000, 0], [50000, 0]],
        sum: 12.715552,
        count: 1,
      },
      description: 'Histogram of the duration of all executed Datasource Queries in ms',
    },
  ],
}

As you can see, it's the 1st connection used from the pool, nothing is waiting on the pool, and it all runs 1s after the initial failed run. The only place where 2 queries are indicated is prisma_client_queries_wait_histogram_ms, which reports a count of 2. My guess is that this is a race condition, since the 2nd query may have already started while the metrics logging was finishing.

The query that was run is the following (the same query that failed while waiting for the connection pool). In prisma it's just a prismaClient.action.findUniqueOrThrow({ where: { id: 'insert-our-uuid' } }):

Query: SELECT "public"."Action"."id", "public"."Action"."createdAt", "public"."Action"."updatedAt", "public"."Action"."description", "public"."Action"."name", "public"."Action"."order", "public"."Action"."status", "public"."Action"."workflowId" FROM "public"."Action" WHERE ("public"."Action"."id" = $1 AND 1=1) LIMIT $2 OFFSET $3
Params: ["insert-our-uuid",1,0]
Duration: 12ms

arnmishra (Author) · 2023-12-05T01:21:33Z

Upon a little more debugging, I found that another cloud run service makes a bunch of db calls at the same time. At the time of the example failure above, 28 db calls were made from that other service. Each call takes 1-7ms, averaging around 2.5ms. None of them fail, and the requests are never stuck waiting as far as I can tell (or if so it's imperceptible; all these queries combined finish in <1s).

I can't imagine how this would cause any of the failures above, given that it uses a different connection pool (also with a limit of 5 and a 10s timeout) and nothing fails in that service (nor has it ever, as far as I know). Just calling it out because it's more information.

ludralph (Collaborator) · 2023-12-05T08:00:41Z

Hi @arnmishra 👋

Thank you for taking the time to share all the details about your issue! This seems very strange indeed but we’re happy to look into this together with you 🙏.

Could you try to increase the database connection_limit? Also, what is the actual limit of database connections that’s configured on the database level?

"Every time the query that fails is on the very first query made to the db in that container. At the time of the failure, there is no other activity in that container, and only a very minimal light load, if any, on the other 4 cloud run services."

We find this to be weird because we do not expect a connection pool timeout when there hasn't been any activity.

Have you considered increasing the pool_timeout parameter? Can you also check the performance metrics and limits of your Cloud SQL Postgres database to make sure you are not hitting the maximum connection limit?
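
Both connection_limit and pool_timeout are query parameters on the Prisma connection URL. The sketch below shows one way to append them in code. The helper name, the example URL, and the values 10 and 20 are all assumptions for illustration, not recommendations from this thread.

```typescript
// Hypothetical helper: appends Prisma's pool parameters to a connection URL.
// connection_limit is the pool size; pool_timeout is in seconds.
function withPoolParams(
  baseUrl: string,
  connectionLimit: number,
  poolTimeoutSeconds: number,
): string {
  const url = new URL(baseUrl);
  url.searchParams.set("connection_limit", String(connectionLimit));
  url.searchParams.set("pool_timeout", String(poolTimeoutSeconds));
  return url.toString();
}

// Placeholder credentials and host; in the setup above this would be POSTGRES_URL.
const tunedUrl = withPoolParams("postgresql://user:password@localhost:5432/mydb", 10, 20);
console.log(tunedUrl);

// With @prisma/client installed, the result can be passed to the client, e.g.:
//   new PrismaClient({ datasources: { db: { url: tunedUrl } } })
// where "db" is the datasource name from the schema.
```

Setting the same parameters directly in the POSTGRES_URL environment variable has the same effect and needs no code change.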

arnmishra (Author) · 2023-12-05T17:15:47Z

@ludralph I'm happy to increase the timeout or pool size but I didn't want to blindly do that given it doesn't feel like we should be dealing with these timeouts. The operation it keeps timing out on is pretty time sensitive so just adding more delay felt like a dangerous solution for us in this case.

Looking at the cloud sql postgres stats, everything seems relatively standard (no issues with query latency, CPU, etc.). The connections to the db do peak at the time of the failure (from a usual 6-12 connections to 13). However, if I run SELECT * FROM pg_settings WHERE name = 'max_connections', I see the setting is a max of 400 connections to the db (and that is also documented here for a db with 8 GB of memory: https://cloud.google.com/sql/docs/postgres/flags#postgres-m).

Honestly there is no way for it to get anywhere near that 400 number given even if we had all 5 cloud run services at max usage, the connection pools only allow 5 connections each so we'd peak at 25 connections.
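
The worst-case estimate above can be written out as a quick calculation. This sketch assumes each service runs a single instance with one Prisma client. If Cloud Run scales a service to more instances, each instance would bring its own pool and the total would grow accordingly. The function name is hypothetical.

```typescript
// Hypothetical helper: upper bound on open connections if every pool is full.
function peakConnections(
  services: number,
  instancesPerService: number,
  poolLimit: number,
): number {
  return services * instancesPerService * poolLimit;
}

const maxConnections = 400; // value reported by pg_settings in this thread
const peak = peakConnections(5, 1, 5); // 5 services, 1 instance each, pool limit 5

console.log(`worst case: ${peak} of ${maxConnections} connections`);
```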

This screenshot shows the connections to the db. It's the dossStagingDb (the teal line), and the time of failure was ~4:37PM.

22 replies

RoopeHakulinen Jan 14, 2024

@arnmishra did disabling CPU throttling fix it for you?

arnmishra Jan 14, 2024
Author

yep that was it

RoopeHakulinen Jan 14, 2024

Thanks for confirming. I'm experiencing this also with v2 Cloud Functions that use Cloud Run under the hood but don't seem to provide an option to set the CPU throttling.

arnmishra Jan 14, 2024
Author

just to check, did you see the Stack Overflow suggestions?

^that's the setting on cloud run that I used, not sure if cloud functions have the same settings accessible

RoopeHakulinen Jan 14, 2024

Yeah, it's just that it is not available on Cloud Functions (at least via Terraform). Thanks anyway!

arnmishra (Author) · 2023-12-07T00:12:03Z

Is there any way to get eyes on this a little more urgently? This is a very risky situation for us, and I can't imagine what is going wrong here. Communicating once per day feels like it may take a while. I'm happy to be available later in the day if that helps with timezones.

1 reply

dominikabieder May 15, 2024

any news on that? I'm having a similar problem, tried a few things, but that didn't help.
