Databases · Postgres · SRE · System Design

The Connection Pool Limit Your Load Tests Never Hit

Sean Lobjoit · 5 min read

Most database incidents start with the same alert: p99 latency spiked, error rate climbed, and by the time anyone looks at the database it is already recovering. Teams dig through slow query logs, add indexes, tune vacuum settings. The root cause is usually none of that. It is connection pool exhaustion, and it is entirely preventable.

Why Postgres Connections Are Expensive

PostgreSQL allocates a dedicated backend process for every active connection. Each one consumes shared memory, maintains its own lock state, and participates in transaction snapshot management. A single connection costs roughly 5-10 MB of RAM in overhead before it does anything useful.
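You can see this for yourself: every client connection appears in pg_stat_activity with the PID of its backend process, which you can match against ps output on the database host. A quick look (the backend_type filter assumes Postgres 10 or later):

-- Each row is a dedicated backend process; pid matches an OS process on the host
SELECT pid, usename, application_name, state, backend_start
FROM pg_stat_activity
WHERE backend_type = 'client backend';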

The default max_connections on most managed Postgres instances sits somewhere between 100 and 1,000 depending on instance size. Teams treat that number as headroom. It is a resource ceiling. On an RDS db.r6g.xlarge (32 GB RAM), sustaining 800 active connections consumes 4-8 GB in connection overhead alone, before a single query runs. At that point, Postgres is spending CPU time managing connection state rather than executing queries.
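You can check how close you already run to that ceiling with a single query; the column aliases here are just for readability:

-- Connections in use versus the configured ceiling
SELECT count(*) AS connections_in_use,
       current_setting('max_connections')::int AS max_connections,
       current_setting('max_connections')::int - count(*) AS headroom
FROM pg_stat_activity;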

The right mental model: Postgres connections are not lightweight sockets. They are OS processes. Treat them like processes.

The Failure Mode Is a Cliff, Not a Slope

This is what makes pool exhaustion dangerous. Traffic increases do not produce gradual performance degradation. They produce a vertical latency cliff.

If I were to map out the full sequence of events, it would look something like this:

  1. It's 1:57 AM, and traffic spikes to 3-5x normal load
  2. Each app server instance opens connections to handle incoming requests
  3. The connection count approaches max_connections
  4. New requests queue inside the application server waiting for a free connection
  5. Latency climbs from 80ms to 4,000ms in under two minutes
  6. Health checks start failing
  7. The incident fires at 2AM

The deceptive part: app server CPU looks fine, app server memory looks fine, and database CPU may even look low because connections are queued rather than running. Your standard load tests probably missed this because you tested at 2x traffic and did not hold load long enough for the pool to saturate.

How do you tell whether you are hitting a connection pool limit? Look for a vertical line on your p99 latency chart paired with a flood of idle in transaction connections in pg_stat_activity. Run this now if you have Postgres in production:

SELECT count(*), state
FROM pg_stat_activity
GROUP BY state
ORDER BY count DESC;

If you see hundreds of idle or idle in transaction rows, you have a pooling problem. You just have not had the incident yet.
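If the count is high, the natural follow-up is how long those sessions have been stuck and what they last ran. Something like this works on any supported Postgres version (idle_for and last_query are just aliases):

-- Idle-in-transaction sessions, longest-stuck first
SELECT pid,
       usename,
       application_name,
       now() - state_change AS idle_for,
       left(query, 60) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY idle_for DESC;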

The Fix: A Pooler Between Your App and Postgres

The solution is not raising max_connections. Higher limits mean Postgres manages more processes, which degrades performance at the exact moment you need it most. The solution is a connection pooler that multiplexes thousands of application connections through a small set of persistent backend connections.

PgBouncer (self-managed, any cloud)

PgBouncer (https://www.pgbouncer.org/) sits between your application and Postgres. In transaction pooling mode, it holds a backend connection only for the duration of a single transaction, then returns it to the pool. Five thousand application connections can share 80 backend connections without Postgres ever seeing the spike.

[pgbouncer]
pool_mode = transaction
max_client_conn = 5000
default_pool_size = 80
server_idle_timeout = 600
log_connections = 0

A reliable starting formula for pool size is (num_cpu_cores * 2) + 1. For a 4-core instance, that is 9. Most teams are surprised by how small this number is, and equally surprised by how well it works.
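Once PgBouncer is live, you can verify that the multiplexing is actually happening from its admin console: connect with psql to the special pgbouncer database (port 6432 by default, as a user listed in admin_users) and inspect the pools. A minimal check:

-- Run against PgBouncer's admin console (the pgbouncer database), not Postgres itself
SHOW POOLS;   -- cl_active / cl_waiting (clients) versus sv_active / sv_idle (backend connections)
SHOW STATS;   -- per-database transaction and query counts, average durations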

RDS Proxy (AWS managed)

If you are on AWS, RDS Proxy handles pooling, connection multiplexing, and failover transparently. It adds 1-2ms of latency per query, which is almost always worth the operational simplicity. It also integrates with IAM auth and Secrets Manager, removing hardcoded credentials from your application config.

aws rds create-db-proxy \
  --db-proxy-name app-db-proxy \
  --engine-family POSTGRESQL \
  --auth '[{"AuthScheme":"SECRETS","SecretArn":"arn:aws:secretsmanager:ap-southeast-2:123456789:secret:db-creds","IAMAuth":"REQUIRED"}]' \
  --role-arn arn:aws:iam::123456789:role/rds-proxy-role \
  --vpc-subnet-ids subnet-abc123 subnet-def456 \
  --vpc-security-group-ids sg-xyz789

Both options work. PgBouncer gives you more control and zero additional AWS cost. RDS Proxy is the right call if you want managed failover and are already deep in the AWS ecosystem.

Real Outcomes

I worked with a client in the finance industry earlier this year that was hitting 503s every time a campaign email drove a traffic surge to their API. Their setup: max_connections = 500, ten app server instances each configured to open up to 50 connections. At peak, all 500 connections were open and new requests were queuing with nowhere to go.

The fix came down to three changes:

  • Deploying PgBouncer in transaction mode with default_pool_size = 80
  • Reducing app-level pool max from 50 to 10 per instance
  • Setting server_idle_timeout = 300 to reclaim stale backend connections

Results?

  • p99 latency dropped from 4,200ms to 95ms during traffic spikes
  • Zero 503s following that deployment
  • Downgraded the RDS instance class by one tier, saving $340/month

The implementation took one engineering day. The post-incident review document would have taken longer.

If you want to audit your database architecture before it becomes a production incident, book a strategy call and we can walk through your current setup.