Auth service returns 502 on token refresh under sustained load
Issue Details
Description
The authentication service intermittently returns HTTP 502 Bad Gateway when handling token refresh requests under sustained concurrent load (>200 req/s). This causes downstream services to fail silently and users to be logged out unexpectedly.
Steps to reproduce:
- Run k6 load test script at 200+ VUs targeting /api/auth/refresh
- Observe 502 responses beginning at ~180 req/s sustained for >30 seconds
- Check auth service logs — connection pool exhaustion visible at pool_size=10
Expected behavior: Token refresh succeeds with <200ms p99 latency up to 500 req/s.
Actual behavior: 502 errors begin at ~180 req/s, and the error rate reaches 40% at 250 req/s. The DB connection pool is exhausted — max connections is set to 10 but should be 50+.
ERROR auth-service: dial tcp: connection refused (pool exhausted)
pool_max=10 pool_idle=0 pool_waiting=47 latency_p99=4821ms
WARN token-refresh: upstream returned 502, retrying (attempt 3/3)
Sub-tasks
1/4 complete

Activity
Confirmed reproduction. Pool size in staging is 10 — production config says 10 as well. This needs to be bumped to at least 50. Also need to add connection timeout handling so requests fail fast instead of queuing.
Moving to Blocked — cannot proceed until DB migration (ALPHA-298) is resolved. The migration changes the connection pooling table structure.
feat(auth): add k6 load test reproducing 502 under 200VU — branch: fix/ALPHA-341-load-test
Checked the auth service Dockerfile — the pool_size env var is hardcoded to 10 in the compose file. Production k8s configmap also has POOL_SIZE=10. This is the root cause. PR incoming to fix.
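For reference, the k8s side of the fix would be a one-line configmap change along these lines — the configmap name and surrounding fields are assumptions; only POOL_SIZE=10 comes from this ticket:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: auth-service-config   # assumed name, adjust to the real configmap
data:
  POOL_SIZE: "50"             # was "10"; target from the analysis above
```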
Story points changed from 5 → 8. Added sub-tasks for connection pool tuning and circuit breaker implementation.
Picking this up. Starting with reproduction via load test.
This is also affecting the payment retry logic (ALPHA-289) — payment service calls auth refresh internally and is seeing the same 502s. Flagging as a blocker for that ticket too.