Auth service returns 502 on token refresh under sustained load
Issue Details
Description
The authentication service intermittently returns HTTP 502 Bad Gateway when handling token refresh requests under sustained concurrent load (>200 req/s). This causes downstream services to fail silently and users to be logged out unexpectedly.
Steps to reproduce:
- Run k6 load test script at 200+ VUs targeting /api/auth/refresh
- Observe 502 responses beginning at ~180 req/s sustained for >30 seconds
- Check auth service logs — connection pool exhaustion visible at pool_size=10
Expected behavior: Token refresh succeeds with <200ms p99 latency up to 500 req/s.
Actual behavior: 502 errors begin at ~180 req/s, and the error rate reaches 40% at 250 req/s. The DB connection pool is exhausted — max connections is set to 10 but should be 50+.
ERROR auth-service: dial tcp: connection refused (pool exhausted)
pool_max=10 pool_idle=0 pool_waiting=47 latency_p99=4821ms
WARN token-refresh: upstream returned 502, retrying (attempt 3/3)
Sub-tasks
1/4 complete

Activity
Confirmed reproduction. Pool size in staging is 10 — production config says 10 as well. This needs to be bumped to at least 50. Also need to add connection timeout handling so requests fail fast instead of queuing.
Moving to Blocked — cannot proceed until DB migration (ALPHA-298) is resolved. The migration changes the connection pooling table structure.
feat(auth): add k6 load test reproducing 502 under 200VU — branch: fix/ALPHA-341-load-test
Checked the auth service Dockerfile — the pool_size env var is hardcoded to 10 in the compose file. Production k8s configmap also has POOL_SIZE=10. This is the root cause. PR incoming to fix.
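For reference, the k8s side of the fix would be a one-line configmap change along these lines — the configmap name and surrounding fields are assumptions; only POOL_SIZE=10 comes from this ticket:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: auth-service-config   # assumed name, adjust to the real configmap
data:
  POOL_SIZE: "50"             # was "10"; target from the analysis above
```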
Story points changed from 5 → 8. Added sub-tasks for connection pool tuning and circuit breaker implementation.
Picking this up. Starting with reproduction via load test.
This is also affecting the payment retry logic (ALPHA-289) — payment service calls auth refresh internally and is seeing the same 502s. Flagging as a blocker for that ticket too.