JiraDash
ALPHA-341 · Bug · Critical

Auth service returns 502 on token refresh under sustained load

Issue Details

Assignee: Marcus Webb
Reporter: Arjun Kapoor
Status: Blocked
Priority: Critical
Sprint: SP-24
Epic: Auth & Security
Story Points: 8
Component: Auth Service
Labels: backend, p0, reliability
Created: May 12, 2026
Updated: May 14, 2026
Due Date: May 15, 2026

Description

The authentication service intermittently returns HTTP 502 Bad Gateway when handling token refresh requests under sustained concurrent load (>200 req/s). This causes downstream services to fail silently and users to be logged out unexpectedly.

Steps to reproduce:

  1. Run k6 load test script at 200+ VUs targeting /api/auth/refresh
  2. Observe 502 responses beginning at ~180 req/s sustained for >30 seconds
  3. Check auth service logs — connection pool exhaustion visible at pool_size=10

Expected behavior: Token refresh succeeds with <200ms p99 latency up to 500 req/s.

Actual behavior: 502 errors begin at 180 req/s, error rate reaches 40% at 250 req/s. DB connection pool exhausted — max connections set to 10, should be 50+.

Log excerpt:

ERROR auth-service: dial tcp: connection refused (pool exhausted)
pool_max=10 pool_idle=0 pool_waiting=47 latency_p99=4821ms
WARN token-refresh: upstream returned 502, retrying (attempt 3/3)

Sub-tasks

1/4 complete
✓ ALPHA-341-1: Reproduce 502 under load test — k6 script · Done · Marcus Webb · 1 pt
ALPHA-341-2: Identify token refresh endpoint bottleneck via profiling · In Progress · Marcus Webb · 2 pt
ALPHA-341-3: Implement connection pool tuning for auth DB · To Do · Unassigned · 3 pt
ALPHA-341-4: Add circuit breaker to token refresh path · To Do · Unassigned · 2 pt

Linked Issues

3 links
blocks ALPHA-289 (Feature, High): Add retry logic to payment processor service · In Progress
is blocked by ALPHA-298 (Bug, Critical): DB migration fails on staging — column type mismatch · Blocked
relates to ALPHA-317 (Task, Medium): Auth service horizontal scaling spike · Done

Activity

Arjun Kapoor · May 14, 2026 · 09:41

Confirmed reproduction. Pool size in staging is 10 — production config says 10 as well. This needs to be bumped to at least 50. Also need to add connection timeout handling so requests fail fast instead of queuing.

Marcus Webb moved the issue from In Progress to Blocked · May 14, 2026 · 08:22

Moving to Blocked — cannot proceed until DB migration (ALPHA-298) is resolved. The migration changes the connection pooling table structure.

Marcus Webb · commit a3f92c1 · May 13, 2026 · 17:05

feat(auth): add k6 load test reproducing 502 under 200VU — branch: fix/ALPHA-341-load-test

Dev Patel · May 13, 2026 · 14:30

Checked the auth service Dockerfile — the pool_size env var is hardcoded to 10 in the compose file. Production k8s configmap also has POOL_SIZE=10. This is the root cause. PR incoming to fix.

Arjun Kapoor · May 12, 2026 · 16:18

Story points changed from 5 → 8. Added sub-tasks for connection pool tuning and circuit breaker implementation.

Marcus Webb moved the issue from To Do to In Progress · May 12, 2026 · 11:45

Picking this up. Starting with reproduction via load test.

Priya Sharma · May 12, 2026 · 10:02

This is also affecting the payment retry logic (ALPHA-289) — payment service calls auth refresh internally and is seeing the same 502s. Flagging as a blocker for that ticket too.