PM-2026-02-14: payments-svc Stripe webhook retry storm
Severity: SEV2
Duration: 32 minutes (3:00 PM – 3:32 PM PT)
Customer impact: P95 latency 4x baseline for payments-related operations. No data loss. No double charges.
Timeline
- 2:48 PM — Stripe internal incident (per their status page).
- 3:00 PM — Stripe's replay traffic hits our webhook endpoint. Queue depth balloons from <100 to 14k in ~12 min. PagerDuty fires.
- 3:04 PM — David ack'd. Aisha joined.
- 3:08 PM — Marco proposed a pod restart. David rejected it: a pod restart is how we caused the Apr 2025 double-charge incident.
- 3:09 PM — David bumps payments-svc HPA max from 12 to 24 (see the sketch after the timeline).
- 3:18 PM — Queue draining. P95 normalizing.
- 3:32 PM — Resolved.
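The 3:09 PM mitigation was a single-field change: raise the HPA's spec.maxReplicas. Below is a minimal sketch of that kind of bump using the official kubernetes Python client, assuming a client version that exposes AutoscalingV2Api; the namespace and HPA object name are placeholders, not our real manifest values.

```python
# Sketch: raise the payments-svc HPA ceiling during a retry storm.
# Namespace and HPA name are placeholders; adjust to the real manifests.
from kubernetes import client, config


def bump_hpa_max(name: str, namespace: str, new_max: int) -> None:
    """Patch only spec.maxReplicas, leaving min replicas and metrics untouched."""
    config.load_kube_config()  # or load_incluster_config() when run in-cluster
    autoscaling = client.AutoscalingV2Api()
    patch = {"spec": {"maxReplicas": new_max}}
    autoscaling.patch_namespaced_horizontal_pod_autoscaler(
        name=name, namespace=namespace, body=patch
    )


if __name__ == "__main__":
    # 2026-02-14 mitigation was 12 -> 24; the post-incident action item raised it to 32.
    bump_hpa_max(name="payments-svc", namespace="payments", new_max=24)
```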
Root cause
Our HPA was tuned for steady-state load, not retry storms. We were running 12 pods, and the queue's work-stealing pattern saturates around 800 webhooks/sec/pod. Stripe sent us ~1100/sec sustained for ~12 min.
What went well
- David caught the pod-restart proposal before it shipped, likely saving us from a repeat of the Apr 2025 double-charge incident.
- HPA bump applied cleanly, drain was orderly.
What didn't
- Our HPA max was too low: it was sized for ~1.5x baseline load, while a retry storm needs headroom for ~4x (see the sizing sketch after this list).
- Our runbook had no "what to do in a retry storm" section, only generic latency guidance.
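Back-of-envelope for the new ceiling, assuming the old max of 12 corresponded to the ~1.5x baseline headroom noted above (a sizing sketch, not a capacity model):

```python
import math

# Sketch: size the HPA max for retry-storm headroom instead of steady state.
# Assumes the old max of 12 pods represented ~1.5x baseline headroom.
old_max = 12
old_headroom = 1.5      # what the old max was tuned for
target_headroom = 4.0   # what a Stripe retry storm demands

pods_at_baseline = old_max / old_headroom                 # ~8 pods
new_max = math.ceil(pods_at_baseline * target_headroom)   # 32
print(new_max)
```

That arithmetic is where the 32 in the action items below comes from.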
Action items
- Bump payments-svc HPA max to 32 (David, done 2026-02-15).
- Add "retry storm" section to incident runbook (David, done 2026-02-17).
- Implement PagerDuty alert for queue depth > 5k (Aisha, due 2026-03-01); see the alert sketch after this list.
- Codify "do not restart payments-svc pods" as a runbook hard rule (David, done 2026-02-17).
Lessons
- Hard rule: Do NOT restart payments-svc pods to clear queue depth. Scale out instead.
- Hard rule: Anything > 5k queue depth → page on-call.