PM-2026-02-14: payments-svc Stripe webhook retry storm

David Park · Staff SRE

Severity: SEV2
Duration: 32 minutes (3:00 PM – 3:32 PM PT)
Customer impact: P95 latency at 4x baseline for payments-related operations. No data loss. No double charges.

Timeline

  • 2:48 PM — Stripe internal incident (per their status page).
  • 3:00 PM — Stripe's replay traffic hits our webhook endpoint. Queue depth balloons from <100 to 14k in ~12 min. PagerDuty fires.
  • 3:04 PM — David ack'd. Aisha joined.
  • 3:08 PM — Marco proposed a pod restart. Rejected by David — that's how we caused the Apr 2025 double-charge incident.
  • 3:09 PM — David bumps the payments-svc HPA max from 12 to 24 (see the patch sketch after the timeline).
  • 3:18 PM — Queue draining. P95 normalizing.
  • 3:32 PM — Resolved.
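
For reference, the 3:09 PM mitigation was a one-field change to the HPA object. A minimal sketch of the merge patch, assuming the HPA is named payments-svc in a payments namespace (both names are assumptions, not taken from the incident record):

    # bump-hpa-max.yaml: merge patch raising the autoscaler ceiling mid-incident.
    # Object name and namespace are assumed; the field change matches the timeline entry.
    spec:
      maxReplicas: 24   # was 12

Applied with something like: kubectl patch hpa payments-svc -n payments --type merge --patch-file bump-hpa-max.yaml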

Root cause

Our HPA was tuned for steady-state load, not retry storms. We were running 12 pods, and the queue's work-stealing pattern saturates around 800 webhooks/sec/pod; Stripe sent us ~1100/sec sustained for ~12 min.
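
For context, the pre-incident tuning probably looked something like the sketch below. This is a reconstruction, not the actual manifest; the CPU target, replica floor, and object names are assumptions:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: payments-svc            # assumed object name
      namespace: payments           # assumed namespace
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: payments-svc
      minReplicas: 4                # illustrative steady-state floor
      maxReplicas: 12               # the ceiling that proved too low (now 32 per action items)
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70   # illustrative steady-state target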

What went well

  • David caught the pod-restart proposal before it shipped, likely saving us from a repeat of the Apr 2025 double-charge incident.
  • The HPA bump applied cleanly, and the queue drain was orderly.

What didn't

  • Our HPA max was too low; it should be tuned for 4x baseline load, not 1.5x.
  • Our runbook had no "what to do in a retry storm" section — only generic latency guidance.

Action items

  • Bump payments-svc HPA max to 32 (David, done 2026-02-15).
  • Add "retry storm" section to incident runbook (David, done 2026-02-17).
  • Implement PagerDuty alert for queue depth > 5k (Aisha, due 2026-03-01); see the alert rule sketch after this list.
  • Codify "do not restart payments-svc pods" as a runbook hard rule (David, done 2026-02-17).
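
One plausible shape for the queue-depth alert, assuming a Prometheus-style rules file routing to PagerDuty; the metric name payments_webhook_queue_depth and the severity label are hypothetical:

    groups:
      - name: payments-svc-queue
        rules:
          - alert: PaymentsWebhookQueueDepthHigh
            expr: payments_webhook_queue_depth > 5000   # metric name is hypothetical
            for: 2m                                     # illustrative debounce window
            labels:
              severity: page                            # assumed label that routes to PagerDuty
            annotations:
              summary: "payments-svc webhook queue depth above 5k (possible retry storm)"
              runbook: "incident runbook, retry storm section"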

Lessons

  • Hard rule: Do NOT restart payments-svc pods to clear queue depth. Scale out instead.
  • Hard rule: Queue depth > 5k → page the on-call.