PM-2025-04-15: payments-svc pod restart caused 40 double-charges
Severity: SEV1
Duration: 45 min until detected; ~2 hours of customer impact for affected accounts.
Customer impact: 40 customers double-charged; ~$28k refunded; trust damage.
Summary
During a routine queue-depth spike, the on-call engineer at the time (Marco) restarted the payments-svc deployment to clear the queue. Stripe had already received acks for ~40 webhooks, which the new pods then re-processed, causing duplicate charges.
Lessons (codified in runbook)
- Never restart payments-svc pods to clear backpressure. Scale out instead.
- Idempotency keys on webhook handlers (implemented post-incident).
- Queue depth alerting thresholds.
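To illustrate the idempotency-key lesson, here is a minimal sketch of a webhook handler that skips events it has already processed. It is not the payments-svc implementation: the names `make_store`, `handle_webhook`, `processed_events`, and `apply_charge` are hypothetical, SQLite stands in for whatever durable store the service actually uses, and the sketch assumes each webhook carries a unique event ID.

```python
import sqlite3


def make_store(path=":memory:"):
    """Open a store that remembers which webhook event IDs were handled.

    Hypothetical helper: SQLite is only a stand-in for a durable, shared store.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )
    conn.commit()
    return conn


def handle_webhook(conn, event_id, apply_charge):
    """Run apply_charge(event_id) at most once per event_id.

    Returns True if the event was processed, False if it was a duplicate.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        try:
            # The primary key makes this insert the idempotency check:
            # a second delivery of the same event ID violates the constraint.
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
        except sqlite3.IntegrityError:
            return False  # already handled, e.g. replayed after a pod restart
        # If this raises, the insert above is rolled back so a retry can succeed.
        apply_charge(event_id)
    return True
```

With this shape, a replay of the same event after a restart returns False instead of charging again. Note that the rollback only undoes the database insert; if `apply_charge` has an external side effect, that call needs its own idempotency protection as well.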
This is the canonical "why we don't restart payments pods" reference. Junior engineers may not know this, so please include it in onboarding.