PM-2025-04-15: payments-svc pod restart caused 40 double-charges

David Park · Staff SRE · updated 1y ago

Severity: SEV1

Duration: 45 min until detected; ~2 hours of customer impact for affected accounts.

Customer impact: 40 customers double-charged; ~$28k refunded; trust damage.

Summary

During a routine queue-depth spike, the on-call engineer at the time (Marco) restarted the payments-svc deployment to clear the queue. Stripe had already received acks for ~40 webhooks, which the new pods then re-processed, causing duplicate charges.

Lessons (codified in runbook)

  • Never restart payments-svc pods to clear backpressure. Scale out.
  • Use idempotency keys on webhook handlers (implemented post-incident).
  • Queue depth alerting thresholds.
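
The idempotency lesson above can be illustrated with a short sketch. This post-mortem does not describe how the fix was actually built, so the following is only a minimal illustration of the general technique: deduplicate on the webhook event's `id` before applying any side effect. The `WebhookProcessor` class and the `customer` and `amount_cents` fields are hypothetical, and the in-memory set stands in for what would need to be a durable store shared across pods.

```python
from dataclasses import dataclass, field


@dataclass
class WebhookProcessor:
    """Hypothetical handler that applies each webhook event at most once."""

    # Assumption: an in-memory set, for illustration only. A real service
    # would need a durable store shared by all pods (for example a table
    # with a unique constraint on the event ID); otherwise the record of
    # processed events is lost on exactly the restart it must survive.
    seen_event_ids: set = field(default_factory=set)
    charges: list = field(default_factory=list)

    def handle(self, event: dict) -> bool:
        """Apply the event; return False if it was already processed."""
        event_id = event["id"]  # the event ID serves as the idempotency key
        if event_id in self.seen_event_ids:
            return False  # replayed delivery: skip the side effect
        self.seen_event_ids.add(event_id)
        # Hypothetical side effect: record the charge.
        self.charges.append((event["customer"], event["amount_cents"]))
        return True


processor = WebhookProcessor()
event = {"id": "evt_001", "customer": "cus_A", "amount_cents": 70000}

first = processor.handle(event)   # processed before the restart
replay = processor.handle(event)  # same event re-processed after the restart
```

In this sketch the second call returns `False` and only one charge is recorded, which is the behavior that would have prevented the duplicates described in the summary.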

This is the canonical "why we don't restart payments pods" reference. Junior engineers may not know this, so please include it in onboarding.