---
name: responding-to-payment-incidents
description: Diagnoses payment service incidents and chooses safe mitigations consistent with Northwind's hard rules learned from past post-mortems. Use when payments-svc is degraded, the Stripe webhook queue is backing up, or any on-call situation involving payment processing. Prevents the pod-restart anti-pattern that caused the April 2025 double-charge incident.
---
# Responding to Payment Incidents
This skill encodes Northwind's hard-won rules for handling payments-svc incidents. It exists because in April 2025 we caused a customer-facing double-charge incident by doing the obvious-but-wrong thing.
## When to use this skill
You are on-call (or assisting on-call) and:
- payments-svc P95 is elevated.
- Stripe webhook queue depth is climbing.
- Charges or refunds are failing or duplicating.
- PagerDuty fired a payments-related alert.
## The hard rule (read first)
Do NOT restart payments-svc pods to clear queue depth or relieve latency. Stripe has already received acknowledgments for the in-flight webhooks, so it will not resend them; restarting causes the replacement pods to re-process those queued events, leading to duplicate charges.
This caused PM-2025-04-15 (40 customers double-charged, ~$28k in refunds, trust damage). It is the canonical reference for this rule. Every junior engineer should be told this on day one.
Instead: scale out horizontally (bump HPA max). Stripe's retry behavior tolerates delay; it does not tolerate duplicate processing.
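In an emergency the bump is a single patch. Below is a minimal sketch using the Python `kubernetes` client, assuming kubeconfig or in-cluster auth; the HPA name and namespace are hypothetical, and 48 is the emergency ceiling from the mitigation table below:

```python
# Minimal sketch: raise the HPA ceiling instead of restarting pods.
# Assumes the "kubernetes" Python client with kubeconfig/in-cluster auth.
# HPA name and namespace are hypothetical; 48 is the emergency max
# named in this runbook.
from kubernetes import client, config

def bump_hpa_max(name: str = "payments-svc", namespace: str = "payments",
                 new_max: int = 48) -> None:
    config.load_kube_config()  # use config.load_incluster_config() in-cluster
    api = client.AutoscalingV1Api()
    # Patch only maxReplicas; leave minReplicas and the scaling target alone.
    api.patch_namespaced_horizontal_pod_autoscaler(
        name, namespace, {"spec": {"maxReplicas": new_max}}
    )

bump_hpa_max()
```

An equivalent `kubectl patch hpa` one-liner works too; the point is to change only `maxReplicas` and touch nothing that recycles pods.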
## Severity decision
- SEV1: Customer-facing outage of payment processing, data loss, security incident.
- SEV2: Latency at 2x baseline or more, partial degradation. Covers most queue-backup events.
- SEV3: Cosmetic. No customer impact.
A retry storm is almost always SEV2. Don't over-declare; don't under-declare.
## Initial response (first 5 minutes)
Incident Response Progress:
- [ ] Step 1: Acknowledge the page in PagerDuty
- [ ] Step 2: Open #inc-YYYYMMDD-shortdesc channel
- [ ] Step 3: Post the auto-incident card and tag the service owner (steps 2-3 can be scripted; see the sketch after this checklist)
- [ ] Step 4: Read this skill (you're doing that now)
- [ ] Step 5: Determine severity and act
- [ ] Step 6: Status page only if SEV1 (or SEV2 > 60 min)
- [ ] Step 7: Resolve, then write post-mortem within 72 hours
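A minimal sketch of steps 2-3 with `slack_sdk`, assuming a bot token in `SLACK_TOKEN`; the channel naming follows this checklist, but the card text and owner handle are placeholders, not Northwind's actual tooling:

```python
# Minimal sketch of checklist steps 2-3: open the #inc-YYYYMMDD-shortdesc
# channel and post an incident card. Token, card text, and owner handle
# are placeholders.
import os
from datetime import date
from slack_sdk import WebClient

def open_incident_channel(shortdesc: str, owner: str) -> str:
    slack = WebClient(token=os.environ["SLACK_TOKEN"])
    name = f"inc-{date.today():%Y%m%d}-{shortdesc}"
    channel_id = slack.conversations_create(name=name)["channel"]["id"]
    slack.chat_postMessage(
        channel=channel_id,
        text=f":rotating_light: Incident opened: {shortdesc}. Owner: {owner}",
    )
    return channel_id

open_incident_channel("payments-queue", "@payments-oncall")
```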
## Diagnostic flow
- Open the payments-svc Datadog dashboard (or your equivalent — link from runbook page).
- Check upstream: Stripe status page. If Stripe is having an incident, expect retry storms during their recovery.
- Check queue depth: If > 5,000 → page on-call (auto-alerted). If > 14,000 → SEV2.
- Check P95 latency: If 4x baseline sustained > 5 min → SEV2. (The sketch below encodes these thresholds.)
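The numeric rules above reduce to a small function. A sketch in plain Python; the thresholds come from this section, while metric collection and the example values are hypothetical:

```python
# Minimal sketch of the triage thresholds above (5,000 / 14,000 queue depth;
# 4x baseline P95 sustained > 5 min). Fetching the metrics is out of scope.
def triage(queue_depth: int, p95_ms: float, baseline_p95_ms: float,
           sustained_min: float) -> str:
    if queue_depth > 14_000:
        return "SEV2: queue depth over 14k"
    if p95_ms >= 4 * baseline_p95_ms and sustained_min > 5:
        return "SEV2: P95 at 4x baseline sustained over 5 min"
    if queue_depth > 5_000:
        return "page on-call: queue depth over 5k (auto-alerted)"
    return "monitor: below paging thresholds"

print(triage(queue_depth=6_200, p95_ms=900, baseline_p95_ms=250,
             sustained_min=12))
# -> page on-call: queue depth over 5k (auto-alerted)
```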
## Mitigation options
| Symptom | Action | DO NOT |
|---|---|---|
| Queue depth climbing during Stripe replay | Bump HPA max (currently 32, can push to 48 in emergency) | Restart pods |
| Single pod misbehaving (OOM, stuck) | Replace just that pod through its Deployment, draining it gracefully | Mass-restart |
| Database contention | Page DB on-call (David Park) | Restart payments-svc pods to "reset" |
| In-flight webhook duplication | This shouldn't happen; idempotency keys are mandatory (see the sketch below) | Try to "fix" by restarting |
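The last row works because every Stripe write carries an idempotency key. A minimal sketch with the `stripe` Python library; deriving the key from the webhook event ID is illustrative, not necessarily Northwind's actual scheme:

```python
# Minimal sketch of the idempotency-key pattern. The key derivation from the
# webhook event ID is illustrative; Northwind's actual scheme may differ.
import stripe

stripe.api_key = "sk_test_..."  # placeholder

def charge_for_event(event_id: str, customer_id: str, amount_cents: int):
    # Same event -> same key -> Stripe returns the original result instead
    # of creating a second charge, even if a pod re-processes the event.
    return stripe.PaymentIntent.create(
        amount=amount_cents,
        currency="usd",
        customer=customer_id,
        idempotency_key=f"webhook-{event_id}",
    )
```

Treat keys as a backstop, not a license to restart; the hard rule above still applies.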
## Escalation tree
| Domain | Primary | Backup |
|---|---|---|
| Payments / billing | David Park | Aisha Rahman |
| Auth / identity | Marco Silva | Jenny Liu |
| Database / Postgres | David Park | external (Neon support) |
| Frontend / CDN | Jenny Liu | (none) |
## Communication
- Internal: #inc-* channel + auto-summary every 30 min.
- External (SEV1 only): status page + targeted email via Intercom template OUTAGE-NOTIFY-2026.
- Do NOT post to Twitter / LinkedIn. The comms team owns external communication; loop in @comms-oncall via Slack.
## After resolution
- Post resolution in #inc-* channel.
- Required for SEV1 and SEV2: write a post-mortem within 72 hours.
- Use the Notion post-mortem template. No-blame language only.
- Action items go in GitHub, each with an assigned owner and a due date (see the sketch below).
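For the GitHub step, a minimal sketch with PyGithub; the repo name and label are placeholders, and the due date goes in the title because GitHub issues have no native due-date field:

```python
# Minimal sketch: file a post-mortem action item as a GitHub issue with an
# assigned owner. Repo name and label are placeholders.
import os
from github import Github

def file_action_item(title: str, owner: str, due: str) -> None:
    repo = Github(os.environ["GITHUB_TOKEN"]).get_repo("northwind/payments-svc")
    repo.create_issue(
        title=f"[PM action] {title} (due {due})",
        assignee=owner,
        labels=["post-mortem"],
    )

file_action_item("Alert when HPA hits maxReplicas", "dpark", due="2026-03-15")
```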
## Reference incidents
- PM-2026-02-14 — Stripe retry storm. 32 min, SEV2, no impact. Action: HPA bumped 12 → 32. Codified the no-restart rule.
- PM-2025-04-15 — Pod restart double-charge. 45 min undetected, ~2 hr customer impact, 40 double-charges, ~$28k refunds. Why we have the no-restart rule.
## Sources
- On-Call Incident Response — Notion runbook (David Park, current)
- PM-2025-04-15 — GitHub post-mortem (canonical reference)
- PM-2026-02-14 — GitHub post-mortem (most recent)
- #inc-20260214-payments — Slack incident channel