skills/responding-to-payment-incidents/SKILL.md

---
name: responding-to-payment-incidents
description: Diagnoses payment service incidents and chooses safe mitigations consistent with Northwind's hard rules learned from past post-mortems. Use when payments-svc is degraded, the Stripe webhook queue is backing up, or any on-call situation involving payment processing. Prevents the pod-restart anti-pattern that caused the April 2025 double-charge incident.
---

# Responding to Payment Incidents

This skill encodes Northwind's hard-won rules for handling payments-svc incidents. It exists because in April 2025 we caused a customer-facing double-charge incident by doing the obvious-but-wrong thing.

## When to use this skill

You are on-call (or assisting on-call) and:

- payments-svc P95 is elevated.
- Stripe webhook queue depth is climbing.
- Charges or refunds are failing or duplicating.
- PagerDuty fired a payments-related alert.

## The hard rule (read first)

Do NOT restart payments-svc pods to clear queue depth or relieve latency. Stripe has already received an ack for in-flight webhooks; restarting causes the new pods to re-process them, leading to duplicate charges.

This caused PM-2025-04-15 (40 customers double-charged, ~$28k in refunds, trust damage). It is the canonical reference for this rule. Every junior engineer should be told this on day one.

Instead: scale out horizontally (bump HPA max). Stripe's retry behavior tolerates delay; it does not tolerate duplicate processing.
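
A minimal sketch of the scale-out, assuming the HPA is named `payments-svc` in a `payments` namespace (both names are assumptions, not from the runbook) and using the official Kubernetes Python client:

```python
# Sketch: raise the HPA ceiling instead of restarting pods.
# Assumes an HPA named "payments-svc" in namespace "payments" (hypothetical
# names) and local kubeconfig credentials. Requires: pip install kubernetes
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() from inside the cluster
autoscaling = client.AutoscalingV2Api()

# Merge-patch only maxReplicas; 48 is the emergency ceiling from the mitigation table.
autoscaling.patch_namespaced_horizontal_pod_autoscaler(
    name="payments-svc",
    namespace="payments",
    body={"spec": {"maxReplicas": 48}},
)
```

New pods absorb the backlog while the existing ones keep their in-flight work; nothing is recycled, so acked webhooks are not replayed.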

## Severity decision

- SEV1: Customer-facing outage of payment processing, data loss, or a security incident.
- SEV2: Latency at 2x baseline or worse, partial degradation. Most queue-backup events land here.
- SEV3: Cosmetic. No customer impact.

A retry storm is almost always SEV2. Don't over-declare; don't under-declare.

## Initial response (first 5 minutes)

Incident Response Progress:
- [ ] Step 1: Acknowledge the page in PagerDuty
- [ ] Step 2: Open the #inc-YYYYMMDD-shortdesc channel (see the sketch after this checklist)
- [ ] Step 3: Post the auto-incident card and tag service owner
- [ ] Step 4: Read this skill (you're doing that now)
- [ ] Step 5: Determine severity and act
- [ ] Step 6: Update the status page only for SEV1 (or SEV2 lasting > 60 min)
- [ ] Step 7: Resolve, then write post-mortem within 72 hours
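
For step 2, a hedged sketch of opening the incident channel via the Slack API; the bot token, its scope, and the helper name are assumptions:

```python
# Sketch: open the #inc-YYYYMMDD-shortdesc channel from step 2.
# Assumes a Slack bot token with the channels:manage scope (an assumption).
# Requires: pip install slack_sdk
from datetime import datetime, timezone
from slack_sdk import WebClient

def open_incident_channel(token: str, shortdesc: str) -> str:
    """Create the dated incident channel and return its channel ID."""
    name = f"inc-{datetime.now(timezone.utc):%Y%m%d}-{shortdesc}"
    resp = WebClient(token=token).conversations_create(name=name)
    return resp["channel"]["id"]

# e.g. open_incident_channel(token, "payments-latency") -> "C0123456789"
```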

## Diagnostic flow

  1. Open the payments-svc Datadog dashboard (or your equivalent — link from runbook page).
  2. Check upstream: Stripe status page. If Stripe is having an incident, expect retry storms during their recovery.
  3. Check queue depth: If > 5,000 → page on-call (auto-alerted). If > 14,000 → SEV2.
  4. Check P95 latency: If 4x baseline sustained > 5 min → SEV2.
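
The thresholds above fold into a small helper you can sanity-check a page against. A minimal sketch; the function name and the precomputed-baseline argument are illustrative, and SEV1 (outage, data loss, security) still takes human judgment:

```python
# Sketch: encode the diagnostic thresholds above. The numbers come from the
# diagnostic flow; SEV1 is deliberately not automated.
def suggest_severity(queue_depth: int, p95_ms: float,
                     baseline_p95_ms: float, sustained_min: float) -> str:
    """Suggest a severity from queue depth and P95 latency."""
    if queue_depth > 14_000:
        return "SEV2"
    if p95_ms >= 4 * baseline_p95_ms and sustained_min > 5:
        return "SEV2"
    if queue_depth > 5_000:
        return "page on-call"  # auto-alerted; not yet a declared SEV
    return "monitor"

# e.g. suggest_severity(16_500, 900, 250, 8) -> "SEV2"
```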

## Mitigation options

| Symptom | Action | DO NOT |
| --- | --- | --- |
| Queue depth climbing during Stripe replay | Bump HPA max (currently 32; can push to 48 in an emergency) | Restart pods |
| Single pod misbehaving (OOM, stuck) | Scale that pod via the deployment; drain gracefully | Mass-restart |
| Database contention | Page DB on-call (David Park) | Restart payments-svc pods to "reset" |
| In-flight webhook duplication | This shouldn't happen; idempotency keys are mandatory (see the sketch below) | Try to "fix" by restarting |
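
Why idempotent processing is the real defense: a sketch of a dedupe check keyed on the Stripe event ID. The store and the `handle_event` helper are hypothetical stand-ins, and the in-memory set exists only to keep the sketch runnable:

```python
# Sketch: idempotent webhook handling. Verify the signature, then skip any
# event ID already handled, so a replayed webhook cannot double-charge.
# Requires: pip install stripe
import stripe

# In-memory stand-in for a durable store (hypothetical). In production this
# MUST be durable (e.g. a Postgres table keyed on event ID): a pod restart
# wipes memory, which is exactly the failure mode this rule exists for.
_processed: set[str] = set()

def handle_webhook(payload: bytes, sig_header: str, endpoint_secret: str) -> None:
    # Verifies the Stripe signature; raises on tampering or a bad secret.
    event = stripe.Webhook.construct_event(payload, sig_header, endpoint_secret)
    if event.id in _processed:  # dedupe: a replayed event is dropped
        return
    handle_event(event)         # business logic (hypothetical helper)
    _processed.add(event.id)    # record only after success

def handle_event(event) -> None:
    ...  # e.g. create the charge, itself using a Stripe idempotency key
```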

## Escalation tree

| Domain | Primary | Backup |
| --- | --- | --- |
| Payments / billing | David Park | Aisha Rahman |
| Auth / identity | Marco Silva | Jenny Liu |
| Database / Postgres | David Park | External (Neon support) |
| Frontend / CDN | Jenny Liu | (none) |

## Communication

- Internal: the #inc-* channel, plus an auto-summary every 30 min.
- External (SEV1 only): status page plus targeted email via Intercom template OUTAGE-NOTIFY-2026.
- Do NOT post to Twitter / LinkedIn. The comms team owns external messaging; loop in @comms-oncall via Slack.

## After resolution

  1. Post resolution in #inc-* channel.
  2. Required for SEV1 and SEV2: write a post-mortem within 72 hours.
  3. Use the Notion post-mortem template. No-blame language only.
  4. Action items go in GitHub, each with an owner assigned and a due date set (see the sketch below).
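
A hedged sketch of filing an action item; the repo name is an assumption, and since GitHub issues have no native due-date field, the due date rides in a label here:

```python
# Sketch: file a post-mortem action item. The repo name is hypothetical, and
# the due date is encoded in a label because issues have no due-date field.
# Requires: pip install PyGithub
from github import Github

def file_action_item(token: str, title: str, owner: str, due: str) -> None:
    repo = Github(token).get_repo("northwind/payments-svc")  # hypothetical repo
    repo.create_issue(
        title=title,
        assignee=owner,                        # owner assigned
        labels=["post-mortem", f"due:{due}"],  # due date set
    )
```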

## Reference incidents

- PM-2026-02-14: Stripe retry storm. 32 min, SEV2, no impact. Action: HPA bumped 12 → 32. Codified the no-restart rule.
- PM-2025-04-15: Pod restart double-charge. 45 min undetected, ~2 hr customer impact, 40 double-charges, ~$28k in refunds. Why we have the no-restart rule.

## Sources

- On-Call Incident Response: Notion runbook (David Park, current)
- PM-2025-04-15: GitHub post-mortem (canonical reference)
- PM-2026-02-14: GitHub post-mortem (most recent)
- #inc-20260214-payments: Slack incident channel