---
name: responding-to-payment-incidents
description: Diagnoses payment service incidents and chooses safe mitigations consistent with Northwind's hard rules learned from past post-mortems. Use when payments-svc is degraded, the Stripe webhook queue is backing up, or any on-call situation involving payment processing. Prevents the pod-restart anti-pattern that caused the April 2025 double-charge incident.
---
# Responding to Payment Incidents
This skill encodes Northwind's hard-won rules for handling payments-svc incidents. It exists because in April 2025 we caused a customer-facing double-charge incident by doing the obvious-but-wrong thing.
## When to use this skill
You are on-call (or assisting on-call) and:
- payments-svc P95 is elevated.
- Stripe webhook queue depth is climbing.
- Charges or refunds are failing or duplicating.
- PagerDuty fired a payments-related alert.
## The hard rule (read first)
Do NOT restart payments-svc pods to clear queue depth or relieve latency. Stripe has already received acknowledgments for the in-flight webhooks, so it will not resend them; restarting causes the replacement pods to re-process those queued events, leading to duplicate charges.
This caused PM-2025-04-15 (40 customers double-charged, ~$28k in refunds, trust damage). It is the canonical reference for this rule. Every junior engineer should be told this on day one.
Instead: scale out horizontally (bump HPA max). Stripe's retry behavior tolerates delay; it does not tolerate duplicate processing.
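In an emergency the bump is a single patch. Below is a minimal sketch using the Python `kubernetes` client, assuming kubeconfig or in-cluster auth; the HPA name and namespace are hypothetical, and 48 is the emergency ceiling from the mitigation table below:

```python
# Minimal sketch: raise the HPA ceiling instead of restarting pods.
# Assumes the "kubernetes" Python client with kubeconfig/in-cluster auth.
# HPA name and namespace are hypothetical; 48 is the emergency max
# named in this runbook.
from kubernetes import client, config

def bump_hpa_max(name: str = "payments-svc", namespace: str = "payments",
                 new_max: int = 48) -> None:
    config.load_kube_config()  # use config.load_incluster_config() in-cluster
    api = client.AutoscalingV1Api()
    # Patch only maxReplicas; leave minReplicas and the scaling target alone.
    api.patch_namespaced_horizontal_pod_autoscaler(
        name, namespace, {"spec": {"maxReplicas": new_max}}
    )

bump_hpa_max()
```

An equivalent `kubectl patch hpa` one-liner works too; the point is to change only `maxReplicas` and touch nothing that recycles pods.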
## Severity decision
- SEV1: Customer-facing outage of payment processing, data loss, security incident.
- SEV2: Latency at 2x baseline or more, partial degradation. Covers most queue-backup events.
- SEV3: Cosmetic. No customer impact.
A retry storm is almost always SEV2. Don't over-declare; don't under-declare.
## Initial response (first 5 minutes)
Incident Response Progress:
- [ ] Step 1: Acknowledge the page in PagerDuty
- [ ] Step 2: Open #inc-YYYYMMDD-shortdesc channel
- [ ] Step 3: Post the auto-incident card and tag the service owner (steps 2-3 can be scripted; see the sketch after this checklist)
- [ ] Step 4: Read this skill (you're doing that now)
- [ ] Step 5: Determine severity and act
- [ ] Step 6: Status page only if SEV1 (or SEV2 > 60 min)
- [ ] Step 7: Resolve, then write post-mortem within 72 hours
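A minimal sketch of steps 2-3 with `slack_sdk`, assuming a bot token in `SLACK_TOKEN`; the channel naming follows this checklist, but the card text and owner handle are placeholders, not Northwind's actual tooling:

```python
# Minimal sketch of checklist steps 2-3: open the #inc-YYYYMMDD-shortdesc
# channel and post an incident card. Token, card text, and owner handle
# are placeholders.
import os
from datetime import date
from slack_sdk import WebClient

def open_incident_channel(shortdesc: str, owner: str) -> str:
    slack = WebClient(token=os.environ["SLACK_TOKEN"])
    name = f"inc-{date.today():%Y%m%d}-{shortdesc}"
    channel_id = slack.conversations_create(name=name)["channel"]["id"]
    slack.chat_postMessage(
        channel=channel_id,
        text=f":rotating_light: Incident opened: {shortdesc}. Owner: {owner}",
    )
    return channel_id

open_incident_channel("payments-queue", "@payments-oncall")
```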
## Diagnostic flow
- Open the payments-svc Datadog dashboard (or your equivalent — link from runbook page).
- Check upstream: Stripe status page. If Stripe is having an incident, expect retry storms during their recovery.
- Check queue depth: If > 5,000 → page on-call (auto-alerted). If > 14,000 → SEV2.
- Check P95 latency: If 4x baseline sustained > 5 min → SEV2. (The sketch below encodes these thresholds.)
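The numeric rules above reduce to a small function. A sketch in plain Python; the thresholds come from this section, while metric collection and the example values are hypothetical:

```python
# Minimal sketch of the triage thresholds above (5,000 / 14,000 queue depth;
# 4x baseline P95 sustained > 5 min). Fetching the metrics is out of scope.
def triage(queue_depth: int, p95_ms: float, baseline_p95_ms: float,
           sustained_min: float) -> str:
    if queue_depth > 14_000:
        return "SEV2: queue depth over 14k"
    if p95_ms >= 4 * baseline_p95_ms and sustained_min > 5:
        return "SEV2: P95 at 4x baseline sustained over 5 min"
    if queue_depth > 5_000:
        return "page on-call: queue depth over 5k (auto-alerted)"
    return "monitor: below paging thresholds"

print(triage(queue_depth=6_200, p95_ms=900, baseline_p95_ms=250,
             sustained_min=12))
# -> page on-call: queue depth over 5k (auto-alerted)
```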
## Mitigation options
| Symptom | Action | DO NOT |
|---|---|---|
| Queue depth climbing during Stripe replay | Bump HPA max (currently 32, can push to 48 in emergency) | Restart pods |
| Single pod misbehaving (OOM, stuck) | Replace just that pod through its Deployment, draining it gracefully | Mass-restart |
| Database contention | Page DB on-call (David Park) | Restart payments-svc pods to "reset" |
| In-flight webhook duplication | This shouldn't happen; idempotency keys are mandatory (see the sketch below) | Try to "fix" by restarting |
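The last row works because every Stripe write carries an idempotency key. A minimal sketch with the `stripe` Python library; deriving the key from the webhook event ID is illustrative, not necessarily Northwind's actual scheme:

```python
# Minimal sketch of the idempotency-key pattern. The key derivation from the
# webhook event ID is illustrative; Northwind's actual scheme may differ.
import stripe

stripe.api_key = "sk_test_..."  # placeholder

def charge_for_event(event_id: str, customer_id: str, amount_cents: int):
    # Same event -> same key -> Stripe returns the original result instead
    # of creating a second charge, even if a pod re-processes the event.
    return stripe.PaymentIntent.create(
        amount=amount_cents,
        currency="usd",
        customer=customer_id,
        idempotency_key=f"webhook-{event_id}",
    )
```

Treat keys as a backstop, not a license to restart; the hard rule above still applies.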
## Escalation tree
| Domain | Primary | Backup |
|---|---|---|
| Payments / billing | David Park | Aisha Rahman |
| Auth / identity | Marco Silva | Jenny Liu |
| Database / Postgres | David Park | external (Neon support) |
| Frontend / CDN | Jenny Liu | (none) |
## Communication
- Internal: #inc-* channel + auto-summary every 30 min.
- External (SEV1 only): status page + targeted email via Intercom template OUTAGE-NOTIFY-2026.
- Do NOT post to Twitter / LinkedIn. The comms team owns external communication; loop in @comms-oncall via Slack.
## After resolution
- Post resolution in #inc-* channel.
- Required for SEV1 and SEV2: write a post-mortem within 72 hours.
- Use the Notion post-mortem template. No-blame language only.
- Action items go in GitHub, each with an assigned owner and a due date (see the sketch below).
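For the GitHub step, a minimal sketch with PyGithub; the repo name and label are placeholders, and the due date goes in the title because GitHub issues have no native due-date field:

```python
# Minimal sketch: file a post-mortem action item as a GitHub issue with an
# assigned owner. Repo name and label are placeholders.
import os
from github import Github

def file_action_item(title: str, owner: str, due: str) -> None:
    repo = Github(os.environ["GITHUB_TOKEN"]).get_repo("northwind/payments-svc")
    repo.create_issue(
        title=f"[PM action] {title} (due {due})",
        assignee=owner,
        labels=["post-mortem"],
    )

file_action_item("Alert when HPA hits maxReplicas", "dpark", due="2026-03-15")
```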
## Reference incidents
- PM-2026-02-14 — Stripe retry storm. 32 min, SEV2, no impact. Action: HPA bumped 12 → 32. Codified the no-restart rule.
- PM-2025-04-15 — Pod restart double-charge. 45 min undetected, ~2 hr customer impact, 40 double-charges, ~$28k refunds. Why we have the no-restart rule.
## Sources
- On-Call Incident Response — Notion runbook (David Park, current)
- PM-2025-04-15 — GitHub post-mortem (canonical reference)
- PM-2026-02-14 — GitHub post-mortem (most recent)
- #inc-20260214-payments — Slack incident channel