inc-20260214-payments

Payments-svc latency spike — Stripe webhook storm

David Park · Staff SRE · updated 2mo ago

David Park [Feb 14, 3:00 PM] SEV2 declared. payments-svc P95 has been 4x baseline for last 12 min. Webhook queue depth at 14k.

Aisha Rahman [Feb 14, 3:04 PM] Looking. Stripe just sent us a retry storm — they had a downstream issue at 2:48 and now their replay is hitting us hard.

Marco Silva [Feb 14, 3:08 PM] Should we restart the pods to clear the queue?

David Park [Feb 14, 3:09 PM] NO. Do not restart payments-svc pods. We'll lose in-flight webhook deliveries that Stripe has already marked acknowledged. Last time we did that we double-charged ~40 customers.

David Park We need to scale out horizontally. I'm bumping the HPA max from 12 to 24. Already done. Should drain in ~8 minutes.
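
For reference, the HPA change described above could be expressed as a merge patch along these lines. This is a minimal sketch, not the command that was actually run: the thread does not say how the change was applied, and the HPA object name `payments-svc` and the namespace `payments` are assumptions made for illustration.

```python
import json
from typing import List


def hpa_max_patch(new_max: int) -> str:
    """Return a JSON merge patch that changes only spec.maxReplicas."""
    if new_max < 1:
        raise ValueError("maxReplicas must be at least 1")
    return json.dumps({"spec": {"maxReplicas": new_max}})


def kubectl_patch_args(name: str, namespace: str, new_max: int) -> List[str]:
    """Build the argv for `kubectl patch hpa`; nothing is executed here."""
    return [
        "kubectl", "patch", "hpa", name,
        "-n", namespace,
        "--type", "merge",
        "-p", hpa_max_patch(new_max),
    ]


# Hypothetical object name and namespace; 24 is the new max from the thread.
ARGS = kubectl_patch_args("payments-svc", "payments", 24)
print(" ".join(ARGS))
```

Raising `maxReplicas` only lifts the ceiling; the autoscaler still decides the replica count from its metrics. Unlike a restart, scaling out leaves the existing pods running.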

Aisha Rahman [Feb 14, 3:18 PM] Queue depth dropping. P95 normalizing.

David Park [Feb 14, 3:32 PM] Resolved. Total impact: P95 elevated for 32 min, no data loss, no double-charges. Will write up post-mortem PM-2026-02-14.

Sarah Chen [Feb 14, 6:30 PM] @David — should I status-page this retroactively for the 2 enterprise accounts that asked?

David Park No public status page for SEV2 under 60 min by policy. Targeted email to affected enterprise accounts only. Ping me their list and I'll generate the impact summary.
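
The notification rule stated above can be sketched as a small check. It covers only the one case this thread spells out (SEV2 shorter than 60 minutes); the function name and return labels are hypothetical, and every other case is deliberately left to the written policy.

```python
# From the thread: a SEV2 shorter than this gets no public status page.
PUBLIC_PAGE_MIN_MINUTES = 60


def notification_channel(severity: str, duration_min: int) -> str:
    """Return the channel for the one case the thread covers."""
    if severity.upper() == "SEV2" and duration_min < PUBLIC_PAGE_MIN_MINUTES:
        # Targeted email to affected enterprise accounts only.
        return "targeted_email"
    # Other severities and durations are not described in this thread.
    return "check_policy"


# This incident: SEV2, P95 elevated for 32 minutes.
print(notification_channel("SEV2", 32))
```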