On-Call Incident Response — Engineering
Severity definitions
- SEV1: Customer-facing outage, data loss, or security incident. Revenue impact.
- SEV2: Major degradation (P95 latency more than 2x baseline, or partial outage of one tier).
- SEV3: Minor or cosmetic. No customer impact.
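The definitions above can be read as a simple decision order: check SEV1 conditions first, then SEV2, and fall back to SEV3. The sketch below is illustrative only. The `Signals` class and its field names are hypothetical and do not correspond to any existing tooling.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical snapshot of incident signals; field names are illustrative."""
    customer_facing_outage: bool = False
    data_loss: bool = False
    security_incident: bool = False
    p95_latency_ms: float = 0.0
    baseline_p95_latency_ms: float = 0.0
    partial_tier_outage: bool = False


def classify_severity(s: Signals) -> str:
    """Map signals to SEV1/SEV2/SEV3 following the severity definitions."""
    if s.customer_facing_outage or s.data_loss or s.security_incident:
        return "SEV1"
    # SEV2 latency rule: P95 strictly more than 2x the baseline.
    latency_degraded = (
        s.baseline_p95_latency_ms > 0
        and s.p95_latency_ms > 2 * s.baseline_p95_latency_ms
    )
    if latency_degraded or s.partial_tier_outage:
        return "SEV2"
    return "SEV3"
```

When in doubt between two levels, the on-call engineer's judgment still applies; the function only encodes the written thresholds.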
Initial response (first 5 minutes)
- Acknowledge the page in PagerDuty.
- Open an incident channel: #inc-YYYYMMDD-shortdesc.
- Post the auto-generated incident card. Tag the affected service owner.
- Update the status page immediately for SEV1. For SEV2, update it only if the incident lasts longer than 15 min.
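Two of the steps above are mechanical enough to sketch in code: building the channel name and deciding whether the status page needs an update. Both function names are hypothetical, and the slug rule (lowercase, non-alphanumerics collapsed to hyphens) is an assumption about what "shortdesc" should look like.

```python
import re
from datetime import date


def incident_channel_name(opened_on: date, short_desc: str) -> str:
    """Build a channel name in the #inc-YYYYMMDD-shortdesc form."""
    # Assumed slug rule: lowercase, runs of other characters become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", short_desc.lower()).strip("-")
    return f"#inc-{opened_on:%Y%m%d}-{slug}"


def should_update_status_page(severity: str, minutes_elapsed: float) -> bool:
    """SEV1 always; SEV2 only once the incident has run past 15 minutes."""
    if severity == "SEV1":
        return True
    if severity == "SEV2":
        return minutes_elapsed > 15
    return False
```

For example, `incident_channel_name(date(2026, 2, 14), "Payments webhooks")` yields `#inc-20260214-payments-webhooks`.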
Don't restart pods on payments-svc
Hard rule: do not restart payments-svc pods without SRE approval. During the last incident (Feb 14, 2026), restarting payments-svc pods caused 30 minutes of dropped Stripe webhook retries. See post-mortem PM-2026-02-14.
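If restart tooling is scripted, a guard like the following could enforce the rule before any restart command runs. This is a minimal sketch: `check_restart_allowed`, `RestartBlocked`, and the `sre_approved` flag are hypothetical names, and how approval is actually recorded is not specified in this runbook.

```python
# Services that must not be restarted without SRE sign-off.
# Only payments-svc is named in this runbook.
RESTART_PROTECTED = {"payments-svc"}


class RestartBlocked(Exception):
    """Raised when a restart is attempted without the required approval."""


def check_restart_allowed(service: str, sre_approved: bool = False) -> None:
    """Raise RestartBlocked if the service is protected and not approved."""
    if service in RESTART_PROTECTED and not sre_approved:
        raise RestartBlocked(
            f"{service}: pod restarts require SRE approval (see PM-2026-02-14)"
        )
```

Calling the guard first and letting the exception abort the script keeps the failure loud rather than silently skipping the restart.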
Escalation tree
- Payments / billing: David Park (primary), Aisha Rahman (backup)
- Auth / identity: Marco Silva (primary), Jenny Liu (backup)
- Database / Postgres: David Park, then external (Neon support)
- Frontend / CDN: Jenny Liu
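The tree above maps naturally onto an ordered lookup table. The sketch below assumes short area keys (`payments`, `auth`, `database`, `frontend`) that are made up for illustration; the names and their order come from the list above.

```python
from typing import Optional

# Contacts per area, in the order they should be paged.
ESCALATION = {
    "payments": ["David Park", "Aisha Rahman"],
    "auth": ["Marco Silva", "Jenny Liu"],
    "database": ["David Park", "Neon support (external)"],
    "frontend": ["Jenny Liu"],
}


def next_contact(area: str, already_paged: int = 0) -> Optional[str]:
    """Return the next person to page, or None if the list is exhausted."""
    contacts = ESCALATION.get(area, [])
    if already_paged < len(contacts):
        return contacts[already_paged]
    return None
```

A `None` result means the listed contacts for that area have all been tried; the runbook does not define a further fallback.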
Communication
- Internal: #inc-* channel + auto-summary every 30 min.
- External (SEV1 only): status page + targeted email to affected accounts (template in Intercom: "OUTAGE-NOTIFY-2026").
- DO NOT post to Twitter/LinkedIn. The comms team owns all external communication; loop in @comms-oncall.
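To make the 30-minute internal summary cadence concrete, the sketch below lists when summaries would fall between the start of an incident and a given cutoff. It assumes the first summary is due 30 minutes after the incident opens, which the runbook does not state; `summary_times` is a hypothetical helper, not part of the auto-summary tooling.

```python
from datetime import datetime, timedelta
from typing import List

SUMMARY_INTERVAL = timedelta(minutes=30)


def summary_times(opened_at: datetime, until: datetime) -> List[datetime]:
    """List summary timestamps from opened_at up to and including until."""
    times = []
    # Assumption: first summary lands one full interval after opening.
    t = opened_at + SUMMARY_INTERVAL
    while t <= until:
        times.append(t)
        t += SUMMARY_INTERVAL
    return times
```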
Post-mortem
Required for all SEV1 and SEV2 incidents. Use the post-mortem template in this Notion workspace. Due within 72 hours of resolution. Use no-blame language only.
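The 72-hour deadline is easy to miscount across days and time zones, so a small helper can compute it from the resolution timestamp. This is a sketch under the assumption that resolution times are recorded as timezone-aware datetimes; `postmortem_due` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

POSTMORTEM_WINDOW = timedelta(hours=72)


def postmortem_due(severity: str, resolved_at: datetime) -> Optional[datetime]:
    """Return the post-mortem deadline, or None if none is required."""
    # Post-mortems are required for SEV1 and SEV2 only.
    if severity not in {"SEV1", "SEV2"}:
        return None
    return resolved_at + POSTMORTEM_WINDOW
```

For instance, an incident resolved on 2026-02-14 at 10:00 UTC would have its post-mortem due on 2026-02-17 at 10:00 UTC.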