On-Call Incident Response — Engineering

David Park · Staff SRE · updated 12d ago

Severity definitions

  • SEV1: Customer-facing outage, data loss, or security incident. Revenue impact.
  • SEV2: Major degradation (P95 latency > 2x baseline, or partial outage of one tier).
  • SEV3: Minor or cosmetic. No customer impact.
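
The definitions above can be read as a simple decision order: check SEV1 conditions first, then SEV2, and fall through to SEV3. The sketch below is illustrative only; the Signals fields and the classify_severity function are hypothetical names, not part of any tooling referenced in this runbook, and revenue impact is left to human judgment rather than encoded.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical incident signals; field names are assumptions for illustration."""
    customer_facing_outage: bool = False
    data_loss: bool = False
    security_incident: bool = False
    partial_tier_outage: bool = False
    p95_latency_ms: float = 0.0
    baseline_p95_ms: float = 0.0


def classify_severity(s: Signals) -> str:
    """Map observed signals to SEV1/SEV2/SEV3 following the definitions above."""
    # SEV1: customer-facing outage, data loss, or security incident.
    if s.customer_facing_outage or s.data_loss or s.security_incident:
        return "SEV1"
    # SEV2: P95 latency above 2x baseline, or partial outage of one tier.
    latency_degraded = (
        s.baseline_p95_ms > 0 and s.p95_latency_ms > 2 * s.baseline_p95_ms
    )
    if latency_degraded or s.partial_tier_outage:
        return "SEV2"
    # SEV3: minor or cosmetic, no customer impact.
    return "SEV3"
```

For example, classify_severity(Signals(p95_latency_ms=500, baseline_p95_ms=200)) yields "SEV2", because 500 ms exceeds twice the 200 ms baseline. When in doubt, the on-call engineer's judgment takes precedence over any automated classification.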

Initial response (first 5 minutes)

  1. Acknowledge the page in PagerDuty.
  2. Open an incident channel: #inc-YYYYMMDD-shortdesc.
  3. Post the auto-generated incident card. Tag the affected service owner.
  4. Update the status page for SEV1 only. Do not post a SEV2 to the status page unless it lasts longer than 15 minutes.
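
Two of the steps above are mechanical enough to sketch in code: building the channel name in the #inc-YYYYMMDD-shortdesc pattern, and deciding whether the status page applies. This is a minimal sketch under stated assumptions; incident_channel_name and should_update_status_page are hypothetical helpers, and the slug rules (lowercase, hyphens for anything that is not a letter or digit) are an assumption, not a documented convention.

```python
import re
from datetime import date, timedelta


def incident_channel_name(day: date, short_desc: str) -> str:
    """Build a channel name in the #inc-YYYYMMDD-shortdesc pattern."""
    # Assumed slug rule: lowercase, runs of non-alphanumerics become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", short_desc.lower()).strip("-")
    return f"#inc-{day:%Y%m%d}-{slug}"


def should_update_status_page(severity: str, duration: timedelta) -> bool:
    """Apply the status page rule: SEV1 always, SEV2 only past 15 minutes."""
    if severity == "SEV1":
        return True
    if severity == "SEV2":
        return duration > timedelta(minutes=15)
    return False
```

For example, incident_channel_name(date(2026, 2, 14), "Payments webhooks dropped") gives "#inc-20260214-payments-webhooks-dropped", and a SEV2 that has run for exactly 15 minutes does not yet qualify for the status page.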

Don't restart pods on payments-svc

Hard rule: do not restart payments-svc pods without SRE approval. In the last incident (Feb 14, 2026), restarting payments-svc pods caused 30 minutes of dropped Stripe webhook retries. See post-mortem PM-2026-02-14.

Escalation tree

  • Payments / billing: David Park (primary), Aisha Rahman (backup)
  • Auth / identity: Marco Silva (primary), Jenny Liu (backup)
  • Database / Postgres: David Park, then external (Neon support)
  • Frontend / CDN: Jenny Liu
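
The tree above can be kept as an ordered lookup so that tooling, or a tired human, can find who to contact next. The sketch below is an assumption about how such a lookup might look; the ESCALATION dictionary keys and the escalation_chain function are hypothetical, and the names are copied from the list above.

```python
# Ordered contacts per area, primary first. Keys are hypothetical short names.
ESCALATION = {
    "payments": ["David Park", "Aisha Rahman"],
    "auth": ["Marco Silva", "Jenny Liu"],
    "database": ["David Park", "Neon support (external)"],
    "frontend": ["Jenny Liu"],
}


def escalation_chain(area: str) -> list[str]:
    """Return the contacts for an area in the order they should be tried."""
    key = area.strip().lower()
    if key not in ESCALATION:
        raise ValueError(f"No escalation path defined for area: {area!r}")
    # Return a copy so callers cannot mutate the shared table.
    return list(ESCALATION[key])
```

For example, escalation_chain("Auth") returns Marco Silva followed by Jenny Liu. An unknown area raises an error instead of silently returning nobody, since a missing escalation path should be noticed.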

Communication

  • Internal: #inc-* channel + auto-summary every 30 min.
  • External (SEV1 only): status page + targeted email to affected accounts (template in Intercom: "OUTAGE-NOTIFY-2026").
  • DO NOT post to Twitter/LinkedIn. The comms team owns external communication. Loop in @comms-oncall.

Post-mortem

Required for all SEV1 and SEV2 incidents. Use the post-mortem template in this Notion. Due within 72 hours of resolution. Use no-blame language only.
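
The 72-hour deadline can be computed directly from the resolution time. This is a minimal sketch, assuming resolution timestamps are timezone-aware; postmortem_due is a hypothetical helper, not an existing tool.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def postmortem_due(severity: str, resolved_at: datetime) -> Optional[datetime]:
    """Return the post-mortem deadline, or None when none is required."""
    # Post-mortems are required for SEV1 and SEV2 only.
    if severity in {"SEV1", "SEV2"}:
        return resolved_at + timedelta(hours=72)
    return None
```

For example, an incident resolved at 12:00 UTC on Feb 14, 2026 has a post-mortem due by 12:00 UTC on Feb 17, 2026.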