On-Call Handoff
Build a structured handoff document for the incoming on-call engineer by collecting data from PagerDuty, Jira, and ArgoCD.
Instructions
Phase 1: Current Incident State (PagerDuty Agent)
- •Active incidents - all triggered and acknowledged incidents
- •Recently resolved incidents - resolved in the last 24 hours (may recur)
- •Current on-call schedule - who is on-call now, who is taking over
- •Upcoming maintenance windows - scheduled within the next 48 hours
Phase 2: Ongoing Issues (Jira Agent)
- •Open incident tickets - Jira issues labeled
incident,outage,p0,p1 - •Known issues - issues labeled
known-issueorworkaround - •Recently closed issues - resolved in last 48 hours that may have follow-ups
- •Pending changes - tickets in "Ready for Deploy" or "In Review" status
Phase 3: Environment State (ArgoCD Agent)
- •Recent deployments - applications synced in the last 48 hours
- •Unhealthy applications - any OutOfSync, Degraded, or Unknown apps
- •Pending syncs - applications with pending changes not yet deployed
- •Recent rollbacks - any applications that were rolled back
Phase 4: Compile Handoff Document
Organize all data into a structured, scannable document with clear action items.
Output Format
markdown
## On-Call Handoff Document **Date**: February 9, 2026 **Outgoing**: @engineer-a (Feb 3 - Feb 9) **Incoming**: @engineer-b (Feb 9 - Feb 16) --- ### Active Incidents (Action Required) | Incident | Service | Urgency | Duration | Status | |----------|---------|---------|----------|--------| | INC-789 | auth-service | High | 2h 15m | Acknowledged | **INC-789 Context**: Auth service intermittent 503 errors. Root cause suspected to be database connection pool exhaustion. DBA team engaged. Workaround: restart auth-service pods if error rate exceeds 10%. ### Recently Resolved (Watch For Recurrence) | Incident | Service | Resolved | Duration | Root Cause | |----------|---------|----------|----------|------------| | INC-785 | payment-api | Feb 8 18:00 | 45m | Memory leak in v2.3.1 | ### Known Issues & Workarounds 1. **AUTH-456**: Auth service connection pool - restart pods if needed (ETA fix: Feb 12) 2. **PLAT-789**: Flaky integration tests - ignore `test_streaming` failures (known issue) ### Recent Deployments (Last 48h) | Application | Version | Deployed | Status | |-------------|---------|----------|--------| | payment-api | v2.3.2 (hotfix) | Feb 8 19:00 | Healthy | | auth-service | v1.8.0 | Feb 7 14:00 | Degraded | ### Unhealthy Applications | Application | Sync Status | Health | Since | |-------------|-------------|--------|-------| | auth-service | Synced | Degraded | Feb 8 16:00 | ### Pending Changes (Not Yet Deployed) - **monitoring-stack**: Prometheus alerting rule updates (PR #234 approved) - **api-gateway**: Rate limiting config change (scheduled for Feb 10) ### Systems to Watch 1. **auth-service** - Connection pool issue ongoing, monitor error rates 2. **payment-api** - Hotfix deployed yesterday, watch for regression 3. **EKS cluster-prod** - Node scaling event expected during peak hours (10am-2pm) ### Escalation Contacts | Team | Primary | Secondary | |------|---------|-----------| | Platform | @engineer-c | @engineer-d | | DBA | @dba-primary | @dba-secondary | | Security | @sec-oncall | - |
Examples
- •"Generate an on-call handoff document"
- •"Prepare a shift handoff for the incoming on-call engineer"
- •"What should the next on-call person know about?"
- •"Summarize the current state of production for handoff"
Guidelines
- •Prioritize actionable information - what does the incoming engineer need to DO?
- •Include workarounds for known issues so the incoming engineer does not have to search
- •Mark items as "action required" vs "monitor" vs "FYI" for clear prioritization
- •Always include escalation contacts with team context
- •Keep the last 48 hours as the default lookback window for context
- •If no active incidents exist, say so explicitly (this is good news)
- •Include any scheduled maintenance that falls within the on-call period