Debugging Omnistrate Deployments
When to Use This Skill
- Instance deployments showing FAILED or DEPLOYING status
- Resources with unhealthy pod statuses or deployment errors
- Startup/readiness probe failures (HTTP 503, timeouts)
- Helm releases with unclear deployment states
- Need to identify root cause of deployment failures
Progressive Debugging Workflow
1. Get Deployment Status
Tool: mcp__omnistrate-platform__omnistrate-ctl_instance_describe
Flags: --deployment-status --output json
Extract:
- Overall instance status
- Resources with deployment errors or unhealthy pod statuses
- Focus subsequent analysis on problematic resources only
Key Benefit: Returns a concise status summary that uses far fewer tokens than a full describe.
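The extraction step above can be sketched in Python as a filter over the parsed JSON. This is a minimal sketch, not the confirmed output schema: the field names `status`, `resources`, `deploymentErrors`, and `podStatuses` are assumptions made for illustration and should be adjusted to whatever the real `--deployment-status` output contains.

```python
import json

# Hypothetical sample; the real --deployment-status schema may differ.
SAMPLE = """
{
  "status": "FAILED",
  "resources": {
    "postgres": {"deploymentErrors": "",
                 "podStatuses": {"postgres-0": "Running"}},
    "api": {"deploymentErrors": "readiness probe failed: HTTP 503",
            "podStatuses": {"api-0": "Running", "api-1": "CrashLoopBackOff"}}
  }
}
"""

# Pod states treated as healthy in this sketch (an assumption).
HEALTHY_POD_STATES = {"Running", "Succeeded"}


def problematic_resources(status: dict) -> dict:
    """Return only resources with a deployment error or an unhealthy pod."""
    problems = {}
    for name, info in status.get("resources", {}).items():
        errors = info.get("deploymentErrors") or ""
        bad_pods = {
            pod: state
            for pod, state in (info.get("podStatuses") or {}).items()
            if state not in HEALTHY_POD_STATES
        }
        if errors or bad_pods:
            problems[name] = {"errors": errors, "unhealthy_pods": bad_pods}
    return problems


if __name__ == "__main__":
    report = json.loads(SAMPLE)
    print(report["status"], problematic_resources(report))
```

Only the resources returned by `problematic_resources` need to be carried into the later steps.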
2. Identify Workflows
Tool: mcp__omnistrate-platform__omnistrate-ctl_workflow_list
Flags: --instance-id <id> --output json
Extract workflow IDs, types, and start/end times for failed deployments.
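As a rough illustration of this step, the sketch below picks failed workflows out of an already-parsed list and derives a padded time window for later time-bound queries. The record fields (`id`, `type`, `status`, `startTime`, `endTime`) and the ISO 8601 timestamp format are assumptions, not the documented output of the workflow list tool.

```python
from datetime import datetime, timedelta

# Hypothetical records; real workflow list output may use other field names.
WORKFLOWS = [
    {"id": "wf-1", "type": "Deploy", "status": "success",
     "startTime": "2024-05-01T10:00:00+00:00",
     "endTime": "2024-05-01T10:08:00+00:00"},
    {"id": "wf-2", "type": "Modify", "status": "failed",
     "startTime": "2024-05-02T09:30:00+00:00",
     "endTime": "2024-05-02T09:47:00+00:00"},
]


def failed_workflow_windows(workflows, padding_minutes=5):
    """Return (id, type, since, until) for failed workflows, newest first."""
    pad = timedelta(minutes=padding_minutes)
    rows = []
    for wf in workflows:
        if wf.get("status", "").lower() != "failed":
            continue
        start = datetime.fromisoformat(wf["startTime"])
        end = datetime.fromisoformat(wf["endTime"])
        # Widen the window slightly so events just outside the run are kept.
        rows.append((wf["id"], wf["type"], start - pad, end + pad))
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    for wf_id, wf_type, since, until in failed_workflow_windows(WORKFLOWS):
        print(wf_id, wf_type, since.isoformat(), until.isoformat())
```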
3. Analyze Workflow Events (Two-Phase)
Phase 1 - Summary (Always Start Here):
omctl workflow events <workflow-id> --service-id <id> --environment-id <id> --output json
Extract:
- All resources with workflow step status (failed/in-progress/success)
- Step duration analysis and event count patterns
- Identify specific failed/stuck steps
Phase 2 - Detail (Only for Failed Steps):
omctl workflow events <workflow-id> --service-id <id> --environment-id <id> \
  --resource-key <name> --step-types <type> --detail --output json
Use parameters:
- --resource-key: Target specific resource
- --step-types: Filter to specific step (Bootstrap, Compute, Deployment, Network, Storage, Monitoring)
- --detail: Include full event details (use sparingly)
- --since/--until: Time-bound queries
Extract from detail view:
- WorkflowStepDebug error messages
- VM allocation failures and constraints
- Pod scheduling issues
- Container readiness failures
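The two-phase query pattern can be captured in a small command builder, so that the summary call and the detail call differ only in the filters passed. This is a sketch that uses only the flags named in this guide; the function name `workflow_events_cmd` is hypothetical, and the expected value format for `--since`/`--until` is not specified here, so those values are passed through unchanged.

```python
import shlex

# Step types listed in this guide for --step-types.
STEP_TYPES = {"Bootstrap", "Compute", "Deployment",
              "Network", "Storage", "Monitoring"}


def workflow_events_cmd(workflow_id, service_id, environment_id,
                        resource_key=None, step_type=None, detail=False,
                        since=None, until=None):
    """Build the argv for `omctl workflow events` (hypothetical helper)."""
    if step_type is not None and step_type not in STEP_TYPES:
        raise ValueError(f"unknown step type: {step_type}")
    argv = ["omctl", "workflow", "events", workflow_id,
            "--service-id", service_id,
            "--environment-id", environment_id]
    if resource_key:
        argv += ["--resource-key", resource_key]
    if step_type:
        argv += ["--step-types", step_type]
    if detail:
        argv.append("--detail")
    if since:
        argv += ["--since", since]  # format assumed; passed through as-is
    if until:
        argv += ["--until", until]
    return argv + ["--output", "json"]


if __name__ == "__main__":
    # Phase 1: summary for the whole workflow.
    print(shlex.join(workflow_events_cmd("wf-2", "s-1", "e-1")))
    # Phase 2: detail for one failed step of one resource.
    print(shlex.join(workflow_events_cmd(
        "wf-2", "s-1", "e-1",
        resource_key="api", step_type="Deployment", detail=True)))
```

Keeping `detail=False` as the default mirrors the advice to start with the summary and request full event details only for failed steps.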
Pod Event Timeline: Create ASCII visualizations showing deployment progression:
HH:MM:SS ┬─── ✗ FailedScheduling
         │    pod/app-0: Insufficient memory
         │
HH:MM:SS ├─── ⚡ TriggeredScaleUp
         │    nodegroup-1: adding 2 nodes
         │
HH:MM:SS ├─── 📥 Pulling image:latest
         │    (duration: 2m15s)
         │
HH:MM:SS └─── ✅ Started
              3/3 pods Running
Symbols: ✗ failed, ✅ success, ⚡ autoscaler, 💾 storage, 📥 image, 🚀 runtime, ⚠️ warning
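A timeline in this style can be produced mechanically from a list of events. The sketch below is one possible renderer; the input shape (tuples of time, kind, title, detail) and the function name `render_timeline` are assumptions chosen for illustration, and the symbol table is copied from the legend above.

```python
# Symbol legend from this guide, keyed by a hypothetical "kind" label.
SYMBOLS = {"failed": "✗", "success": "✅", "autoscaler": "⚡",
           "storage": "💾", "image": "📥", "runtime": "🚀", "warning": "⚠️"}


def render_timeline(events):
    """Render (time, kind, title, detail) tuples as an ASCII timeline."""
    lines = []
    last = len(events) - 1
    for i, (time, kind, title, detail) in enumerate(events):
        # First event opens the rail, last event closes it.
        joint = "┬" if i == 0 else ("└" if i == last else "├")
        lines.append(f"{time} {joint}─── {SYMBOLS[kind]} {title}")
        rail = " " if i == last else "│"
        pad = " " * len(time)
        if detail:
            lines.append(f"{pad} {rail}    {detail}")
        if i != last:
            lines.append(f"{pad} {rail}")  # spacer between events
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_timeline([
        ("10:00:01", "failed", "FailedScheduling",
         "pod/app-0: Insufficient memory"),
        ("10:00:20", "autoscaler", "TriggeredScaleUp",
         "nodegroup-1: adding 2 nodes"),
        ("10:04:45", "success", "Started", "3/3 pods Running"),
    ]))
```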
4. Application-Level Investigation
When: A resource is stuck in DEPLOYING with probe failures, containers are Running but not Ready, or the previous steps produced no conclusive evidence
Tool: mcp__omnistrate-platform__omnistrate-ctl_deployment-cell_update-kubeconfig + kubectl
omctl deployment-cell update-kubeconfig <cell-id> --kubeconfig /tmp/kubeconfig
kubectl get pods -n <instance-id> --kubeconfig /tmp/kubeconfig
kubectl logs <pod-name> -c service -n <instance-id> --kubeconfig /tmp/kubeconfig --tail=50
Look for:
- Database connection failures
- Application syntax/runtime errors (Python SyntaxError, Java compilation errors)
- Service dependency failures
- Configuration issues
5. Helm-Specific Verification
When: Helm resources report conflicting status, application credentials are needed, or the deployment state is unclear
Tool: Same kubeconfig setup with --role cluster-admin + helm
omctl deployment-cell update-kubeconfig <cell-id> --kubeconfig /tmp/kubeconfig --role cluster-admin
helm list -n <instance-id> --kubeconfig /tmp/kubeconfig
helm status <release-name> -n <instance-id> --kubeconfig /tmp/kubeconfig
Extract:
- Release status (deployed/failed/pending)
- Revision number and last deployed time
- Application credentials from release notes
- Pod health ratio (Running vs Failed)
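The pod health ratio can be computed from a pod listing rather than counted by eye. The sketch below assumes the listing was captured with `kubectl get pods -n <instance-id> -o json` (the `-o json` flag is standard kubectl but is not part of the commands shown above) and reads the standard `items[].status.phase` field. Note that a phase of Running does not imply the pod is Ready, so probe failures still need the log inspection from the previous step.

```python
import json
from collections import Counter

# Trimmed-down sample of a Kubernetes PodList; real output has many more fields.
SAMPLE = json.dumps({"items": [
    {"metadata": {"name": "app-0"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "app-1"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "app-2"}, "status": {"phase": "Failed"}},
    {"metadata": {"name": "job-x"}, "status": {"phase": "Pending"}},
]})


def pod_health(pod_list: dict) -> dict:
    """Summarise pod phases from a parsed `kubectl get pods -o json` result."""
    phases = Counter(
        item.get("status", {}).get("phase", "Unknown")
        for item in pod_list.get("items", [])
    )
    return {
        "running": phases["Running"],
        "failed": phases["Failed"],
        "total": sum(phases.values()),
        "phases": dict(phases),
    }


if __name__ == "__main__":
    summary = pod_health(json.loads(SAMPLE))
    print(f"{summary['running']}/{summary['total']} Running, "
          f"{summary['failed']} Failed")
```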
6. Broader Context (If Needed)
Tool: mcp__omnistrate-platform__omnistrate-ctl_operations_events
Use time windows from workflow analysis, filter by relevant event types.
Common Failure Patterns
Infrastructure Constraints
- VM allocation failures with restrictive constraints (AZ + Network + Size)
- PersistentVolumeClaim not found
- Node taints/affinity issues
Container Lifecycle
- Back-off restarting failed container
- ProcessLiveness: UNHEALTHY
- Image pull failures
Probe Failures
- Startup/readiness probe HTTP 503
- Database connectivity timeouts
- Application syntax errors preventing startup
- Service dependency unavailability
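The three pattern groups above lend themselves to a first-pass classifier over event or log messages. The following is a rough sketch only: the regular expressions are illustrative guesses derived from the phrases listed in this section plus the common Kubernetes reasons `ErrImagePull` and `ImagePullBackOff`, and they will need tuning against real messages.

```python
import re

# Ordered (category, pattern) pairs; the first match wins.
# Patterns are assumptions derived from the lists in this section.
PATTERNS = [
    ("Infrastructure Constraints", re.compile(
        r"vm allocation|persistentvolumeclaim.*not found|\btaint|affinity",
        re.IGNORECASE)),
    ("Container Lifecycle", re.compile(
        r"back-off restarting failed container|processliveness: unhealthy"
        r"|errimagepull|imagepullbackoff|image pull",
        re.IGNORECASE)),
    ("Probe Failures", re.compile(
        r"probe failed|statuscode: 503|connection timed out|syntaxerror",
        re.IGNORECASE)),
]


def classify(message: str) -> str:
    """Return the first matching failure category, or 'Unclassified'."""
    for category, pattern in PATTERNS:
        if pattern.search(message):
            return category
    return "Unclassified"


if __name__ == "__main__":
    for msg in [
        "Back-off restarting failed container app in pod app-0",
        "Readiness probe failed: HTTP probe failed with statuscode: 503",
        'persistentvolumeclaim "data-app-0" not found',
    ]:
        print(f"{classify(msg):28} {msg}")
```

Messages that come back as "Unclassified" are the ones worth reading in full, since they fall outside the known patterns.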
Resource Prioritization
- Core infrastructure: databases, message queues, storage
- Application services: web servers, APIs
- Support services: monitoring, logging
Response Management
- Always use --output json
- If token limit exceeded: add more specific filters, use smaller time windows, target specific resources
- Provide analysis in template format (see Failure Analysis Template in OMNISTRATE_SRE_REFERENCE.md)
Reference
See OMNISTRATE_SRE_REFERENCE.md for:
- Detailed tool parameter documentation
- Complete failure analysis template
- Extended examples
- Tool alternatives table