Debugging Omnistrate Deployments

When to Use This Skill

•Instance deployments showing FAILED or DEPLOYING status
•Resources with unhealthy pod statuses or deployment errors
•Startup/readiness probe failures (HTTP 503, timeouts)
•Helm releases with unclear deployment states
•Need to identify root cause of deployment failures

Progressive Debugging Workflow

1. Get Deployment Status

Tool: mcp__omnistrate-platform__omnistrate-ctl_instance_describe Flags: --deployment-status --output json

Extract:

•Overall instance status
•Resources with deployment errors or unhealthy pod statuses
•Focus subsequent analysis on problematic resources only

Key Benefit: Returns concise status, significantly reduces token usage vs full describe

2. Identify Workflows

Tool: mcp__omnistrate-platform__omnistrate-ctl_workflow_list Flags: --instance-id <id> --output json

Extract workflow IDs, types, and start/end times for failed deployments.

3. Analyze Workflow Events (Two-Phase)

Phase 1 - Summary (Always Start Here):

bash

omctl workflow events <workflow-id> --service-id <id> --environment-id <id> --output json

Extract:

•All resources with workflow step status (failed/in-progress/success)
•Step duration analysis and event count patterns
•Identify specific failed/stuck steps

Phase 2 - Detail (Only for Failed Steps):

bash

omctl workflow events <workflow-id> --service-id <id> --environment-id <id> \
  --resource-key <name> --step-types <type> --detail --output json

Use parameters:

•--resource-key: Target specific resource
•--step-types: Filter to specific step (Bootstrap, Compute, Deployment, Network, Storage, Monitoring)
•--detail: Include full event details (use sparingly)
•--since/--until: Time-bound queries

Extract from detail view:

•WorkflowStepDebug error messages
•VM allocation failures and constraints
•Pod scheduling issues
•Container readiness failures

Pod Event Timeline: Create ASCII visualizations showing deployment progression:

code

HH:MM:SS ┬─── ✗ FailedScheduling
         │    pod/app-0: Insufficient memory
         │
HH:MM:SS ├─── ⚡ TriggeredScaleUp
         │    nodegroup-1: adding 2 nodes
         │
HH:MM:SS ├─── 📥 Pulling image:latest
         │    (duration: 2m15s)
         │
HH:MM:SS └─── ✅ Started
              3/3 pods Running

Symbols: ✗ failed, ✅ success, ⚡ autoscaler, 💾 storage, 📥 image, 🚀 runtime, ⚠️ warning

4. Application-Level Investigation

When: Resource DEPLOYING with probe failures, containers Running but not Ready, no conclusive evidence from previous steps

Tool: mcp__omnistrate-platform__omnistrate-ctl_deployment-cell_update-kubeconfig + kubectl

bash

omctl deployment-cell update-kubeconfig <cell-id> --kubeconfig /tmp/kubeconfig
kubectl get pods -n <instance-id> --kubeconfig /tmp/kubeconfig
kubectl logs <pod-name> -c service -n <instance-id> --kubeconfig /tmp/kubeconfig --tail=50

Look for:

•Database connection failures
•Application syntax/runtime errors (Python SyntaxError, Java compilation errors)
•Service dependency failures
•Configuration issues

5. Helm-Specific Verification

When: Helm resources with conflicting status, need application credentials, deployment state unclear

Tool: Same kubeconfig setup with --role cluster-admin + helm

bash

omctl deployment-cell update-kubeconfig <cell-id> --kubeconfig /tmp/kubeconfig --role cluster-admin
helm list -n <instance-id> --kubeconfig /tmp/kubeconfig
helm status <release-name> -n <instance-id> --kubeconfig /tmp/kubeconfig

Extract:

•Release status (deployed/failed/pending)
•Revision number and last deployed time
•Application credentials from release notes
•Pod health ratio (Running vs Failed)

6. Broader Context (If Needed)

Tool: mcp__omnistrate-platform__omnistrate-ctl_operations_events

Use time windows from workflow analysis, filter by relevant event types.

Common Failure Patterns

Infrastructure Constraints

•VM allocation failures with restrictive constraints (AZ + Network + Size)
•PersistentVolumeClaim not found
•Node taints/affinity issues

Container Lifecycle

•Back-off restarting failed container
•ProcessLiveness: UNHEALTHY
•Image pull failures

Probe Failures

•Startup/readiness probe HTTP 503
•Database connectivity timeouts
•Application syntax errors preventing startup
•Service dependency unavailability

Resource Prioritization

•Core infrastructure: databases, message queues, storage
•Application services: web servers, APIs
•Support services: monitoring, logging

Response Management

•Always use --output json
•If token limit exceeded: add more specific filters, use smaller time windows, target specific resources
•Provide analysis in template format (see Failure Analysis Template in OMNISTRATE_SRE_REFERENCE.md)

Reference

See OMNISTRATE_SRE_REFERENCE.md for:

•Detailed tool parameter documentation
•Complete failure analysis template
•Extended examples
•Tool alternatives table