Skill: ops-ci-fix
Autonomous diagnosis and repair of failing CI/CD pipelines. Scan GitHub Actions workflows, identify failure causes, and apply fixes. Trigger when CI is broken, tests fail in CI, or workflows are stuck.
Configuration
| Property | Value |
|---|---|
| Context | fork |
| Allowed tools | Read, Write, Edit, Bash, Glob, Grep |
| Keywords | ops, fix |
Detailed description
CI Fixer — CI/CD Diagnosis and Repair
Goal
Diagnose failing CI/CD pipelines, identify the root cause, and apply automatic fixes when safe.
Phase 1: Discovery and workflow state
Scan the workflows
# List recent runs
gh run list --limit 20 --json databaseId,status,conclusion,name,createdAt,headBranch
# Identify workflow files
ls -la .github/workflows/
Classify the state
| State | Criterion | Urgency |
|---|---|---|
| Failed | conclusion = failure | High |
| Stuck | status = in_progress for > 30 min | High |
| Cancelled | conclusion = cancelled (recurring) | Medium |
| Stale | No successful run for 7+ days | Low |
Check the runners (if self-hosted)
# Runner status
gh api repos/{owner}/{repo}/actions/runners --jq '.runners[] | {name, status, busy}'
Phase 2: Failure diagnosis
For each failing workflow:
2.1 Extract the logs
# Logs of the failed run
gh run view <run-id> --log-failed
2.2 Classify the cause
| Category | Patterns in logs | Typical fix |
|---|---|---|
| Test failure | FAIL, AssertionError, expect( | Fix the test or the code |
| Build error | error TS, SyntaxError, cannot find | Fix the compilation error |
| Dep install | npm ERR!, ERESOLVE, peer dep | Fix package.json / lockfile |
| Auth/secrets | 401, 403, secret not found | Check the configured secrets |
| Timeout | timed out, exceeded deadline | Increase timeout or optimize |
| Disk space | no space left, ENOSPC | Clean caches / reduce artifacts |
| Rate limit | rate limit, 429 | Add retry / space out the requests |
| Runner offline | no runner matching, offline | Check self-hosted runners |
| Flaky test | Sometimes passes, sometimes fails | Identify the flaky test, stabilize it |
| Config error | invalid workflow, syntax error | Fix the workflow YAML |
2.3 Distinguish local error vs CI-only
# Reproduce locally
npm test # or pytest, go test, etc.
npm run build
npm run lint
If it passes locally but fails in CI: environment issue (versions, secrets, cache).
Phase 3: Repair
Fix priority order (from safest to riskiest)
- Re-run: flaky workflows →
gh run rerun <run-id> - Fix config: invalid YAML → edit
.github/workflows/ - Fix deps: corrupted lockfile →
rm -rf node_modules package-lock.json && npm install - Fix tests: breaking test → identify and fix
- Fix build: compilation error → fix the source code
- Cancel stuck: stuck workflows →
gh run cancel <run-id>
Guardrails
IMPORTANT: In --dry-run mode, show the proposed actions WITHOUT executing them.
| Action | Safe | Confirmation required |
|---|---|---|
| Re-run a workflow | Yes | No |
| Cancel a stuck run | Yes | No |
| Fix workflow YAML | Medium | Show the diff first |
| Regenerate lockfile | Medium | Show the diff first |
| Modify source code | Risky | Yes — propose, do not apply without approval |
| Modify secrets | Risky | Never — guide the user |
Applying fixes
For each applicable fix:
- Identify the precise root cause (not the symptom)
- Propose the minimal fix
- Apply if safe, otherwise show and wait for confirmation
- Verify the fix: re-run the workflow or run the tests locally
Phase 4: Verification
After the fixes:
# Check that tests pass locally
npm test && npm run build && npm run lint
# If a workflow was re-run, check its status
gh run view <run-id> --json status,conclusion
Validation loop (max 2 iterations)
- Apply the fix
- Verify (local tests + re-run CI if possible)
- If still failing: re-diagnose with the new logs
- If 2 iterations fail: escalate with a detailed report
Phase 5: Report
# CI Fix Report — YYYY-MM-DD
## Workflows analyzed
| Workflow | Branch | Status before | Cause | Action | Status after |
|----------|---------|-------------|-------|--------|-------------|
| ci.yml | main | Failed | Test failure | Fix test | Passing |
| deploy.yml | main | Stuck | Timeout | Cancel + re-run | In progress |
## Fixes applied
1. [Fix 1]: description, modified file, reason
2. [Fix 2]: ...
## Manual actions required
- [ ] Configure the `DEPLOY_TOKEN` secret (expired)
- [ ] Update the self-hosted runner v2.x → v3.x
## Recommendations
- Add a cache for npm ci (would reduce time by 3 min)
- The `auth.spec.ts` test is flaky (3 failures out of 10 runs)
Rules
- ALWAYS diagnose before fixing (Phase 2 before Phase 3)
- NEVER modify secrets — guide the user
- NEVER force-push or modify git history
- ALWAYS show the diff of workflow modifications before applying
- When in doubt, propose the fix without applying it
- Follow the 3-failures rule: after 2 failed fix iterations, escalate
Automatic triggering
This skill is automatically activated when:
- The matching keywords are detected in the conversation
- The task context matches the skill's domain
Triggering examples
- "I want to ops..."
- "I want to fix..."
Context fork
Fork means the skill runs in an isolated context:
- Does not pollute the main conversation
- Results are returned cleanly
- Ideal for autonomous tasks