
Skill: ops-ci-fix


Autonomous diagnosis and repair of failing CI/CD pipelines. Scan GitHub Actions workflows, identify failure causes, and apply fixes. Trigger when CI is broken, tests fail in CI, or workflows are stuck.

Configuration

| Property | Value |
|----------|-------|
| Context | fork |
| Allowed tools | Read, Write, Edit, Bash, Glob, Grep |
| Keywords | ops, fix |

Detailed description

CI Fixer — CI/CD Diagnosis and Repair

Goal

Diagnose failing CI/CD pipelines, identify the root cause, and apply automatic fixes when safe.

Phase 1: Discovery and workflow state

Scan the workflows

# List recent runs
gh run list --limit 20 --json databaseId,status,conclusion,name,createdAt,headBranch

# Identify workflow files
ls -la .github/workflows/

Classify the state

| State | Criterion | Urgency |
|-------|-----------|---------|
| Failed | `conclusion = failure` | High |
| Stuck | `status = in_progress` for > 30 min | High |
| Cancelled | `conclusion = cancelled` (recurring) | Medium |
| Stale | No successful run for 7+ days | Low |
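
The state table can be sketched as a small shell function. This is only an illustration under stated assumptions: the name `classify_run_state` and its three arguments (status, conclusion, run age in minutes) are hypothetical and not part of the skill. The Stale state is left out because it depends on the history of successful runs rather than on a single run, and the "recurring" condition for cancelled runs is not checked.

```shell
# Hypothetical helper: map one run's fields to a state from the table above.
# Arguments: status, conclusion, minutes since the run started.
classify_run_state() (
  status=$1
  conclusion=$2
  age_min=$3
  if [ "$conclusion" = "failure" ]; then
    echo "Failed"
  elif [ "$status" = "in_progress" ] && [ "$age_min" -gt 30 ]; then
    echo "Stuck"
  elif [ "$conclusion" = "cancelled" ]; then
    # Assumption: the caller decides separately whether cancellations recur.
    echo "Cancelled"
  else
    echo "OK"
  fi
)
```

For example, `classify_run_state in_progress "" 45` prints `Stuck`, since the run has been in progress for more than 30 minutes.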

Check the runners (if self-hosted)

# Runner status
gh api repos/{owner}/{repo}/actions/runners --jq '.runners[] | {name, status, busy}'

Phase 2: Failure diagnosis

For each failing workflow:

2.1 Extract the logs

# Logs of the failed run
gh run view <run-id> --log-failed

2.2 Classify the cause

| Category | Patterns in logs | Typical fix |
|----------|------------------|-------------|
| Test failure | `FAIL`, `AssertionError`, `expect(` | Fix the test or the code |
| Build error | `error TS`, `SyntaxError`, `cannot find` | Fix the compilation error |
| Dep install | `npm ERR!`, `ERESOLVE`, `peer dep` | Fix package.json / lockfile |
| Auth/secrets | `401`, `403`, `secret not found` | Check the configured secrets |
| Timeout | `timed out`, `exceeded deadline` | Increase timeout or optimize |
| Disk space | `no space left`, `ENOSPC` | Clean caches / reduce artifacts |
| Rate limit | `rate limit`, `429` | Add retry / space out the requests |
| Runner offline | `no runner matching`, `offline` | Check self-hosted runners |
| Flaky test | Sometimes passes, sometimes fails | Identify the flaky test, stabilize it |
| Config error | `invalid workflow`, `syntax error` | Fix the workflow YAML |
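
The pattern table lends itself to a first-pass classifier over the failed-run logs. The sketch below is a hypothetical helper, not the skill's actual implementation: the name `classify_failure` and the order in which categories are checked are assumptions, and the flaky-test category is omitted because it cannot be detected from a single log.

```shell
# Hypothetical helper: read log text on stdin, print the first matching category.
# Patterns are copied from the table; the order of the checks is an assumption.
classify_failure() (
  log=$(cat)
  match() { printf '%s\n' "$log" | grep -Eiq -- "$1"; }
  if   match 'invalid workflow|syntax error';     then echo "Config error"
  elif match 'no runner matching|offline';        then echo "Runner offline"
  elif match 'no space left|ENOSPC';              then echo "Disk space"
  elif match 'rate limit|429';                    then echo "Rate limit"
  elif match '401|403|secret not found';          then echo "Auth/secrets"
  elif match 'npm ERR!|ERESOLVE|peer dep';        then echo "Dep install"
  elif match 'timed out|exceeded deadline';       then echo "Timeout"
  elif match 'error TS|SyntaxError|cannot find';  then echo "Build error"
  elif match 'FAIL|AssertionError|expect\(';      then echo "Test failure"
  else echo "Unknown"
  fi
)
```

It could be fed directly from the log command, for example `gh run view <run-id> --log-failed | classify_failure`. The matching is deliberately naive: bare numbers such as `429` or `403` can also appear in timestamps or durations, so the result is a hint to confirm by reading the log, not a verdict.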

2.3 Distinguish local error vs CI-only

# Reproduce locally
npm test # or pytest, go test, etc.
npm run build
npm run lint

If the commands pass locally but fail in CI, the cause is most likely environmental: tool or runtime versions, missing secrets, or a stale cache.
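
The local-versus-CI decision can be written down as a tiny helper. This is a minimal sketch under assumptions: the name `failure_scope` and its output labels are hypothetical, and it only compares the exit code of a local command with the conclusion reported by CI.

```shell
# Hypothetical helper: compare the local result with the CI conclusion.
# Arguments: exit code of the local command, conclusion reported by CI.
failure_scope() (
  local_exit=$1
  ci_conclusion=$2
  if [ "$ci_conclusion" != "failure" ]; then
    echo "ci-not-failing"
  elif [ "$local_exit" -eq 0 ]; then
    echo "ci-only"       # passes locally: suspect versions, secrets, cache
  else
    echo "reproducible"  # fails locally too: fix the test or the code
  fi
)
```

A possible use is `npm test; failure_scope $? failure`, which prints `ci-only` when the local run succeeds.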

Phase 3: Repair

Fix priority order (from safest to riskiest)

  1. Re-run: flaky workflows → gh run rerun <run-id>
  2. Cancel stuck: stuck workflows → gh run cancel <run-id>
  3. Fix config: invalid YAML → edit .github/workflows/
  4. Fix deps: corrupted lockfile → rm -rf node_modules package-lock.json && npm install
  5. Fix tests: failing test → identify and fix
  6. Fix build: compilation error → fix the source code

Guardrails

IMPORTANT: In --dry-run mode, show the proposed actions WITHOUT executing them.

| Action | Safe | Confirmation required |
|--------|------|-----------------------|
| Re-run a workflow | Yes | No |
| Cancel a stuck run | Yes | No |
| Fix workflow YAML | Medium | Show the diff first |
| Regenerate lockfile | Medium | Show the diff first |
| Modify source code | Risky | Yes: propose, do not apply without approval |
| Modify secrets | Risky | Never: guide the user |
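
One way to enforce these guardrails in a script is a dry-run wrapper plus a policy lookup. Both functions below are hypothetical sketches: the `DRY_RUN` environment variable stands in for the `--dry-run` mode, and the action keys (`rerun`, `cancel`, `workflow`, `lockfile`, `source`, `secrets`) are names invented here to mirror the table rows.

```shell
# Hypothetical wrapper: run a command, or only print it when DRY_RUN=1.
run_action() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '[dry-run] %s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical policy lookup mirroring the guardrails table.
confirmation_for() {
  case $1 in
    rerun|cancel)      echo "none" ;;
    workflow|lockfile) echo "show-diff" ;;
    source)            echo "approval" ;;
    secrets)           echo "never" ;;
    *)                 echo "approval" ;;  # assumption: unknown actions take the cautious path
  esac
}
```

With `DRY_RUN=1`, a call such as `run_action gh run rerun <run-id>` prints the command prefixed with `[dry-run]` instead of executing it.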

Applying fixes

For each applicable fix:

  1. Identify the precise root cause (not the symptom)
  2. Propose the minimal fix
  3. Apply if safe, otherwise show and wait for confirmation
  4. Verify the fix: re-run the workflow or run the tests locally

Phase 4: Verification

After the fixes:

# Check that tests pass locally
npm test && npm run build && npm run lint

# If a workflow was re-run, check its status
gh run view <run-id> --json status,conclusion

Validation loop (max 2 iterations)

  1. Apply the fix
  2. Verify (local tests + re-run CI if possible)
  3. If still failing: re-diagnose with the new logs
  4. If 2 iterations fail: escalate with a detailed report
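
The validation loop can be sketched as a bounded retry. This is an illustration only: `validate_fix` is a hypothetical name, and its two arguments are placeholder commands supplied by the caller for applying and verifying a fix.

```shell
# Hypothetical loop: $1 applies the fix, $2 verifies it (both caller-supplied).
# Exits 0 when verification passes, 1 after two failed iterations (escalate).
validate_fix() (
  apply_cmd=$1
  verify_cmd=$2
  attempt=1
  while [ "$attempt" -le 2 ]; do
    echo "iteration $attempt: applying fix"
    $apply_cmd
    if $verify_cmd; then
      echo "verified on iteration $attempt"
      exit 0
    fi
    echo "iteration $attempt failed: re-diagnose with the new logs"
    attempt=$((attempt + 1))
  done
  echo "escalate: 2 iterations failed"
  exit 1
)
```

The function body runs in a subshell, so `exit` only ends the loop and sets the function's return status. In practice the verify command might be `npm test`, and the re-diagnosis step would update the fix between iterations rather than reapplying the same one.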

Phase 5: Report

# CI Fix Report — YYYY-MM-DD

## Workflows analyzed
| Workflow | Branch | Status before | Cause | Action | Status after |
|----------|---------|-------------|-------|--------|-------------|
| ci.yml | main | Failed | Test failure | Fix test | Passing |
| deploy.yml | main | Stuck | Timeout | Cancel + re-run | In progress |

## Fixes applied
1. [Fix 1]: description, modified file, reason
2. [Fix 2]: ...

## Manual actions required
- [ ] Configure the `DEPLOY_TOKEN` secret (expired)
- [ ] Update the self-hosted runner v2.x → v3.x

## Recommendations
- Add a cache for npm ci (would reduce time by 3 min)
- The `auth.spec.ts` test is flaky (3 failures out of 10 runs)
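
Rows of the "Workflows analyzed" table can be produced mechanically. The helper below is a hypothetical sketch (the name `report_row` is an assumption) that formats six fields in the column order of the template.

```shell
# Hypothetical helper: print one row of the "Workflows analyzed" table.
# Expects exactly six arguments, one per column of the template.
report_row() {
  [ "$#" -eq 6 ] || { echo "report_row: expected 6 fields" >&2; return 1; }
  printf '| %s | %s | %s | %s | %s | %s |\n' "$@"
}
```

For example, `report_row ci.yml main Failed "Test failure" "Fix test" Passing` reproduces the first sample row of the template.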

Rules

  • ALWAYS diagnose before fixing (Phase 2 before Phase 3)
  • NEVER modify secrets — guide the user
  • NEVER force-push or modify git history
  • ALWAYS show the diff of workflow modifications before applying
  • When in doubt, propose the fix without applying it
  • Follow the 2-failures rule: after 2 failed fix iterations, stop and escalate

Automatic triggering

This skill is automatically activated when:

  • The matching keywords are detected in the conversation
  • The task context matches the skill's domain

Triggering examples

  • "I want to ops..."
  • "I want to fix..."

Context fork

Fork means the skill runs in an isolated context:

  • Does not pollute the main conversation
  • Results are returned cleanly
  • Ideal for autonomous tasks
