AI for DevOps — CI/CD, Infra & Monitoring
AI for GitHub Actions, Dockerfiles, infra code, and incident management
DevOps involves a lot of configuration: YAML for CI/CD pipelines, Dockerfiles, Kubernetes manifests, Terraform configs. Most of this is mechanical work that follows established patterns. AI tools are well suited to generating, reviewing, and debugging configuration files because their training data includes a large number of such files, though the output still needs review before it runs. This guide covers the key DevOps use cases.
What You'll Learn
- AI-generated GitHub Actions workflows (from description to working YAML)
- Dockerfile generation and optimization with Copilot
- Infrastructure as code with AI (Terraform, Kubernetes)
- Using Claude to debug CI failures
- AI for incident management and on-call support
AI-Generated GitHub Actions Workflows
GitHub Actions YAML is verbose and easy to get wrong. AI can remove much of the trial and error.
Prompt approach — describe your pipeline, get working YAML:
Write a GitHub Actions workflow for a Next.js 15 application.
Requirements:
- Trigger on push to main and pull_request to main
- Install Node.js 20
- Run npm ci (not npm install)
- Run TypeScript type check: npx tsc --noEmit
- Run ESLint: npm run lint
- Build: npm run build
- Only deploy if all checks pass AND the branch is main
- Deploy to Vercel using VERCEL_TOKEN secret
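Cache npm dependencies to speed up builds (cache the npm download cache rather than node_modules, because npm ci deletes node_modules before installing).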
Fail fast: if type check fails, skip lint and build.
Claude or Copilot will usually generate a complete workflow file that works with little or no modification. Review the caching strategy and the secret names; the secret names must match your actual repository secrets.
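Checking secret names can be partly automated. The sketch below is a minimal helper written for illustration, not part of GitHub's tooling: it scans workflow text for simple `${{ secrets.NAME }}` references and reports any name missing from a set you supply. The function name `missing_secrets` and the sample workflow snippet are hypothetical, and the set of configured names has to come from you (for example, copied from the repository settings page).

```python
import re

# Matches simple expressions such as ${{ secrets.VERCEL_TOKEN }} and captures
# the secret name. More complex expressions are not handled by this sketch.
SECRET_REF = re.compile(r"\$\{\{\s*secrets\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def missing_secrets(workflow_text: str, configured: set[str]) -> list[str]:
    """Return secret names referenced in the workflow but not in `configured`."""
    # GitHub treats secret names as case-insensitive, so compare in upper case.
    referenced = {name.upper() for name in SECRET_REF.findall(workflow_text)}
    known = {name.upper() for name in configured}
    return sorted(referenced - known)


if __name__ == "__main__":
    # Hypothetical workflow fragment; only VERCEL_TOKEN is configured.
    sample = """
    env:
      VERCEL_TOKEN: ${{ secrets.VERCEL_TOKEN }}
      VERCEL_ORG_ID: ${{ secrets.VERCEL_ORG_ID }}
    """
    print(missing_secrets(sample, {"VERCEL_TOKEN"}))
```

An empty result only means every simple reference is covered; it does not prove the secret values themselves are correct.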
Debugging a failing workflow:
My GitHub Actions workflow is failing with this error:
Error: Process completed with exit code 1.
npm ERR! code ENOENT
npm ERR! syscall open
npm ERR! path /home/runner/work/myapp/myapp/package.json
Here is my workflow YAML:
[paste workflow YAML]
What is wrong and how do I fix it?
Claude explains the issue (for this error, typically a missing checkout step or a wrong working directory, so package.json is not where npm expects it) and gives the corrected YAML.
🇮🇳 India Note: Indian development teams deploying to Azure often need workflows that authenticate with Azure using service principals. Claude can generate the full Azure login action configuration and explain the required application registration in Microsoft Entra ID (formerly Azure AD).
Dockerfile Generation and Optimization
Prompt for a new Dockerfile:
Write a production-ready Dockerfile for a FastAPI Python application.
Requirements:
- Python 3.12
- Multi-stage build (keep final image small)
- First stage: install dependencies
- Final stage: copy app code, set non-root user, expose port 8000
- Use gunicorn as the production server with uvicorn workers
- Dependencies in requirements.txt
- Health check endpoint: GET /health returns 200
The main app file is main.py, application object is `app`.
Optimizing an existing Dockerfile:
Review this Dockerfile for efficiency and security issues.
Focus on: image size, build cache optimization, security (running as root, exposed secrets, vulnerable base image).
Dockerfile:
[paste your Dockerfile]
Common improvements AI catches:
- Using a specific tag instead of `latest` (reproducible builds)
- Combining RUN commands to reduce layers
- Using `.dockerignore` to exclude unnecessary files
- Running as non-root user (security)
- Using distroless or slim base images to reduce attack surface
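Two of the items above (unpinned base images and running as root) are simple enough to check mechanically before asking for a fuller review. The following is a rough heuristic sketch, not a replacement for a real Dockerfile linter: the function name `dockerfile_findings` is hypothetical, line continuations and `ARG`-based image names are not handled, and a base image that already sets a non-root user will still be flagged.

```python
def dockerfile_findings(dockerfile_text: str) -> list[str]:
    """Flag unpinned base images and a final stage without a non-root USER."""
    findings = []
    stages = set()      # names declared with "FROM ... AS name"
    final_user = None   # last USER instruction seen in the current stage
    for raw in dockerfile_text.splitlines():
        parts = raw.split()
        if not parts or parts[0].startswith("#"):
            continue
        instruction = parts[0].upper()
        if instruction == "FROM":
            final_user = None  # every stage starts from its base image's user
            args = [p for p in parts[1:] if not p.startswith("--")]
            if not args:
                continue
            image = args[0]
            # Earlier stages, "scratch", and digest-pinned images are fine.
            if image.lower() not in stages and image != "scratch" and "@" not in image:
                tag = image.rsplit("/", 1)[-1].partition(":")[2]
                if tag in ("", "latest"):
                    findings.append(f"unpinned base image: {image}")
            if len(args) >= 3 and args[1].upper() == "AS":
                stages.add(args[2].lower())
        elif instruction == "USER" and len(parts) > 1:
            final_user = parts[1]
    if final_user is None or final_user.split(":")[0] in ("root", "0"):
        findings.append("no non-root USER in final stage")
    return findings


if __name__ == "__main__":
    print(dockerfile_findings("FROM python\nRUN pip install fastapi\n"))
```

Anything this flags is a starting point for the review prompt above, not a verdict.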
Infrastructure as Code with AI
Terraform for Indian cloud deployments:
Write Terraform configuration to deploy a Next.js app on Azure App Service.
Requirements:
- Region: Central India (centralindia)
- App Service Plan: B1 tier (cost-effective for staging)
- App Service: Linux, Node.js 20
- Environment variables from Azure Key Vault
- Custom domain support
- Resource naming: {project}-{environment}-{resource} (e.g., myapp-prod-webapp)
Include: resource group, app service plan, app service, key vault reference.
Use azurerm provider.
Kubernetes manifests:
Generate Kubernetes deployment and service manifests for a Python API.
App details:
- Image: myregistry.azurecr.io/myapp:latest
- Port: 8000
- Environment variables: DATABASE_URL, JWT_SECRET (from Kubernetes secret)
- Health check: GET /health
- Resource limits: 256Mi memory, 250m CPU
- 2 replicas minimum
- Service type: ClusterIP (internal only, behind an ingress)
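When reviewing what the model returns for this prompt, it helps to know roughly which fields to expect. The sketch below builds the two objects as plain Python dictionaries and prints them as JSON, which kubectl accepts as well as YAML. It is one possible reading of the requirements, not the output of any tool: the names `myapp` and `myapp-secrets` and the assumption that secret keys match the variable names are illustrative, and "2 replicas minimum" is interpreted as a fixed `replicas: 2` with no autoscaler.

```python
import json


def build_manifests(image: str, name: str = "myapp", secret: str = "myapp-secrets") -> dict:
    """Build a Deployment and a ClusterIP Service wrapped in a List object."""
    labels = {"app": name}
    probe = {"httpGet": {"path": "/health", "port": 8000}}
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": 8000}],
        # Both variables come from one Secret; key names are assumed to
        # equal the environment variable names.
        "env": [
            {"name": var, "valueFrom": {"secretKeyRef": {"name": secret, "key": var}}}
            for var in ("DATABASE_URL", "JWT_SECRET")
        ],
        "resources": {"limits": {"memory": "256Mi", "cpu": "250m"}},
        "readinessProbe": probe,
        "livenessProbe": probe,
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": 2,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "ClusterIP",
            "selector": labels,  # must match the pod template labels
            "ports": [{"port": 8000, "targetPort": 8000}],
        },
    }
    return {"apiVersion": "v1", "kind": "List", "items": [deployment, service]}


if __name__ == "__main__":
    print(json.dumps(build_manifests("myregistry.azurecr.io/myapp:latest"), indent=2))
```

The printed JSON can be piped into `kubectl apply --dry-run=client -f -` to check it against the client-side schema before comparing it with the generated YAML.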
Debugging CI Failures with Claude
CI failure messages are often cryptic. Claude is good at decoding them because most failures follow common, widely documented error patterns.
The best CI debugging prompt:
My GitHub Actions CI is failing. Here is the complete output from the failing step:
[paste the FULL output, not just the error line — the context before the error is often more important]
My workflow file: [paste relevant parts]
What is causing the failure and what do I need to change?
Common CI issues Claude handles well:
- Dependency installation failures (package version conflicts)
- Environment variable not found (secret not configured or wrong name)
- Test failures due to missing test fixtures or environment setup
- Build failures from TypeScript or ESLint errors
- Docker build failures (missing files, wrong COPY paths)
- Deployment failures (authentication, wrong resource names)
Tip: Always paste the full log output, not just the error line. The reported error often appears several lines after the underlying cause.
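If you debug CI failures often, the prompt can be assembled by a small script so the full log is never trimmed by hand. This is a minimal sketch: `build_ci_prompt` is a hypothetical helper, and the sample log reuses the ENOENT error shown earlier in this guide. In practice you would read both strings from files.

```python
def build_ci_prompt(log_text: str, workflow_text: str) -> str:
    """Fill the CI debugging prompt with the full step output and workflow."""
    return (
        "My GitHub Actions CI is failing. "
        "Here is the complete output from the failing step:\n\n"
        f"{log_text.strip()}\n\n"
        "My workflow file:\n\n"
        f"{workflow_text.strip()}\n\n"
        "What is causing the failure and what do I need to change?"
    )


if __name__ == "__main__":
    # Inline samples for illustration only.
    sample_log = (
        "npm ERR! code ENOENT\n"
        "npm ERR! syscall open\n"
        "npm ERR! path /home/runner/work/myapp/myapp/package.json\n"
        "Error: Process completed with exit code 1.\n"
    )
    sample_workflow = "steps:\n  - run: npm ci\n"
    print(build_ci_prompt(sample_log, sample_workflow))
```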
AI for Incident Management and On-Call
When something breaks in production at 2 AM, AI can help:
Analyzing error logs:
Production is showing 500 errors. Here are the last 50 error log entries:
[paste log output]
The service is a FastAPI Python app connected to PostgreSQL.
What pattern do you see in these errors? What is most likely causing them?
What should I check first?
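Before pasting log entries, it can help to group them yourself so the dominant pattern is visible in the prompt. The sketch below is one simple way to do that, assuming entries that differ only in numbers (IDs, durations, ports) belong to the same pattern. The helper name `summarize_errors` and the sample lines are hypothetical.

```python
import re
from collections import Counter


def summarize_errors(lines: list[str], top: int = 5) -> list[tuple[str, int]]:
    """Group log lines that differ only in numbers and count each group."""
    counts = Counter(
        re.sub(r"\d+", "<n>", line.strip())  # replace every digit run
        for line in lines
        if line.strip()                       # skip blank lines
    )
    return counts.most_common(top)


if __name__ == "__main__":
    # Made-up entries for illustration.
    sample = [
        "timeout after 30s on conn 12",
        "timeout after 31s on conn 7",
        "KeyError: 'user'",
    ]
    for pattern, count in summarize_errors(sample):
        print(count, pattern)
```

Including the counts alongside the raw entries gives the model both the pattern and the evidence for it.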
Interpreting monitoring alerts:
I received this Datadog alert:
"CPU usage on production-api-01 is at 97% for the last 10 minutes"
The application is a Node.js Express API.
What are the most common causes of this pattern and what do I investigate first?
How do I safely reduce load while debugging?
Writing incident postmortems:
Write a clear incident postmortem for the following event.
Timeline:
- 14:30: Deployment of v2.3.1 completed successfully
- 14:45: First user reports of slow page loads
- 15:00: Error rate reaches 25%, team alerted
- 15:15: Identified cause: new database query missing index
- 15:30: Hotfix deployed, index added
- 15:35: System recovered
Cause: Missing database index on orders.user_id column after schema change.
Impact: 30% of users affected, 50 minutes degraded service (14:45 to 15:35).
Write a blameless postmortem with: summary, timeline, root cause, contributing factors, and 3 action items.
💰 Free Deal: Claude's free tier handles incident analysis well. Keep a browser tab open to claude.ai during your on-call rotation — being able to ask "what does this error mean?" at 3 AM without a paid subscription is genuinely valuable.
Tips for AI DevOps Usage
Always validate generated configurations. Run a linter or dry run before deploying: `terraform plan`, `kubectl apply --dry-run=client`, and YAML validators catch many of the mistakes that appear in AI-generated configs.
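One way to make validation routine is to map each generated file to its dry-run command. The sketch below only builds the command; it does not execute anything. It assumes every `.yaml` or `.yml` file passed in is a Kubernetes manifest and that the Terraform directory has already been initialized. The helper name `validation_command` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def validation_command(path: str) -> Optional[list[str]]:
    """Return a dry-run command for a config file, or None if unrecognized."""
    p = Path(path)
    if p.suffix == ".tf":
        # Terraform operates on a directory, so point it at the file's parent.
        return ["terraform", f"-chdir={p.parent}", "plan"]
    if p.suffix in (".yaml", ".yml"):
        # Assumption: the YAML file is a Kubernetes manifest.
        return ["kubectl", "apply", "--dry-run=client", "-f", str(p)]
    return None


if __name__ == "__main__":
    for name in ("infra/main.tf", "k8s/deployment.yaml", "README.md"):
        print(name, "->", validation_command(name))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` in a pre-commit hook or CI step.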
Pin versions explicitly. AI-generated configs often use latest tags. Always replace with specific version numbers for reproducibility.
Review security considerations. AI generates functional configurations but may not always follow security best practices. Always review for: exposed secrets, overly permissive IAM roles, unencrypted storage, missing network policies.
Build a prompt library. Save prompts that generated good results for your specific stack. Reuse them for similar tasks.
Official Resources
- GitHub Actions Documentation — Full Actions reference
- Docker Documentation — Dockerfile best practices
- Terraform Documentation — IaC reference
- Azure DevOps India — Azure DevOps for Indian teams
- CNCF Cloud Native Landscape — DevOps tool landscape reference