AI for DevOps, Infra & Observability 2026
Terraform/Ansible generation, k8s manifests, PagerDuty + Claude incident chat, SRE runbooks
Last updated: April 19, 2026
SRE and DevOps work is mostly configuration, logs, and decisions under time pressure. AI is uniquely good at the first two and useful (with guardrails) for the third. This guide shows how Indian platform teams are using Claude Opus 4.7, Cursor Agent Mode, and the Claude Agent SDK across the full DevOps surface — IaC generation, Kubernetes, incident command, log analysis, and runbooks.
If you are new to AI for DevOps, start with AI for DevOps 2026. This guide goes deeper on infra and observability specifically.
Key Takeaways
- IaC generation is the easy win — Opus 4.7 writes accurate Terraform, Ansible, and Pulumi with a good spec.
- Kubernetes manifests generate cleanly when you provide cluster context (labels, namespace, storage class).
- PagerDuty + Claude Agent SDK lets you wire AI into incident response without giving it kubectl apply on prod.
- Log analysis is where AI pays back on-call cost most: summarise, correlate, rank hypotheses.
- Runbooks should be generated from post-incident reviews, stored as Markdown, and indexed by your agent.
The AI-DevOps Stack
+------------------------------------------+
|     Incident channel (Slack / Teams)     |
+------------------------------------------+
        |                      ^
        v                      |
+----------------+   +--------------------+
| PagerDuty hook |   | Claude Agent SDK   |
+----------------+   | (read-only tools)  |
        |            +--------------------+
        v                      ^
+------------------+           |
| Metrics / Logs   |-----------+
| (Prom, Loki, DD) |
+------------------+
        ^
        |
+------------------+
| IaC repo         |  Terraform, Ansible, k8s
| (Git + OPA)      |
+------------------+
The key architectural rule: agents read widely, write narrowly. Read-only access to metrics, logs, and the IaC repo. Write access only to a proposal channel or a PR, never direct to production.
Part 1: IaC Generation with Opus 4.7
Terraform
Terraform is verbose and opinionated. AI does well when you give it one resource at a time with explicit context.
# Cursor Agent Mode prompt
claude-code --effort high "Create a Terraform module for an AWS
RDS PostgreSQL 16 instance in ap-south-1 (Mumbai). Requirements:
- Multi-AZ for production, single AZ for staging (via var.env)
- 200 GB gp3 storage, encrypted with customer-managed KMS
- Private subnet only, security group allows 5432 from ECS task role
- Backup retention 30 days prod, 7 days staging
- Output the connection string format (no credentials)
Use Terraform 1.9+ syntax with locals for tags."
Expected output is a module with main.tf, variables.tf, outputs.tf, and versions.tf. Run:
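terraform fmt && terraform validate && terraform plan -out=plan.tfplan
# Then OPA/Sentinel policy check. OPA needs JSON input, so convert the binary plan first.
terraform show -json plan.tfplan > plan.json
# "data.terraform.deny" is an assumed query; use the package and rule names from your own policies/
opa eval -i plan.json -d policies/ "data.terraform.deny"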
Never skip the policy check on AI-generated IaC.
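To make the policy step concrete, here is a minimal Python sketch of the kind of rule such a check encodes. It assumes the plan has been exported with terraform show -json, whose resource_changes entries carry address, type, and change.actions / change.after. The two rules (no destroys, RDS storage must be encrypted) and the name check_plan are illustrative assumptions, not a policy set from this guide, and the sketch complements OPA or Sentinel rather than replacing them.

```python
def check_plan(plan: dict) -> list[str]:
    """Return human-readable violations found in a Terraform plan (JSON form)."""
    violations = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        actions = change.get("actions", [])
        after = change.get("after") or {}  # "after" is null for pure deletes

        # Rule 1 (assumed): AI-generated plans must not destroy anything.
        if "delete" in actions:
            violations.append(f"{rc['address']}: destroy not allowed")

        # Rule 2 (assumed): new RDS instances must have storage encryption on.
        if rc.get("type") == "aws_db_instance" and "create" in actions:
            if not after.get("storage_encrypted", False):
                violations.append(f"{rc['address']}: storage_encrypted must be true")
    return violations


# Hypothetical plan fragment, shaped like `terraform show -json` output.
sample_plan = {
    "resource_changes": [
        {
            "address": "aws_db_instance.main",
            "type": "aws_db_instance",
            "change": {"actions": ["create"], "after": {"storage_encrypted": False}},
        },
        {
            "address": "aws_s3_bucket.old",
            "type": "aws_s3_bucket",
            "change": {"actions": ["delete"], "after": None},
        },
    ]
}

for violation in check_plan(sample_plan):
    print(violation)
```

In CI, a non-empty result would fail the job before anyone can run terraform apply.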
Ansible
Ansible playbooks generate well because the YAML structure is constrained.
# Prompt: "Write an Ansible playbook to harden a fresh Ubuntu 24.04 LTS VM
# on AWS ap-south-1 — CIS Level 1 baseline, fail2ban, automatic security
# updates, UFW with SSH-only, audit logging to CloudWatch."
# AI output (excerpt):
- name: Harden Ubuntu 24.04 baseline
  hosts: all
  become: true
  tasks:
    - name: Install security packages
      apt:
        name:
          - fail2ban
          - unattended-upgrades
          - auditd
          - ufw
        state: present
        update_cache: yes
    - name: Configure UFW default deny incoming
      ufw:
        default: deny
        direction: incoming
    - name: Allow SSH from trusted CIDR
      ufw:
        rule: allow
        port: 22
        proto: tcp
        src: "{{ trusted_cidr }}"
    - name: Enable UFW
      ufw:
        state: enabled
        logging: low
Always test on a disposable VM first. Cursor's agent mode can do this in a sandbox if you give it ephemeral AWS credentials.
Pulumi
Pulumi (TypeScript or Python) produces cleaner AI output than Terraform because the model can use real type hints. Example:
// Prompt: "Pulumi TypeScript for a three-tier app: ALB -> ECS Fargate
// -> RDS Postgres, all in a dedicated VPC in ap-south-1."
import * as aws from "@pulumi/aws";
import * as awsx from "@pulumi/awsx";

const vpc = new awsx.ec2.Vpc("app-vpc", {
  cidrBlock: "10.20.0.0/16",
  numberOfAvailabilityZones: 2,
  subnetStrategy: "Auto",
  tags: { Project: "app", Env: "prod" },
});

const cluster = new aws.ecs.Cluster("app-cluster", {
  settings: [{ name: "containerInsights", value: "enabled" }],
});
// ...ALB, service, DB follow
Part 2: Kubernetes Manifest Generation
Give the agent your cluster context file first:
# .cluster-context.yaml (check this into your infra repo)
cluster:
  name: prod-ap-south-1
  region: ap-south-1
  kubernetes: 1.31
namespaces:
  default_labels:
    team: platform
    cost_center: engineering
storage_classes:
  default: gp3-encrypted
  fast: io2-encrypted
ingress:
  class: nginx
  cert_manager: letsencrypt-prod
defaults:
  resource_limits:
    cpu: 500m
    memory: 512Mi
  security_context:
    runAsNonRoot: true
    readOnlyRootFilesystem: true
Then:
claude-code "Generate a k8s Deployment, Service, and Ingress for a
Next.js 15 app. Image: ghcr.io/org/app:v1.2.3. Port: 3000.
Replicas: 3 with HPA 3-10 on 70% CPU.
Host: app.company.in. Use cluster defaults from .cluster-context.yaml."
The agent produces manifests that respect your defaults. Run:
kubectl apply --dry-run=client -f generated/ && \
  kubeconform -summary generated/ && \
  kube-score score generated/*.yaml   # kube-score takes files, not a directory
All three should be part of your CI pipeline before any AI-generated manifest reaches the cluster.
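Beyond the generic linters, it can help to check that a generated manifest really picked up the defaults from .cluster-context.yaml. The following is a minimal sketch under stated assumptions: the manifest has already been parsed into a dict (for example with a YAML library), only container-level securityContext is inspected (pod-level settings are ignored), and the function name check_deployment is hypothetical.

```python
def check_deployment(manifest: dict) -> list[str]:
    """Flag containers that ignore the cluster's security and resource defaults."""
    problems = []
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")

        # Security defaults from the cluster context file.
        security = container.get("securityContext", {})
        if security.get("runAsNonRoot") is not True:
            problems.append(f"{name}: runAsNonRoot must be true")
        if security.get("readOnlyRootFilesystem") is not True:
            problems.append(f"{name}: readOnlyRootFilesystem must be true")

        # Every container needs explicit CPU and memory limits.
        limits = container.get("resources", {}).get("limits", {})
        for key in ("cpu", "memory"):
            if key not in limits:
                problems.append(f"{name}: missing resources.limits.{key}")
    return problems


# Hypothetical Deployment that respects the defaults.
good = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [{
        "name": "app",
        "securityContext": {"runAsNonRoot": True, "readOnlyRootFilesystem": True},
        "resources": {"limits": {"cpu": "500m", "memory": "512Mi"}},
    }]}}},
}
print(check_deployment(good))
```

An empty list means the container settings match the defaults; anything else is a reason to send the manifest back to the agent with the listed problems.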
Part 3: Incident Response with PagerDuty + Claude
The pattern: PagerDuty webhook fires a serverless function, which spins up a Claude agent with read-only tools and posts its findings back to the incident channel.
Webhook handler
# Lambda / Cloud Run handler
# NOTE: the ClaudeAgent class and its parameters are shown as this guide uses them;
# verify the names against the SDK version you install.
import os
from claude_agent_sdk import ClaudeAgent
from slack_sdk import WebClient

# The four tool objects and format_incident_blocks are assumed to be defined
# elsewhere in your codebase.

async def handle_pagerduty(event):
    incident = event["incident"]
    agent = ClaudeAgent(
        model="claude-opus-4-7",
        tools=[
            prometheus_query_tool,   # read-only
            loki_log_search_tool,    # read-only
            github_recent_deploys,   # read-only
            k8s_describe_pods,       # read-only; NO apply/exec
        ],
        approval_policy={"any_write": "always"},  # block writes
    )
    summary = await agent.run(
        task=f"""Incident: {incident['title']}
Service: {incident['service']['summary']}
Urgency: {incident['urgency']}

1. Pull the last 15 minutes of metrics and logs.
2. List deploys in the last 6 hours.
3. Correlate. Rank the top 3 likely causes.
4. Draft a 3-line status for the incident channel.
5. Link the relevant runbook if one exists.

Do NOT propose commands to run. Humans execute.""",
        max_steps=30,
    )
    slack = WebClient(token=os.environ["SLACK_TOKEN"])
    slack.chat_postMessage(
        # Assumes the event was enriched with a Slack channel ID upstream
        channel=incident["channel"],
        text=summary,
        blocks=format_incident_blocks(summary),
    )
In practice this can cut the time to identify a root cause by 30-50% on familiar incident types. It is of little use on novel incidents, but that is what humans are for.
What NOT to give the incident agent
- Write access to production: never.
- kubectl exec or SSH: no.
- Ability to post to customer-facing status pages without human approval.
- Credentials for any system it can change.
Read-only is the whole guardrail.
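One way to enforce the list above inside a tool wrapper is a verb allowlist that fails closed. This is a minimal sketch, not part of any SDK: the allowlist contents and the function name is_allowed_kubectl are assumptions, and it should sit on top of a read-only RBAC role for the agent's service account, not replace it.

```python
# Assumed allowlist of kubectl verbs that only read cluster state.
READ_ONLY_VERBS = frozenset({"get", "describe", "logs", "top"})


def is_allowed_kubectl(argv: list[str]) -> bool:
    """Allow only kubectl invocations whose verb is on the read-only allowlist."""
    if len(argv) < 2 or argv[0] != "kubectl":
        return False
    # Treat the first non-flag argument as the verb. A flag value that comes
    # before the verb (e.g. "-n prod get pods") is therefore rejected, which
    # is the safe direction to be wrong in.
    verb = next((arg for arg in argv[1:] if not arg.startswith("-")), None)
    return verb in READ_ONLY_VERBS


print(is_allowed_kubectl(["kubectl", "describe", "pod", "api-1"]))      # True
print(is_allowed_kubectl(["kubectl", "exec", "-it", "api-1", "--", "sh"]))  # False
print(is_allowed_kubectl(["kubectl", "apply", "-f", "fix.yaml"]))       # False
```

The check is deliberately strict: anything it cannot classify is denied, so a new or unusual command shape never slips through as a write.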
Part 4: Log Analysis
AI is strongest on logs when you narrow the question.
# Weak prompt
"Why is the app slow?"
# Strong prompt
"Given these 5 minutes of Loki logs for service=api,
identify the top 3 error patterns by frequency, the
likely endpoint causing each, and the first timestamp
where error rate crossed 1%. Do not speculate on causes."
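The strong prompt asks for things that can be computed deterministically, which gives you a baseline to check the model's answer against. Here is a minimal sketch, assuming log records arrive as (unix_timestamp, level, message) tuples and that "error rate" means errors divided by total lines per one-minute bucket; both the record shape and the function names are assumptions for illustration.

```python
import re
from collections import Counter, defaultdict


def normalise(message: str) -> str:
    """Collapse numbers and hex ids so similar errors group together."""
    return re.sub(r"\b(?:0x[0-9a-f]+|\d+)\b", "<n>", message.lower())


def top_error_patterns(records, k=3):
    """Return the k most frequent normalised error messages with their counts."""
    counts = Counter(normalise(msg) for _, level, msg in records if level == "error")
    return counts.most_common(k)


def first_breach(records, threshold=0.01, bucket_seconds=60):
    """Return the start of the first time bucket whose error rate exceeds threshold."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ts, level, _ in records:
        bucket = ts - ts % bucket_seconds
        totals[bucket] += 1
        if level == "error":
            errors[bucket] += 1
    for bucket in sorted(totals):
        if errors[bucket] / totals[bucket] > threshold:
            return bucket
    return None  # the threshold was never crossed
```

Feeding these results into the prompt alongside the raw logs keeps the model's job narrow: explain the patterns, not count them.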
With the Claude Agent SDK memory tool, you can teach the agent your known log patterns across sessions:
# First-time setup — the agent learns and remembers
await agent.run(task="""
Remember these log patterns for service=api:
- 'pool exhausted' usually means RDS connection leak in /orders endpoint
- 'DEADLINE_EXCEEDED' usually means Redis eviction under load
- 'circuit open' is the Kong upstream to /payments
Store these under memory key 'logs/api/known_patterns'.
""")
# Subsequent sessions benefit without re-teaching
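The same known patterns can also live in a plain lookup table, so a deterministic pre-pass tags matching log lines before the agent sees them. This is a minimal sketch: the dictionary simply mirrors the three patterns from the prompt above, and match_known_patterns is a hypothetical helper, not an SDK feature.

```python
# Pattern -> usual cause, copied from the memory prompt above.
KNOWN_PATTERNS = {
    "pool exhausted": "RDS connection leak in /orders endpoint",
    "DEADLINE_EXCEEDED": "Redis eviction under load",
    "circuit open": "Kong upstream to /payments",
}


def match_known_patterns(lines):
    """Return {pattern: (usual_cause, hit_count)} for known patterns seen in lines."""
    hits = {}
    for line in lines:
        for pattern, cause in KNOWN_PATTERNS.items():
            if pattern in line:
                _, count = hits.get(pattern, (cause, 0))
                hits[pattern] = (cause, count + 1)
    return hits


sample = [
    "2026-04-18T10:01:02Z ERROR pool exhausted after 30s",
    "2026-04-18T10:01:03Z ERROR pool exhausted after 30s",
    "2026-04-18T10:01:04Z WARN circuit open for upstream",
]
print(match_known_patterns(sample))
```

Keeping the table in the repo means the patterns are reviewed like any other change, and the agent's memory stays a convenience, not the only copy.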
Part 5: Runbooks
Runbooks should be generated from post-incident reviews, not written from scratch. After every P1/P2 incident:
claude-code "Read incident-2026-04-18-api-outage.md and the
Slack timeline attached. Produce a runbook following our
template in docs/runbooks/TEMPLATE.md. Include:
- Symptom fingerprint (3-5 log patterns)
- First-response checks (metrics dashboards, recent deploys)
- Escalation criteria
- Rollback steps
- Mitigation steps with explicit commands
- Post-incident cleanup"
Then store it in a Markdown-indexed repo. The incident agent from Part 3 can search it in future incidents.
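What "indexed" means can stay very simple at first. The sketch below ranks runbooks by how many query terms appear in their Markdown body; it assumes the runbooks have been loaded into a dict of path to text, and the paths and function names are hypothetical. A real deployment might swap this for embeddings or a search service.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word tokens, keeping underscores so DEADLINE_EXCEEDED stays whole."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def search_runbooks(runbooks: dict[str, str], query: str, limit: int = 3) -> list[str]:
    """Return up to `limit` runbook paths, best term overlap first."""
    query_terms = tokenize(query)
    scored = []
    for path, body in runbooks.items():
        overlap = len(query_terms & tokenize(body))
        if overlap:
            scored.append((overlap, path))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [path for _, path in scored[:limit]]


# Hypothetical runbook contents.
runbooks = {
    "docs/runbooks/api-pool.md": "Symptom: pool exhausted on api",
    "docs/runbooks/redis.md": "Symptom: DEADLINE_EXCEEDED redis eviction",
}
print(search_runbooks(runbooks, "api pool exhausted"))
```

Exposed as a read-only tool, a function like this lets the incident agent link a runbook without needing any access beyond the docs directory.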
Tool Comparison
| Task | Cursor + Opus 4.7 | Claude Code CLI | Copilot in VS Code | Gemini CLI |
|------|-------------------|-----------------|--------------------|------------|
| Terraform module generation | Best (iterative) | Good (batch) | Good (inline) | Good |
| Ansible playbooks | Best | Good | Good | OK |
| k8s manifests | Best | Good | Best (inline) | OK |
| Incident chat integration | Via plugin | Via SDK (best) | Limited | Limited |
| Log analysis | OK | Best | OK | Good |
| Runbook generation | Good | Best | OK | OK |
| Policy / OPA rule writing | Good | Good | OK | OK |
For inline DevOps edits inside VS Code, GitHub Copilot Free Setup gets you started in minutes.
Security Guardrails
Five non-negotiables for AI in DevOps pipelines:
- No writes without approval. Agents propose, humans apply.
- Secrets never reach the model. Use a gateway that redacts known token patterns before prompts go out.
- OPA/Sentinel on every plan. AI-generated IaC goes through the same policy checks as human-written IaC.
- Audit trail. Log every agent action — input, output, approver, timestamp — to an immutable store.
- Cost cap. Set a daily token budget per agent. A runaway incident agent can burn $200 in an hour.
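The redaction guardrail above can be sketched as a small filter that runs before any log excerpt or config snippet is placed in a prompt. The three patterns here (AWS access key IDs, GitHub personal access tokens, Bearer headers) are assumptions chosen for illustration; a real gateway would carry a longer, maintained list and should not be the only line of defence.

```python
import re

# Assumed token shapes; extend with the credential formats your stack uses.
TOKEN_PATTERNS = [
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                 # AWS access key ID
    re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),              # GitHub personal access token
    re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/-]+=*"),  # Authorization: Bearer ...
]


def redact(text: str) -> str:
    """Replace anything matching a known token pattern with a placeholder."""
    for pattern in TOKEN_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


print(redact("key=AKIAABCDEFGHIJKLMNOP ok"))
print(redact("Authorization: Bearer abc.def-123"))
```

Pattern matching only catches secrets with a recognisable shape, so it works best alongside keeping credentials out of logs in the first place.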
The Cost Reality for India Teams
Typical monthly spend for a 5-person platform team using AI across IaC, incidents, and runbooks:
- Opus 4.7 via API: $40-80/month for assistant tasks
- Cursor Pro seats: 5 x $20 = $100/month (~8,400 INR)
- Claude Code CLI (pay-per-token): $30-60/month
- PagerDuty-Claude integration: compute cost negligible (under $5/month on Lambda)
Compared to the hours saved on incident response, runbook upkeep, and manifest toil, ROI is typically positive within the first month.
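Summing the line items above gives the monthly range for the whole team. The sketch below does that arithmetic; it treats "under $5" as a 0-5 range and uses 84 INR per USD, the rate implied by the $100 to ~8,400 INR conversion in the list. Both are assumptions, so substitute your own figures.

```python
# Monthly line items in USD as (low, high), taken from the list above.
LINE_ITEMS = {
    "Opus 4.7 via API": (40, 80),
    "Cursor Pro seats (5 x $20)": (100, 100),
    "Claude Code CLI": (30, 60),
    "PagerDuty-Claude compute": (0, 5),  # "under $5" treated as 0-5
}
INR_PER_USD = 84  # assumption implied by "$100 ~ 8,400 INR"


def monthly_total(items):
    """Return the (low, high) monthly total in USD across all line items."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


low, high = monthly_total(LINE_ITEMS)
print(f"USD {low}-{high} per month")                                   # USD 170-245 per month
print(f"INR {low * INR_PER_USD:,}-{high * INR_PER_USD:,} per month")  # INR 14,280-20,580 per month
```

At roughly $170-245 a month for five engineers, the break-even point is a few saved engineer-hours, which is the comparison the ROI claim rests on.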
Where to Go Next
- AI for DevOps 2026 — GitHub Actions, Docker, and CI/CD basics
- Claude Code Skills & Superpowers — write custom runbook-aware skills
- Cursor IDE Tutorial India — if you are new to Cursor
- MCP Servers Tutorial — build custom tools for your incident agent
- GitHub Copilot Free Setup — free IDE access for students and OSS maintainers
- Agentic dev: building multi-step agents — production agent patterns that apply here