Can AI write production-ready Terraform?

Yes, but with guardrails. Opus 4.7 produces working Terraform 80-90% of the time for standard resources. Wrap every AI-generated module with `terraform plan`, a policy check (OPA/Sentinel), and a human review gate before apply. Never let an agent run `terraform apply` on production unattended.

How do I integrate Claude with PagerDuty for incident response?

Use PagerDuty's webhook-triggered workflows to fire a Claude Agent SDK runtime that reads the incident context, pulls logs and metrics, proposes hypotheses, and posts them to the incident channel. The agent should never auto-execute remediation on prod — only propose and link to runbooks.

Which AI tool is best for k8s manifest generation?

Cursor with Opus 4.7 for iterative work, Claude Code for scripted batch generation, and GitHub Copilot for inline edits in existing manifests. All three produce valid YAML; the differentiator is how cleanly they respect your cluster conventions (naming, labels, resource limits).

Can AI actually help during a production incident at 3 AM?

Yes, for four things: summarising the current state from logs and metrics, listing recent changes (CI/CD, infra diffs), proposing hypotheses ranked by likelihood, and drafting the customer-facing status page update. It should not execute remediation without a human approver.

How do I prevent AI from leaking secrets in IaC prompts?

Three layers: (1) scrub secrets from repos with git-secrets or Gitleaks before the agent can read them; (2) use variables and `.tfvars` patterns so the agent works with placeholder names; (3) route all prompts through a gateway that redacts known patterns (AWS keys, tokens) with a regex layer before they reach the model.
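The third layer can be sketched in a few lines. This is a minimal illustration, assuming the gateway sees each prompt as a plain string; the pattern list and the `redact` helper are hypothetical and far from exhaustive, so treat them as a starting point rather than a complete rule set.

```python
import re

# Illustrative patterns only; a real gateway needs a maintained, tested rule set.
SECRET_PATTERNS = [
    ("aws_access_key_id", re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")),
    ("github_token", re.compile(r"\bghp_[A-Za-z0-9]{36}\b")),
    ("private_key_block", re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
        re.DOTALL)),
]


def redact(prompt: str) -> tuple[str, list[str]]:
    """Replace text matching a known secret pattern with a placeholder.

    Returns the redacted prompt plus the names of the patterns that fired,
    so the gateway can log that a redaction happened without logging the secret.
    """
    hits = []
    for name, pattern in SECRET_PATTERNS:
        prompt, count = pattern.subn(f"[REDACTED:{name}]", prompt)
        if count:
            hits.append(name)
    return prompt, hits
```

Regex redaction only catches secrets whose shape is known in advance, which is why it sits behind repo scrubbing and placeholder variables rather than replacing them.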

AI for DevOps, Infra & Observability 2026 — India Guide 2026

Terraform/Ansible generation, k8s manifests, PagerDuty + Claude incident chat, SRE runbooks

Last updated: April 19, 2026

SRE and DevOps work is mostly configuration, logs, and decisions under time pressure. AI is uniquely good at the first two and useful (with guardrails) for the third. This guide shows how Indian platform teams are using Claude Opus 4.7, Cursor Agent Mode, and the Claude Agent SDK across the full DevOps surface — IaC generation, Kubernetes, incident command, log analysis, and runbooks.

If you are new to AI for DevOps, start with AI for DevOps 2026. This guide goes deeper on infra and observability specifically.

Key Takeaways

IaC generation is the easy win — Opus 4.7 writes accurate Terraform, Ansible, and Pulumi with a good spec.
Kubernetes manifests generate cleanly when you provide cluster context (labels, namespace, storage class).
PagerDuty + Claude Agent SDK lets you wire AI into incident response without giving it kubectl apply on prod.
Log analysis is where AI pays back on-call cost most — summarise, correlate, rank hypotheses.
Runbooks should be generated from post-incident reviews, stored as Markdown, and indexed by your agent.

The AI-DevOps Stack

+------------------------------------------+
| Incident channel  (Slack / Teams)        |
+------------------------------------------+
         |                     ^
         v                     |
+----------------+   +--------------------+
| PagerDuty hook |   | Claude Agent SDK   |
+----------------+   | (read-only tools)  |
         |           +--------------------+
         v                     ^
+------------------+           |
| Metrics / Logs   |-----------+
| (Prom, Loki, DD) |
+------------------+
         ^
         |
+------------------+
| IaC repo         |  Terraform, Ansible, k8s
| (Git + OPA)      |
+------------------+

The key architectural rule: agents read widely, write narrowly. Read-only access to metrics, logs, and the IaC repo. Write access only to a proposal channel or a PR, never direct to production.

Part 1: IaC Generation with Opus 4.7

Terraform

Terraform is verbose and opinionated. AI does well when you give it one resource at a time with explicit context.

# Claude Code CLI prompt (the same text can be pasted into Cursor Agent Mode)
claude --effort high "Create a Terraform module for an AWS
RDS PostgreSQL 16 instance in ap-south-1 (Mumbai). Requirements:
- Multi-AZ for production, single AZ for staging (via var.env)
- 200 GB gp3 storage, encrypted with customer-managed KMS
- Private subnet only, security group allows 5432 from ECS task role
- Backup retention 30 days prod, 7 days staging
- Output the connection string format (no credentials)
Use Terraform 1.9+ syntax with locals for tags."

Expected output is a module with main.tf, variables.tf, outputs.tf, and versions.tf. Run:

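terraform fmt && terraform validate && terraform plan -out=plan.tfplan
# The saved plan is a binary file; convert it to JSON before the OPA/Sentinel policy check
terraform show -json plan.tfplan > plan.json
# "data.terraform.deny" assumes your policies live in a package named terraform; adjust to your rules
opa eval -i plan.json -d policies/ "data.terraform.deny"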

Never skip the policy check on AI-generated IaC.
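To make the policy step concrete, here is a rough Python stand-in for what a policy evaluates. It is not OPA or Sentinel, only an illustration of the plan JSON that `terraform show -json` emits, where each entry in `resource_changes` carries an `address`, a `type`, and a `change` with `actions` and `after`. The two rules and the `find_violations` name are assumptions chosen for the RDS example above.

```python
def find_violations(plan: dict) -> list[str]:
    """Return human-readable policy violations for a Terraform plan in JSON form."""
    violations = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_db_instance":
            continue
        change = rc.get("change", {})
        actions = change.get("actions", [])
        after = change.get("after") or {}  # "after" is null when a resource is deleted

        # Rule 1 (hypothetical): a plan must never delete a database.
        # A replace shows up as ["delete", "create"], so it is caught here too.
        if "delete" in actions:
            violations.append(f"{rc['address']}: plan deletes a database")

        # Rule 2 (hypothetical): every created or updated instance is encrypted at rest.
        if ("create" in actions or "update" in actions) \
                and after.get("storage_encrypted") is not True:
            violations.append(f"{rc['address']}: storage_encrypted is not true")
    return violations
```

In CI, load the plan with `json.load` and fail the job when the returned list is non-empty. In a real pipeline the same rules belong in Rego or Sentinel so they sit alongside your other policies.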

Ansible

Ansible playbooks generate well because the YAML structure is constrained.

# Prompt: "Write an Ansible playbook to harden a fresh Ubuntu 24.04 LTS VM
#  on AWS ap-south-1 — CIS Level 1 baseline, fail2ban, automatic security 
#  updates, UFW with SSH-only, audit logging to CloudWatch."

# AI output (excerpt):
- name: Harden Ubuntu 24.04 baseline
  hosts: all
  become: true
  tasks:
    - name: Install security packages
      apt:
        name:
          - fail2ban
          - unattended-upgrades
          - auditd
          - ufw
        state: present
        update_cache: yes

    - name: Configure UFW default deny incoming
      ufw:
        default: deny
        direction: incoming

    - name: Allow SSH from trusted CIDR
      ufw:
        rule: allow
        port: 22
        proto: tcp
        src: "{{ trusted_cidr }}"

    - name: Enable UFW
      ufw:
        state: enabled
        logging: low

Always test on a disposable VM first. Cursor's agent mode can do this in a sandbox if you give it ephemeral AWS credentials.

Pulumi

Pulumi (TypeScript or Python) produces cleaner AI output than Terraform because the model can use real type hints. Example:

// Prompt: "Pulumi TypeScript for a three-tier app: ALB -> ECS Fargate 
// -> RDS Postgres, all in a dedicated VPC in ap-south-1."

import * as aws from "@pulumi/aws";
import * as awsx from "@pulumi/awsx";

const vpc = new awsx.ec2.Vpc("app-vpc", {
    cidrBlock: "10.20.0.0/16",
    numberOfAvailabilityZones: 2,
    subnetStrategy: "Auto",
    tags: { Project: "app", Env: "prod" },
});

const cluster = new aws.ecs.Cluster("app-cluster", {
    settings: [{ name: "containerInsights", value: "enabled" }],
});

// ...ALB, service, DB follow

Part 2: Kubernetes Manifest Generation

Give the agent your cluster context file first:

# .cluster-context.yaml (check this into your infra repo)
cluster:
  name: prod-ap-south-1
  region: ap-south-1
  kubernetes: "1.31"  # quoted so YAML keeps it a string (an unquoted 1.30 would parse as 1.3)
namespaces:
  default_labels:
    team: platform
    cost_center: engineering
storage_classes:
  default: gp3-encrypted
  fast: io2-encrypted
ingress:
  class: nginx
  cert_manager: letsencrypt-prod
defaults:
  resource_limits:
    cpu: 500m
    memory: 512Mi
  security_context:
    runAsNonRoot: true
    readOnlyRootFilesystem: true

Then:

claude "Generate a k8s Deployment, Service, and Ingress for a
Next.js 15 app. Image: ghcr.io/org/app:v1.2.3. Port: 3000. 
Replicas: 3 with HPA 3-10 on 70% CPU. 
Host: app.company.in. Use cluster defaults from .cluster-context.yaml."

The agent produces manifests that respect your defaults. Run:

kubectl apply --dry-run=client -f generated/ && \
  kubeconform -summary generated/ && \
  kube-score score generated/*.yaml

All three should be part of your CI pipeline before any AI-generated manifest reaches the cluster.
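Generic linters do not know your own conventions, so a small extra check against the defaults in `.cluster-context.yaml` can help. The sketch below assumes the manifest and the `defaults` section have already been parsed into dictionaries (for example with a YAML library); `check_deployment` is a hypothetical helper, not part of any of the tools above.

```python
def check_deployment(manifest: dict, defaults: dict) -> list[str]:
    """Compare each container in a Deployment against the cluster defaults."""
    problems = []
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    pod_sc = pod_spec.get("securityContext", {})

    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")

        # Every limit named in the defaults must be set on the container.
        limits = container.get("resources", {}).get("limits", {})
        for key in defaults["resource_limits"]:
            if key not in limits:
                problems.append(f"{name}: missing resources.limits.{key}")

        # Container-level securityContext overrides pod-level settings.
        effective_sc = {**pod_sc, **container.get("securityContext", {})}
        for key, wanted in defaults["security_context"].items():
            if effective_sc.get(key) != wanted:
                problems.append(f"{name}: securityContext.{key} should be {wanted}")
    return problems
```

An empty list means the manifest matches the conventions; anything else is worth failing the pipeline on, since the point of the context file is that generated manifests follow it.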

Part 3: Incident Response with PagerDuty + Claude

The pattern: PagerDuty webhook fires a serverless function, which spins up a Claude agent with read-only tools and posts its findings back to the incident channel.

Webhook handler

# Lambda / Cloud Run handler (illustrative sketch).
# ClaudeAgent, approval_policy, and the four tool objects are placeholder names
# for your own wrapper around the Claude Agent SDK; confirm the real class and
# option names in the SDK docs. `event` is assumed to be already unwrapped into
# a dict whose "incident" entry also carries the Slack channel ID.
import os
from claude_agent_sdk import ClaudeAgent
from slack_sdk.web.async_client import AsyncWebClient

async def handle_pagerduty(event):
    incident = event["incident"]

    agent = ClaudeAgent(
        model="claude-opus-4-7",
        tools=[
            prometheus_query_tool,   # read-only
            loki_log_search_tool,    # read-only
            github_recent_deploys,   # read-only
            k8s_describe_pods,       # read-only; NO apply/exec
        ],
        # Every write requires approval, so nothing runs unattended.
        approval_policy={"any_write": "always"},
    )

    summary = await agent.run(
        task=f"""Incident: {incident['title']}
        Service: {incident['service']['summary']}
        Urgency: {incident['urgency']}

        1. Pull the last 15 minutes of metrics and logs.
        2. List deploys in the last 6 hours.
        3. Correlate. Rank the top 3 likely causes.
        4. Draft a 3-line status for the incident channel.
        5. Link the relevant runbook if one exists.

        Do NOT propose commands to run. Humans execute.""",
        max_steps=30,
    )

    # Async client so the Slack call does not block the event loop.
    slack = AsyncWebClient(token=os.environ["SLACK_TOKEN"])
    await slack.chat_postMessage(
        channel=incident["channel"],
        text=summary,  # plain-text fallback for notifications
        blocks=format_incident_blocks(summary),
    )

In practice this cuts the time to identify a root cause by 30-50% on familiar incident types. It is of little use on novel incidents, but that is what humans are for.
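The handler calls `format_incident_blocks` without defining it. One possible shape is sketched below, assuming the agent's summary is a plain string. It relies on two Slack Block Kit limits: a section block's text holds at most 3,000 characters and a message holds at most 50 blocks. The header wording is an assumption.

```python
SECTION_TEXT_LIMIT = 3000  # Slack's cap on the text of a single section block
MAX_BLOCKS = 50            # Slack's cap on blocks in a single message


def format_incident_blocks(summary: str) -> list[dict]:
    """Turn the agent's summary into Slack Block Kit blocks."""
    summary = summary.strip() or "No findings returned."

    # Label the post clearly so nobody mistakes a draft for a confirmed diagnosis.
    blocks = [{
        "type": "header",
        "text": {"type": "plain_text", "text": "AI triage draft (unverified)"},
    }]

    # Slice long summaries into section-sized chunks.
    for start in range(0, len(summary), SECTION_TEXT_LIMIT):
        chunk = summary[start:start + SECTION_TEXT_LIMIT]
        blocks.append({"type": "section", "text": {"type": "mrkdwn", "text": chunk}})

    return blocks[:MAX_BLOCKS]
```

Slicing on a fixed width can split a sentence across two blocks; splitting on paragraph boundaries is a reasonable refinement once the basic flow works.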

What NOT to give the incident agent

Write access to production — never.
kubectl exec or SSH — no.
Ability to post to customer-facing status pages without human approval.
Credentials for any system it can change.

Read-only is the whole guardrail.
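If a tool wraps kubectl, the read-only rule can be enforced in code as well as through RBAC. The following is a minimal sketch with a hypothetical `is_read_only_kubectl` helper; it is deliberately strict and fails closed on anything it does not recognise.

```python
READ_ONLY_VERBS = {"get", "describe", "logs", "top"}
BLOCKED_RESOURCES = {"secret", "secrets"}  # reading these would leak credentials


def is_read_only_kubectl(argv: list[str]) -> bool:
    """Decide whether a kubectl invocation (without the leading 'kubectl') is safe.

    The verb must be the first argument. Putting flags before the verb is
    rejected rather than parsed, so the check fails closed.
    """
    if not argv or argv[0] not in READ_ONLY_VERBS:
        return False
    for token in argv[1:]:
        # Covers both "get secrets" and "get secret/db-password".
        resource = token.split("/", 1)[0].lower()
        if resource in BLOCKED_RESOURCES:
            return False
    return True
```

This complements a read-only Kubernetes Role for the agent's service account rather than replacing it; the cluster should refuse writes even if the wrapper has a bug.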

Part 4: Log Analysis

AI is strongest on logs when you narrow the question.

# Weak prompt
"Why is the app slow?"

# Strong prompt
"Given these 5 minutes of Loki logs for service=api, 
identify the top 3 error patterns by frequency, the 
likely endpoint causing each, and the first timestamp 
where error rate crossed 1%. Do not speculate on causes."
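The strong prompt asks for things that can also be computed deterministically, which gives you a way to check the agent's answer. Below is a small sketch, assuming each log record is an `(iso_timestamp, level, message)` tuple and that "error rate" means errors divided by all lines within a one-minute bucket; both are assumptions, as are the function names.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime


def normalise(message: str) -> str:
    """Collapse numbers so 'after 30 ms' and 'after 45 ms' group together."""
    return re.sub(r"\d+", "<n>", message)


def summarise(records, threshold=0.01):
    """Return (top 3 error patterns, first minute where error rate > threshold)."""
    patterns = Counter()
    totals = defaultdict(int)
    errors = defaultdict(int)

    for timestamp, level, message in records:
        minute = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
        totals[minute] += 1
        if level.lower() == "error":
            errors[minute] += 1
            patterns[normalise(message)] += 1

    # Walk the minutes in order and stop at the first one over the threshold.
    first_breach = next(
        (m for m in sorted(totals) if errors[m] / totals[m] > threshold), None
    )
    return patterns.most_common(3), first_breach
```

Comparing this output with the agent's summary is a cheap sanity check: if the two disagree on the top pattern or the breach time, treat the summary with suspicion.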

With the Claude Agent SDK memory tool, you can teach the agent your known log patterns across sessions:

# First-time setup — the agent learns and remembers
await agent.run(task="""
Remember these log patterns for service=api:
- 'pool exhausted' usually means RDS connection leak in /orders endpoint
- 'DEADLINE_EXCEEDED' usually means Redis eviction under load
- 'circuit open' is the Kong upstream to /payments

Store these under memory key 'logs/api/known_patterns'.
""")

# Subsequent sessions benefit without re-teaching

Part 5: Runbooks

Runbooks should be generated from post-incident reviews, not written from scratch. After every P1/P2 incident:

claude "Read incident-2026-04-18-api-outage.md and the
Slack timeline attached. Produce a runbook following our 
template in docs/runbooks/TEMPLATE.md. Include:
- Symptom fingerprint (3-5 log patterns)
- First-response checks (metrics dashboards, recent deploys)
- Escalation criteria
- Rollback steps
- Mitigation steps with explicit commands
- Post-incident cleanup"

Then store it as Markdown in a repo that your agent indexes. The incident agent from Part 3 can search it in future incidents.
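Indexing does not have to be elaborate. As a rough sketch, assume each runbook writes its symptom fingerprints as inline code (backtick-quoted phrases) in a Markdown file; that convention and the two helper names are assumptions, not something the template is known to require.

```python
import re
from pathlib import Path

INLINE_CODE = re.compile(r"`([^`\n]+)`")


def build_index(runbook_dir: str) -> dict[str, set[str]]:
    """Map each backtick-quoted phrase to the runbook files that mention it."""
    index: dict[str, set[str]] = {}
    for path in sorted(Path(runbook_dir).glob("*.md")):
        if path.name == "TEMPLATE.md":  # the template is not a real runbook
            continue
        for phrase in INLINE_CODE.findall(path.read_text(encoding="utf-8")):
            index.setdefault(phrase.lower(), set()).add(path.name)
    return index


def match_runbooks(index: dict[str, set[str]], log_line: str) -> list[str]:
    """Return runbooks whose fingerprint phrases appear in the log line."""
    line = log_line.lower()
    found: set[str] = set()
    for phrase, names in index.items():
        if phrase in line:
            found |= names
    return sorted(found)
```

Commands quoted in a runbook also land in the index, but they rarely appear in log lines, so they seldom cause false matches. A search tool exposed to the agent could wrap `match_runbooks` directly.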

Tool Comparison

| Task | Cursor + Opus 4.7 | Claude Code CLI | Copilot in VS Code | Gemini CLI |
|------|-------------------|-----------------|--------------------|------------|
| Terraform module generation | Best (iterative) | Good (batch) | Good (inline) | Good |
| Ansible playbooks | Best | Good | Good | OK |
| k8s manifests | Best | Good | Best (inline) | OK |
| Incident chat integration | Via plugin | Via SDK (best) | Limited | Limited |
| Log analysis | OK | Best | OK | Good |
| Runbook generation | Good | Best | OK | OK |
| Policy / OPA rule writing | Good | Good | OK | OK |

For inline DevOps edits inside VS Code, GitHub Copilot Free Setup gets you started in minutes.

Security Guardrails

Five non-negotiables for AI in DevOps pipelines:

No writes without approval. Agents propose, humans apply.
Secrets never reach the model. Use a gateway that redacts known token patterns before prompts go out.
OPA/Sentinel on every plan. AI-generated IaC goes through the same policy checks as human-written IaC.
Audit trail. Log every agent action — input, output, approver, timestamp — to an immutable store.
Cost cap. Set a daily token budget per agent. A runaway incident agent can burn $200 in an hour.
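The cost cap is the easiest of the five to enforce in code. Here is a minimal in-memory sketch, assuming you can read the token count of each model call from the API response; `DailyTokenBudget` and `BudgetExceeded` are hypothetical names, and a real deployment would keep the counter in shared storage so that parallel agent runs see the same total.

```python
from datetime import date
from typing import Optional


class BudgetExceeded(RuntimeError):
    """Raised when a call would push the agent past its daily token cap."""


class DailyTokenBudget:
    def __init__(self, limit_tokens: int):
        self.limit = limit_tokens
        self.day: Optional[date] = None
        self.used = 0

    def charge(self, tokens: int, today: Optional[date] = None) -> None:
        """Record usage, or raise before the cap is crossed."""
        today = today or date.today()
        if today != self.day:  # the first call of a new day resets the counter
            self.day, self.used = today, 0
        if self.used + tokens > self.limit:
            raise BudgetExceeded(
                f"{self.used + tokens} tokens would exceed the daily cap of {self.limit}"
            )
        self.used += tokens
```

Call `charge` after every model response and let `BudgetExceeded` stop the agent loop; the on-call engineer can then decide whether to raise the cap.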

The Cost Reality for India Teams

Typical monthly spend for a 5-person platform team using AI across IaC, incidents, and runbooks:

Opus 4.7 via API: $40-80/month for assistant tasks
Cursor Pro seats: 5 x $20 = $100/month (~8,400 INR)
Claude Code CLI (pay-per-token): $30-60/month
PagerDuty-Claude integration: compute cost negligible (under $5/month on Lambda)

Compared to the hours saved on incident response, runbook upkeep, and manifest toil, ROI is typically positive within the first month.
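Adding up the line items gives the overall range. The arithmetic below takes "under $5" as a 0-5 range and uses 84 INR per USD, the rate implied by the $100 and 8,400 INR figure above; both are assumptions, and the rate in particular should be updated before you rely on the INR totals.

```python
INR_PER_USD = 84  # implied by $100 being quoted as ~8,400 INR; update to the current rate

# (low, high) monthly spend in USD for a 5-person platform team
monthly_usd = {
    "Opus 4.7 via API": (40, 80),
    "Cursor Pro seats (5 x $20)": (100, 100),
    "Claude Code CLI": (30, 60),
    "PagerDuty-Claude compute": (0, 5),
}

low = sum(lo for lo, _ in monthly_usd.values())
high = sum(hi for _, hi in monthly_usd.values())

print(f"${low}-{high}/month (~{low * INR_PER_USD:,}-{high * INR_PER_USD:,} INR)")
# prints "$170-245/month (~14,280-20,580 INR)"
```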

Where to Go Next

AI for DevOps 2026 — GitHub Actions, Docker, and CI/CD basics
Claude Code Skills & Superpowers — write custom runbook-aware skills
Cursor IDE Tutorial India — if you are new to Cursor
MCP Servers Tutorial — build custom tools for your incident agent
GitHub Copilot Free Setup — free IDE access for students and OSS maintainers
Agentic dev: building multi-step agents — production agent patterns that apply here


