

Incident Response Maturity: Leveraging Tech Proactively
Strengthen your incident response with observability, AI, and automation
KubeCon + CloudNativeCon North America 2025 in Atlanta felt different.
Last year, everyone was demoing “AI assistants for your Kubernetes cluster”: lots of chat, not a lot of actually doing anything. This year, AI wasn’t just answering questions. It was taking actions:
If you care about reliability, that’s both exciting and mildly terrifying.
At Rootly, we live at the intersection of incidents, automations, and humans trying to get some sleep. Here’s how we’d summarize the real reliability lessons from KubeCon 2025, filtered through an on-call lens.
Kubernetes already had “operators” as a concept. Now AI wants that job title too.
A lot of demos looked roughly like this flow:
The agent spots an anomaly, proposes an action like “roll back checkout-service from v42 → v41” or “drain k8s-gpu-17 and reschedule pods; suspected hardware issue,” a human replies with a :thumbsup:, and the agent carries it out. The keyword there is policy.
SREs used to write runbooks like:
“If 5xx > 5% for 5 minutes, try rolling back the last deployment.”
Now the work looks more like:
# ai-change-policy.yaml
service: checkout
allowed_actions:
  - type: rollback
    max_versions_back: 1
    require_approval: true
    approvers:
      - team: checkout-oncall
  - type: scale
    min_replicas: 3
    max_replicas: 50
    require_approval: false
safety_checks:
  - name: must_not_reduce_availability_below
    slo: checkout-availability
    threshold: 99.0
logging:
  audit: true
  incident_tag: "ai-initiated-change"

Instead of telling a human how to fix things line-by-line, you’re telling an AI operator what it’s allowed to touch and under which conditions.
What this means for reliability
This is exactly where Rootly-style audit trails, incident timelines, and approval flows become non-negotiable. If a robot is going to touch prod, you’d better be able to answer “who did what and why?” in under 10 seconds.
Once again, KubeCon was full of platform engineering stories. The change this year: those platforms are being rebuilt around AI workloads and reliability, not just “make devs happy.”
The emerging pattern:
Historically, SLOs lived in Prometheus/Grafana and incidents lived in a random Notion folder. Now teams are wiring them directly into the platform API.
For example, instead of:
“Oh, we should probably define SLOs for this service someday.”
You get:
# service.yaml
name: checkout
owner: team-checkout
runtime: "kubernetes"
slo:
  availability:
    objective: 99.9
    window: "30d"
  alert_policies:
    - type: burn_rate
      fast_burn: "1h > 14d"
      slow_burn: "6h > 30d"
incident:
  channel: "#incidents-checkout"
  runbook: "https://runbooks.internal/checkout"
  severity_mapping:
    - condition: "slo_breach"
severity: "SEV-2"Now when you spin up a service, you must define:
No SLO? No owner? No deploy.
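How that gate gets enforced varies, but the shape is simple: a check in CI or an admission webhook that refuses to ship anything missing the required fields. A minimal sketch, assuming a hypothetical deploy-gate config (the file name and fields are illustrative, not any specific tool’s schema):

# deploy-gate.yaml (hypothetical; field names are illustrative)
required_fields:
  - path: owner                        # someone is on the hook
  - path: slo.availability.objective   # a reliability target exists
  - path: incident.channel             # incidents have somewhere to go
on_missing: reject_deploy              # CI or an admission webhook blocks the rollout

Whether this lives in your CI pipeline or in the cluster matters less than the fact that the contract is checked automatically instead of living in tribal knowledge.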
What this means for reliability
“Chat with your logs” is cute. “Let AI drive your incident response using everything observability knows” is useful.
The more serious KubeCon stories were about:
Imagine this in a real incident:
AI: “Error rate for checkout increased 8x starting 02:07.
Correlated change: new config applied to rate-limiter at 02:06.
Similar incident: SEV-1 on 2025-01-14. Mitigation last time: revert config and flush cache.”
That’s not science fiction. It’s mostly a question of data plumbing and how good your past incident/retrospective data is.
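“Good data” here mostly means structured, machine-readable incident records instead of prose buried in a doc. A hedged sketch of the kind of record that makes the correlation above possible, reusing the details from the example (field names are illustrative):

# past incident record (illustrative fields)
id: "INC-2025-0114"
severity: "SEV-1"
services: [checkout, rate-limiter]
started_at: "2025-01-14T02:03:00Z"
suspected_cause: "rate-limiter config change"
mitigation: "revert config and flush cache"
tags: [config-change, cache]

If your retrospectives already capture this, “similar incident” suggestions are a query away; if they’re free-form prose, they mostly aren’t.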
AI can’t magically fix messy inputs, like metrics nobody can explain (what does custom_metric_17 even measure?).
If you want AI to be helpful during an outage, you need:
What this means for reliability
A lot of hallway conversations in Atlanta were about a simple pain: “Our GPUs are either idle or on fire, and both are expensive.”
Running AI workloads on Kubernetes at scale introduces new reliability dimensions:
So instead of just:
slo:
  availability: 99.9

You’ll start to see:
slo:
  availability: 99.9
  latency_p95_ms: 800
  ttfb_p95_ms: 300
  gpu_queue_time_p95_s: 10
  feature_freshness_max_delay_s: 60
What this means for reliability
If this sounds like overkill, look at how many AI-native companies are already tagging incidents with things like model="gpt-foo" and tier="realtime". That’s where the world is heading.
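Concretely, that tagging can just be extra labels on the incident record, so you can slice reliability data by model and tier later. A small sketch (label names beyond model and tier are assumptions):

incident: "INC-2025-087"            # hypothetical incident ID
labels:
  model: "gpt-foo"                  # which model/version was serving traffic
  tier: "realtime"                  # realtime and batch paths fail differently
  gpu_pool: "a100-east"             # assumed label; handy for capacity postmortems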
Giving an AI agent kubectl access is… a choice.
A recurring KubeCon theme: you can’t separate security, governance, and reliability once autonomous agents are in the mix.
New failure modes you’ll see:
Preventing this isn’t just about “better AI.” It’s about:
Something like:
actor: ai-operator/orca-ops
time: "2025-04-03T02:13:21Z"
action: "rollback"
resource: "deployment/checkout-service"
from_version: "v42"
to_version: "v41"
reason: "slo_burn_rate_exceeded"
incident_id: "INC-2025-042"
approval:
  required: true
  approved_by: "@oncall-sre"
What this means for reliability
Here’s how we’d turn the KubeCon noise into a concrete plan.
Pick one or two low-risk actions for a single service, for example:
Define:
Start narrow. Widen the radius slowly.
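A starter policy can be a trimmed-down version of the ai-change-policy.yaml sketch above: one service, one action, and a human approval on every change. The restart_deployment action type below is illustrative:

# starter policy: one low-risk action, always a human in the loop
service: checkout
allowed_actions:
  - type: restart_deployment        # illustrative low-risk action
    require_approval: true
    approvers:
      - team: checkout-oncall
logging:
  audit: true
  incident_tag: "ai-initiated-change"

Once the audit trail shows the agent making good calls, widen the policy one action at a time.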
If you maintain a platform:
Your future self will thank you when you’re trying to debug a 3am incident and everything has a clear owner and SLO.
Boring but huge:
You’re not just writing history; you’re training the next generation of AI responders.
For your AI-heavy paths, define SLIs like:
Wire those into your existing SLO / incident workflow. A model with stale features is just as much a reliability problem as a 500.
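In practice, wiring them in means the AI-specific SLIs get the same treatment as availability: an objective, a severity, and a channel. A hedged sketch reusing the field names from the examples above (the condition name is an assumption):

slo:
  gpu_queue_time_p95_s: 10
  feature_freshness_max_delay_s: 60
incident:
  channel: "#incidents-checkout"
  severity_mapping:
    - condition: "feature_freshness_breach"   # assumed condition name
      severity: "SEV-3"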
Before you let an AI touch prod, answer:
If you can’t write that down in a simple policy file, you’re not ready to give it real power.
This isn’t a stealth pitch, but it’s the lens we see the world through:
That’s exactly the space Rootly lives in: turn chaos into structured data and repeatable, automated workflows — whether the actor is a human, a bot, or something that just got announced on the KubeCon keynote stage.